Columnar Operations

High-performance columnar storage and operations in Stoolap

Columnar Operations

Stoolap uses columnar storage for CTEs and analytical queries, providing significant performance and memory efficiency improvements.

Overview

Columnar storage organizes data by columns rather than rows, offering several advantages:

  • Better compression ratios
  • Improved cache locality for analytical queries
  • Efficient aggregation operations
  • Reduced memory allocations

Memory Efficiency

Our columnar implementation achieves dramatic memory reductions:

-- Benchmark: 10,000 row CTE with multiple aggregations
-- Before: 16.7GB allocations
-- After: 17MB allocations (99.90% reduction)

WITH large_dataset AS (
    SELECT id, category, value
    FROM generate_series(1, 10000) AS id
    CROSS JOIN categories
)
SELECT category, COUNT(*), SUM(value), AVG(value)
FROM large_dataset
GROUP BY category;

Columnar Result Storage

When CTEs are materialized, they use ColumnarResult which stores data in column arrays:

// Traditional row storage (inefficient)
rows := []map[string]ColumnValue{
    {"id": 1, "name": "Alice", "score": 95},
    {"id": 2, "name": "Bob", "score": 87},
}

// Columnar storage (efficient)
columns := map[string][]ColumnValue{
    "id":    {1, 2},
    "name":  {"Alice", "Bob"},
    "score": {95, 87},
}

Optimized Aggregations

Columnar storage enables efficient aggregation operations:

Supported Columnar Operations

  • COUNT(*) - Direct row count without iteration
  • COUNT(column) - Non-null count with columnar scan
  • COUNT(DISTINCT column) - Distinct values using hash set
  • MIN(column) - Single column scan for minimum
  • MAX(column) - Single column scan for maximum
  • SUM(column) - Columnar addition with type optimization
  • AVG(column) - Combined sum and count operation

Example Performance

-- Traditional row-based execution
WITH sales_data AS (
    SELECT product_id, amount FROM sales
)
SELECT 
    COUNT(*) as total_sales,
    SUM(amount) as revenue,
    AVG(amount) as avg_sale
FROM sales_data;
-- Time: 250ms for 1M rows

-- Columnar execution (automatic for CTEs)
-- Time: 45ms for 1M rows (5.5x faster)

Hash-Based IN Subqueries

Subqueries using IN/NOT IN are optimized with hash tables:

-- O(n×m) with array lookup (slow)
DELETE FROM orders 
WHERE customer_id IN (
    SELECT id FROM customers WHERE country = 'US'
);

-- O(n) with hash lookup (fast)
-- Automatically optimized by Stoolap

Performance improvements:

  • Small lists (10 items): 293x faster
  • Medium lists (100 items): 857x faster
  • Large lists (1000 items): 2048x faster

Array-Based Row Storage

Throughout the query executor, we use arrays instead of maps for row storage:

-- Efficient array-based storage for:
- JOIN operations (HashJoinResult)
- GROUP BY aggregations (ArrayAggregateResult)
- HAVING clause evaluation
- Projected results

-- Result: 99.90% reduction in memory allocations

Best Practices

Use CTEs for Large Datasets

-- Good: CTE materializes once with columnar storage
WITH filtered_data AS (
    SELECT * FROM large_table WHERE category = 'A'
)
SELECT COUNT(*), AVG(value) FROM filtered_data;

-- Less efficient: Subquery may execute multiple times
SELECT COUNT(*), AVG(value) 
FROM (SELECT * FROM large_table WHERE category = 'A') t;

Leverage Columnar Aggregations

-- Efficient: Direct columnar operations
WITH summary AS (
    SELECT region, product, sales_amount
    FROM sales_facts
)
SELECT 
    region,
    COUNT(*) as transactions,
    SUM(sales_amount) as total,
    AVG(sales_amount) as average
FROM summary
GROUP BY region;

-- These aggregations run directly on columnar data

Optimize IN Subqueries

-- Good: Subquery result is hashed
UPDATE products 
SET discontinued = true
WHERE id IN (
    SELECT product_id 
    FROM inventory 
    WHERE quantity = 0 AND last_sale < '2023-01-01'
);

-- Even better for static lists: Use VALUES
UPDATE products
SET discontinued = true  
WHERE id IN (VALUES (1), (2), (3), (4), (5));

Implementation Details

ColumnarResult Structure

The ColumnarResult type stores data efficiently:

  • Column names array for fast lookup
  • Parallel column arrays for data storage
  • Reusable row buffer (with proper isolation for aggregations)
  • O(1) column access by name or index

Memory Management

  • Pre-allocated column arrays with growth strategy
  • Object pooling for temporary allocations
  • Automatic garbage collection of old versions
  • Zero-copy operations where possible

Future Optimizations

Planned improvements include:

  • SIMD acceleration for columnar operations
  • Compressed columnar storage
  • Predicate pushdown for columnar filtering
  • Parallel aggregation execution
  • Adaptive query execution based on data characteristics