Columnar Operations
High-performance columnar storage and operations in Stoolap
Columnar Operations
Stoolap uses columnar storage for CTEs and analytical queries, providing significant performance and memory efficiency improvements.
Overview
Columnar storage organizes data by columns rather than rows, offering several advantages:
- Better compression ratios
- Improved cache locality for analytical queries
- Efficient aggregation operations
- Reduced memory allocations
Memory Efficiency
Our columnar implementation achieves dramatic memory reductions:
-- Benchmark: 10,000 row CTE with multiple aggregations
-- Before: 16.7GB allocations
-- After: 17MB allocations (99.90% reduction)
WITH large_dataset AS (
SELECT id, category, value
FROM generate_series(1, 10000) AS id
CROSS JOIN categories
)
SELECT category, COUNT(*), SUM(value), AVG(value)
FROM large_dataset
GROUP BY category;
Columnar Result Storage
When CTEs are materialized, they use ColumnarResult
which stores data in column arrays:
// Traditional row storage (inefficient)
rows := []map[string]ColumnValue{
{"id": 1, "name": "Alice", "score": 95},
{"id": 2, "name": "Bob", "score": 87},
}
// Columnar storage (efficient)
columns := map[string][]ColumnValue{
"id": {1, 2},
"name": {"Alice", "Bob"},
"score": {95, 87},
}
Optimized Aggregations
Columnar storage enables efficient aggregation operations:
Supported Columnar Operations
COUNT(*)
- Direct row count without iterationCOUNT(column)
- Non-null count with columnar scanCOUNT(DISTINCT column)
- Distinct values using hash setMIN(column)
- Single column scan for minimumMAX(column)
- Single column scan for maximumSUM(column)
- Columnar addition with type optimizationAVG(column)
- Combined sum and count operation
Example Performance
-- Traditional row-based execution
WITH sales_data AS (
SELECT product_id, amount FROM sales
)
SELECT
COUNT(*) as total_sales,
SUM(amount) as revenue,
AVG(amount) as avg_sale
FROM sales_data;
-- Time: 250ms for 1M rows
-- Columnar execution (automatic for CTEs)
-- Time: 45ms for 1M rows (5.5x faster)
Hash-Based IN Subqueries
Subqueries using IN/NOT IN are optimized with hash tables:
-- O(n×m) with array lookup (slow)
DELETE FROM orders
WHERE customer_id IN (
SELECT id FROM customers WHERE country = 'US'
);
-- O(n) with hash lookup (fast)
-- Automatically optimized by Stoolap
Performance improvements:
- Small lists (10 items): 293x faster
- Medium lists (100 items): 857x faster
- Large lists (1000 items): 2048x faster
Array-Based Row Storage
Throughout the query executor, we use arrays instead of maps for row storage:
-- Efficient array-based storage for:
- JOIN operations (HashJoinResult)
- GROUP BY aggregations (ArrayAggregateResult)
- HAVING clause evaluation
- Projected results
-- Result: 99.90% reduction in memory allocations
Best Practices
Use CTEs for Large Datasets
-- Good: CTE materializes once with columnar storage
WITH filtered_data AS (
SELECT * FROM large_table WHERE category = 'A'
)
SELECT COUNT(*), AVG(value) FROM filtered_data;
-- Less efficient: Subquery may execute multiple times
SELECT COUNT(*), AVG(value)
FROM (SELECT * FROM large_table WHERE category = 'A') t;
Leverage Columnar Aggregations
-- Efficient: Direct columnar operations
WITH summary AS (
SELECT region, product, sales_amount
FROM sales_facts
)
SELECT
region,
COUNT(*) as transactions,
SUM(sales_amount) as total,
AVG(sales_amount) as average
FROM summary
GROUP BY region;
-- These aggregations run directly on columnar data
Optimize IN Subqueries
-- Good: Subquery result is hashed
UPDATE products
SET discontinued = true
WHERE id IN (
SELECT product_id
FROM inventory
WHERE quantity = 0 AND last_sale < '2023-01-01'
);
-- Even better for static lists: Use VALUES
UPDATE products
SET discontinued = true
WHERE id IN (VALUES (1), (2), (3), (4), (5));
Implementation Details
ColumnarResult Structure
The ColumnarResult
type stores data efficiently:
- Column names array for fast lookup
- Parallel column arrays for data storage
- Reusable row buffer (with proper isolation for aggregations)
- O(1) column access by name or index
Memory Management
- Pre-allocated column arrays with growth strategy
- Object pooling for temporary allocations
- Automatic garbage collection of old versions
- Zero-copy operations where possible
Future Optimizations
Planned improvements include:
- SIMD acceleration for columnar operations
- Compressed columnar storage
- Predicate pushdown for columnar filtering
- Parallel aggregation execution
- Adaptive query execution based on data characteristics