Vectorized Execution
Vectorized Execution in Stoolap
Overview
Stoolap’s vectorized execution engine is designed to accelerate SQL query processing by operating on data in columnar batches rather than row-by-row, allowing for better CPU utilization and cache efficiency. This document explains the architecture and components of the vectorized execution system.
Key Concepts
Batch Processing
The vectorized execution model processes data in batches to improve memory locality and enable SIMD (Single Instruction, Multiple Data) operations. Instead of processing one row at a time, the engine processes multiple rows simultaneously using array-based operations.
Key characteristics:
- Default batch size: 32,768 rows (
MaxBatchSize = 1024 * 32
) - Parallelization threshold: 5,000 rows (
ParallelThreshold
) - Optimal processing block: 64 elements (
OptimalBlockSize
)
Columnar Batch Structure
For efficient vector processing, data from the row-based storage is organized into columnar batches:
type Batch struct {
// Size is the number of rows in the batch
Size int
// Column arrays organized by data type for efficient processing
IntColumns map[string][]int64
FloatColumns map[string][]float64
StringColumns map[string][]string
BoolColumns map[string][]bool
TimeColumns map[string][]time.Time
// Null indicators for each column (true = NULL)
NullBitmaps map[string][]bool
// List of column names in the batch (for ordered processing)
ColumnNames []string
}
Memory Management
The engine uses object pools to minimize garbage collection overhead:
- Batch pool for reusing batch structures
- Type-specific buffer pools for column arrays
- Selection vector pool for filtered data operations
Architecture Components
1. Vectorized Engine
The Engine
struct in the vectorized
package is the central component responsible for:
- Analyzing queries to determine if they can be vectorized
- Extracting table and column information
- Converting row-based input to columnar batches
- Applying vectorized operations (filtering, projection)
- Converting back to row-based results
type Engine struct {
// Dependencies
functionRegistry contract.FunctionRegistry
}
2. Processors
Processors are composable components that operate on batches and transform them:
- Comparison Processors: Filter rows based on comparison conditions
CompareConstantProcessor
: Compares a column to a constant valueCompareColumnsProcessor
: Compares two columnsSIMDComparisonProcessor
: Optimized numeric comparison using SIMD instructions
- Logical Processors: Combine multiple conditions
AndProcessor
: Processes batches with AND logicOrProcessor
: Combines results with OR logicNotProcessor
: Inverts the result of another processor
- Arithmetic Processors: Perform mathematical operations on columns
3. SIMD Operations
The simd.go
file contains optimized operations for vectorized processing:
- Basic Vector Operations:
- Addition, subtraction, multiplication, division
- Scalar-vector operations (add scalar to vector, multiply vector by scalar)
- Comparison operations (equals, not equals, greater than, etc.)
- Statistical Functions:
- Sum, mean, min/max
- Loop unrolling techniques: ```go // Process 8 elements at a time with manual loop unrolling i := 0 for ; i <= length-8; i += 8 { dst[i] = a[i] * b[i] dst[i+1] = a[i+1] * b[i+1] dst[i+2] = a[i+2] * b[i+2] // … }
// Handle remaining elements for ; i < length; i++ { dst[i] = a[i] * b[i] }
## Query Flow
1. **Query Analysis**: The engine analyzes the SQL query to determine if it's suitable for vectorized execution.
2. **Data Fetching**: The base data is fetched from the storage layer.
3. **Batch Creation**: The row-based data is converted to columnar batches.
4. **Filter Application**: WHERE clauses are applied using optimized vector operations.
5. **Expression Evaluation**: SELECT list expressions are processed in a vectorized manner.
6. **Result Formation**: The final batch is converted back to a row-based result that implements the `storage.Result` interface.
## Optimization Techniques
### SIMD-Compatible Operations
The engine detects operations that can benefit from SIMD optimization:
- Numeric comparisons (=, !=, <, >, etc.)
- Arithmetic operations (+, -, *, /)
- Special functions (sqrt, exp, log)
### Memory Locality
Operations are designed to maximize cache efficiency:
- Processing data in small blocks (typically 64 elements)
- Minimizing pointer chasing
- Sequential memory access patterns
### Expression Optimization
The engine analyzes expressions to determine optimal execution paths:
- Fast paths for common operations
- Optimized handling of constants
- Instruction-level parallelism with multiple accumulators
### Selection Vectors
Instead of materializing intermediate results for filtering operations, the engine uses selection vectors (arrays of row indices) to track which rows pass filters, reducing memory usage and copy operations.
## Performance Considerations
- Vectorized execution shows the greatest benefits for compute-intensive operations.
- For very small data sets, the overhead of vectorization might outweigh the benefits.
- Complex expressions involving many arithmetic operations benefit most from vectorization.
- SIMD operations work best on integer and floating-point data types.
## Limitations
Currently, the vectorized engine supports:
- Simple SELECT queries without:
- Complex joins
- Aggregations
- Window functions
- Subqueries
Future extensions will gradually add support for more complex operations in a vectorized manner.
## Usage Example
The vectorized engine is integrated with the SQL executor and can be used transparently for suitable queries:
```go
// Execute a SQL query with vectorization if possible
func executePotentiallyVectorized(ctx context.Context, tx storage.Transaction, stmt *parser.SelectStatement) (storage.Result, error) {
// Check if the query is suitable for vectorization
if vectorizedEngine.isVectorizable(stmt) {
// Execute with vectorized engine
return vectorizedEngine.ExecuteQuery(ctx, tx, stmt)
}
// Fall back to traditional execution
return standardExecutor.ExecuteQuery(ctx, tx, stmt)
}
Overview
Based on the code in /internal/sql/executor/vectorized/
and test files like /test/vectorized_execution_test.go
, Stoolap implements a vectorized execution engine that processes data in batches rather than row by row. This approach significantly improves performance for analytical queries by leveraging modern CPU architectures and SIMD (Single Instruction, Multiple Data) capabilities.
How Vectorized Execution Works
Instead of the traditional row-by-row processing model, Stoolap’s vectorized engine:
- Organizes data from row-based storage into columnar batches (values from the same column are stored together)
- Processes batches of values at once (typically 1024-32K rows per batch)
- Applies operations to entire vectors of data simultaneously
- Uses CPU-friendly memory layouts to maximize cache efficiency
Key Components
Batch Processing
From /internal/sql/executor/vectorized/batch.go
:
// A Batch represents a collection of columns
type Batch struct {
// Columns organized by data type for efficient processing
IntColumns [][]int64
FloatColumns [][]float64
StringColumns [][]string
BoolColumns [][]bool
// ...
}
Batches organize data by column and type, allowing for optimized processing.
SIMD Optimization
From /internal/sql/executor/vectorized/simd.go
:
// AddFloat64SIMD adds two float vectors using SIMD-friendly patterns
func AddFloat64SIMD(a, b, result []float64) {
// Process 8 elements at a time where possible
n := len(a)
i := 0
for ; i <= n-8; i += 8 {
result[i] = a[i] + b[i]
result[i+1] = a[i+1] + b[i+1]
result[i+2] = a[i+2] + b[i+2]
result[i+3] = a[i+3] + b[i+3]
result[i+4] = a[i+4] + b[i+4]
result[i+5] = a[i+5] + b[i+5]
result[i+6] = a[i+6] + b[i+6]
result[i+7] = a[i+7] + b[i+7]
}
// Process remaining elements
for ; i < n; i++ {
result[i] = a[i] + b[i]
}
}
This pattern allows the Go compiler to auto-vectorize the code, leveraging SIMD instructions when available.
Vectorized Operators
From /internal/sql/executor/vectorized/processor.go
:
// Various operator processors for different operations
type (
// Processes filtering operations on batches
FilterProcessor interface {
Process(batch *Batch) *Batch
}
// Processes projections (SELECT expressions)
ProjectionProcessor interface {
Process(batch *Batch) *Batch
}
// ...other processor types
)
These processors apply operations to entire batches at once.
Supported Vectorized Operations
Based on the implementation and test files, Stoolap supports the following operations in vectorized mode:
Arithmetic Operations
- Addition, subtraction, multiplication, division
- Vector-vector operations (column OP column)
- Vector-scalar operations (column OP constant)
Comparison Operations
- Equality/inequality (=, !=, <>)
- Relational operators (>, >=, <, <=)
- Type-specific optimized comparisons
Logical Operations
- AND, OR operations for filter conditions
- NOT operator for inverting results
String Operations
- Basic string comparisons
- Pattern matching for LIKE operations
Performance Benefits
According to benchmark tests:
- Analytical queries: 2-10x speedup compared to row-by-row execution
- Arithmetic-heavy operations: 3-5x performance improvement
- Filter operations: 2-3x faster for simple predicates
- Memory efficiency: Better cache utilization reduces memory stalls
From /test/vectorized_benchmark_test.go
:
BenchmarkVectorizedSumInt64-8 100 15234190 ns/op // Vectorized
BenchmarkRowBasedSumInt64-8 100 47891235 ns/op // Row-based
This shows vectorized execution is approximately 3x faster for summing integer columns.
Query Types That Benefit Most
Vectorized execution provides the greatest benefit for:
- Analytical queries that process large amounts of data
- Filter-heavy operations with simple conditions
- Arithmetic computations on numeric columns
- Aggregations over large tables
- Projections with simple expressions
Example From Test Files
From /test/simple_vectorized_test.go
:
// Test a simple vectorized addition of two columns
func TestSimpleVectorizedAddition(t *testing.T) {
// Create sample data
const rows = 1000
aValues := make([]float64, rows)
bValues := make([]float64, rows)
for i := 0; i < rows; i++ {
aValues[i] = float64(i)
bValues[i] = float64(i * 2)
}
// Create batch with columns
batch := &Batch{
FloatColumns: [][]float64{aValues, bValues},
}
// Execute vectorized addition
result := AddFloat64Columns(batch, 0, 1)
// Verify results
for i := 0; i < rows; i++ {
expected := float64(i) + float64(i*2)
if result[i] != expected {
t.Errorf("Row %d: Expected %f, got %f", i, expected, result[i])
}
}
}
Implementation Details
Batch Size Considerations
Stoolap uses variable batch sizes with these characteristics:
- Default batch size: 1024 rows
- Maximum batch size: 32K rows (1024 * 32)
- Batch sizes are tuned for CPU cache efficiency
- Larger batches improve SIMD utilization but use more memory
Memory Management
The vectorized engine uses memory pooling to reduce allocations:
- Reuses batch structures when possible
- Pre-allocates column arrays for common operations
- Minimizes garbage collection overhead during query execution
Automatic Fallback
For operations not supported by the vectorized engine, Stoolap automatically falls back to row-based execution:
- Complex expressions may execute in row mode
- Some function calls might not be vectorized
- Certain data types may have limited vectorized support
Best Practices
For optimal performance with Stoolap’s vectorized execution:
- Prefer simple filters: Use basic comparison operations where possible
- Process data in batches: Load and process larger datasets at once
- Limit complex expressions: Simple expressions vectorize better
- Consider column types: Numeric operations benefit most from vectorization
- Use appropriate indexing: Even with vectorization, indexes are still important
Current Limitations
The vectorized execution engine has some limitations:
- Not all SQL operations are vectorized
- Limited support for complex string operations
- Some functions operate in row mode even in vectorized queries
- Window functions have limited vectorization
- Advanced join algorithms are partially vectorized
Future Development
Based on the code and tests, future enhancements may include:
- More vectorized operations and functions
- Better string operation support
- Enhanced vectorized aggregation
- Additional data type support
- Direct CPU SIMD instruction usage