Storage Engine
Storage Engine
This document provides a detailed overview of Stoolap’s storage engine, including its design principles, components, and how data is stored and retrieved.
Storage Engine Design
Stoolap’s storage engine is designed with the following principles:
- HTAP Architecture - Combines row-based storage with columnar indexing for hybrid workloads
- Memory-optimized - Prioritizes in-memory performance with optional persistence
- MVCC-based - Uses multi-version concurrency control for transaction isolation
- Version-organized - Tracks different versions of rows for transaction isolation
- Type-specialized - Uses different strategies for different data types
- Index-accelerated - Multiple index types to optimize different query patterns
Storage Components
Table Structure
Tables in Stoolap are composed of:
- Metadata - Schema information, column definitions, and indexes
- Row Data - The primary data storage, organized by row
- Version Store - Tracks row versions for MVCC
- Columnar Indexes - Optional column-oriented indexes for analytical queries
- Transaction Manager - Manages transaction state and visibility
Data Types
Stoolap supports a variety of data types, each with optimized storage:
- Int64 - Optimized for 64-bit integers
- Float64 - Optimized for 64-bit floating-point numbers
- String - Optimized for variable-length string data
- Bool - Optimized for boolean values
- Timestamp - Optimized for datetime values
- JSON - Optimized for JSON documents
- NULL values - Efficiently tracked for any data type
Version Management
Stoolap tracks different versions of data for transaction isolation:
- Each change creates a new version rather than overwriting
- Versions are associated with transaction IDs
- Visibility rules determine which versions each transaction can see
- Old versions are garbage collected when no longer needed
Data Storage Format
In-Memory Format
In memory, data is stored with these characteristics:
- Row-based primary storage - Records are stored as coherent rows
- Version chains - Linked versions for MVCC
- Columnar indexes - Column-oriented organization for analytical access
- Type-specific structures - Optimized for different data types
- Bitmap filters - Fast filtering for lookups
On-Disk Format
When persistence is enabled, data is stored on disk with:
- Binary serialization - Compact binary format for storage
- Row files - Primary data in row format
- Index files - Columnar index data
- Metadata files - Schema and index information
- WAL files - Write-ahead log for durability
- Snapshot files - Point-in-time table snapshots
MVCC Implementation
The storage engine uses MVCC to provide transaction isolation:
- Row Versioning - Multiple versions of each row are maintained
- Transaction IDs - Each version is associated with a transaction ID
- Visibility Rules - Determine which versions are visible to each transaction
- Garbage Collection - Old versions are cleaned up when no longer needed
For more details, see the MVCC Implementation and Transaction Isolation documentation.
Data Access Paths
OLTP Access Path
For transactional operations:
- The storage engine uses row-based access for efficiency
- Indexes help locate specific rows quickly
- Visibility rules are applied to ensure transaction isolation
- Point operations are optimized for low latency
OLAP Access Path
For analytical operations:
- Columnar indexes are used to minimize data access
- Batch processing is used for vectorized execution
- Filter pushdown optimizes query execution
- Parallel scanning improves throughput
Data Modification
Insert Operations
When data is inserted:
- Values are validated against column types
- A new row version is created with the current transaction ID
- The row is added to the primary row storage
- Columnar indexes are updated
- The operation is recorded in the WAL (if enabled)
Update Operations
When data is updated:
- The existing row versions are located
- Visibility rules determine which versions to update
- New row versions are created with updated values
- Old versions are marked as deleted by the current transaction
- Indexes are updated to reflect the changes
- The operation is recorded in the WAL (if enabled)
Delete Operations
When data is deleted:
- The existing row versions are located
- Visibility rules determine which versions to delete
- Matched versions are marked as deleted by the current transaction
- Indexes are updated to reflect the deletions
- The operation is recorded in the WAL (if enabled)
Persistence and Recovery
When persistence is enabled:
Write-Ahead Logging (WAL)
- All modifications are recorded in the WAL before being applied
- WAL entries include transaction ID, operation type, and data
- WAL is flushed to disk based on sync mode configuration
- This ensures durability in case of crashes
Snapshots
- Periodically, consistent snapshots of tables are created
- Snapshots represent a point-in-time state of each table
- Multiple snapshots may be retained for safety
- Snapshots accelerate recovery compared to replaying the entire WAL
Recovery Process
After a crash, recovery proceeds as follows:
- The latest valid snapshot is loaded for each table
- WAL entries after the snapshot are replayed
- Transaction state is reconstructed
- Incomplete transactions are rolled back
Memory Management
The storage engine includes several memory optimization techniques:
Buffer Pool
- Reusable memory buffers reduce allocation overhead
- Buffers are managed in pools by size categories
- Used for both in-memory operations and disk I/O
Value Pool
- Specialized object pooling for common data types
- Reduces garbage collection pressure
- Particularly beneficial for string and structural data
Segment Maps
- Efficient concurrent data structures
- Optimized for different access patterns
- Specialized implementations for common key types (Int64)
Implementation Details
Core storage engine components:
- storage.go - Storage interfaces and factory functions
- column.go - Column type implementations
- mvcc/ - MVCC implementation components
- engine.go - MVCC storage engine
- table.go - Table implementation with row-based storage
- transaction.go - Transaction management
- version_store.go - Version tracking for rows
- columnar_index.go - Columnar indexing for analytics
- scanner.go - Data scanning
- disk_version_store.go - Persistence implementation
- wal_manager.go - Write-ahead logging
- compression/ - Data compression implementations
- expression/ - Storage-level expression evaluation
HTAP Performance Characteristics
OLTP Performance
- Fast Point Operations - Row-based storage optimizes transaction processing
- Low Latency - Designed for quick response time
- High Concurrency - MVCC enables simultaneous access
- Efficient Writes - Optimized for insert and update operations
OLAP Performance
- Columnar Scanning - Columnar indexes optimize analytical queries
- Predicate Pushdown - Filtering at the lowest level
- Vectorized Processing - Batch operations for high throughput
- Parallel Execution - Multiple cores utilized efficiently
Hybrid Benefits
- No ETL - Data immediately available for both workloads
- Real-time Analytics - Query live transactional data
- Consistent View - Transaction isolation across all operations
- Unified Management - Single system for all database operations