Storage Engine
Storage Engine
This document provides a detailed overview of Stoolap’s storage engine, including its design principles, components, and how data is stored and retrieved.
Storage Engine Design
Stoolap’s storage engine is designed with the following principles:
- Memory-optimized - Prioritizes in-memory performance with optional persistence
- MVCC-based - Uses multi-version concurrency control for transaction isolation
- Version-organized - Tracks different versions of rows for transaction isolation
- Type-specialized - Uses different strategies for different data types
- Index-accelerated - Multiple index types to optimize different query patterns
Storage Components
Table Structure
Tables in Stoolap are composed of:
- Metadata - Schema information, column definitions, and indexes
- Row Data - The primary data storage, organized by row
- Version Store - Tracks row versions for MVCC
- Indexes - B-tree, Hash, Bitmap, and multi-column indexes
- Transaction Manager - Manages transaction state and visibility
Data Types
Stoolap supports a variety of data types, each with optimized storage:
- INTEGER - 64-bit signed integers
- FLOAT - 64-bit floating-point numbers
- TEXT - Variable-length UTF-8 strings
- BOOLEAN - Boolean values (true/false)
- TIMESTAMP - Date and time values
- JSON - JSON documents
- NULL - Null values supported for all types
Version Management
Stoolap tracks different versions of data for transaction isolation:
- Each change creates a new version rather than overwriting
- Versions are associated with transaction IDs
- Visibility rules determine which versions each transaction can see
- Old versions are garbage collected when no longer needed
Data Storage Format
In-Memory Format
In memory, data is stored with these characteristics:
- Row-based primary storage - Records are stored as coherent rows
- Version chains - Linked versions for MVCC
- Type-specific indexes - B-tree, Hash, Bitmap based on column type
- Efficient structures - Optimized for different data types
On-Disk Format
When persistence is enabled, data is stored on disk with:
- Binary serialization - Compact binary format for storage
- WAL files - Write-ahead log for durability
- Snapshot files - Point-in-time table snapshots
- Metadata files - Schema and index information
MVCC Implementation
The storage engine uses MVCC to provide transaction isolation:
- Full Version Chains - Version history per row linked via pointers
- Transaction IDs - Each version is associated with a transaction ID
- Visibility Rules - Traverse version chains to find visible versions
- Lock-Free Reads - Readers never block writers
- Automatic Cleanup - Old versions garbage collected when no longer needed
For more details, see the MVCC Implementation and Transaction Isolation documentation.
Data Access Paths
Point Lookups
For point queries (e.g., WHERE id = 5):
- Use primary key or index to locate the row
- Apply visibility rules based on transaction
- Return the visible version
Range Scans
For range queries (e.g., WHERE price > 100):
- Use B-tree index if available for the column
- Scan matching index entries
- Apply visibility rules to each row
- Return visible results
Full Table Scans
For queries without applicable indexes:
- Scan all rows in the table
- Apply WHERE clause filters
- Apply visibility rules
- For large tables, parallelize the scan
Data Modification
Insert Operations
When data is inserted:
- Values are validated against column types
- A new row version is created with the current transaction ID
- The row is added to the primary row storage
- Indexes are updated
- The operation is recorded in the WAL (if enabled)
Update Operations
When data is updated:
- The existing row is located via indexes or scan
- A new version is created with updated values
- The new version links to the previous version
- Indexes are updated to reflect the changes
- The operation is recorded in the WAL (if enabled)
Delete Operations
When data is deleted:
- The existing row is located
- A deletion marker version is created
- Indexes are updated to reflect the deletion
- The operation is recorded in the WAL (if enabled)
Persistence and Recovery
When persistence is enabled:
Write-Ahead Logging (WAL)
- All modifications are recorded in the WAL before being applied
- WAL entries include transaction ID, operation type, and data
- WAL is flushed to disk for durability
- This ensures recovery in case of crashes
Snapshots
- Periodically, consistent snapshots of tables are created
- Snapshots contain the latest version of each row
- Snapshots accelerate recovery compared to replaying the entire WAL
Recovery Process
After a crash, recovery proceeds as follows:
- The latest valid snapshot is loaded for each table
- WAL entries after the snapshot are replayed
- Index definitions are restored and indexes rebuilt
- Incomplete transactions are rolled back
Implementation Details
Core storage engine components in the Rust codebase:
src/storage/
├── mod.rs # Storage module entry point
├── traits/ # Storage interfaces
│ ├── engine.rs # Engine trait
│ ├── table.rs # Table trait
│ └── transaction.rs # Transaction trait
└── mvcc/ # MVCC implementation
├── engine.rs # MVCC storage engine
├── table.rs # Table with row storage
├── transaction.rs # Transaction management
├── version_store.rs # Version tracking
├── btree_index.rs # B-tree index
├── hash_index.rs # Hash index
├── bitmap_index.rs # Bitmap index
├── multi_column_index.rs # Multi-column index
└── persistence.rs # WAL and snapshots
Performance Characteristics
Read Performance
- Point Lookups - O(1) with hash index, O(log n) with B-tree
- Range Scans - O(log n + k) with B-tree index
- Full Scans - Parallelized for large tables
Write Performance
- Inserts - O(log n) per index
- Updates - O(log n) per index plus version creation
- Deletes - O(log n) per index for marker creation
Concurrency
- High Read Concurrency - MVCC enables many concurrent readers
- Write Concurrency - Multiple writers with conflict detection
- No Read Locks - Readers never block on writes