Data Compression in Stoolap
Data Compression in Stoolap
Stoolap implements multiple specialized compression algorithms to reduce storage requirements while maintaining query performance. This document outlines the compression techniques used in Stoolap and how they benefit different types of data.
Compression Architecture
Stoolap uses a column-oriented compression approach:
- Each column is compressed independently
- Compression algorithms are selected based on data type and patterns
- Decompression is performed on-demand during query execution
- Different compression strategies can be applied to different columns
Compression Algorithms
Stoolap implements the following compression algorithms:
Dictionary Compression
Applied to columns with repeating values, especially effective for strings with low to medium cardinality:
- Maps unique values to integer identifiers
- Stores the mapping in a dictionary
- References dictionary values instead of storing duplicate values
- Particularly effective for columns like status codes, categories, countries, etc.
Run-Length Encoding (RLE)
Optimal for columns with sequences of repeated values:
- Stores value and count pairs for consecutive identical values
- Especially effective for boolean columns, sparse columns, or sorted data
- Also beneficial for columns with low cardinality values like enum types
Delta Encoding
Applied to columns with gradually changing numeric values:
- Stores the differences between consecutive values instead of actual values
- Works exceptionally well for timestamps, sequential IDs, or sorted numbers
- Different variants optimize for different patterns of numeric data
Bit-Packing
Used for integer columns with values in a limited range:
- Determines the minimum number of bits needed to represent values
- Packs multiple values into a single storage unit (words)
- Effective for small integers, boolean values, and low-cardinality data
Time-Specific Compression
Specialized for datetime and timestamp columns:
- Decomposes timestamps into components (year, month, day, hour, etc.)
- Applies different compression strategies to each component
- Particularly effective for time-series data with regular patterns
JSON Compression
Specialized for JSON columns:
- Structure-aware compression for JSON documents
- Shares common keys and patterns across documents
- Efficiently handles nested structures and arrays
Automatic Algorithm Selection
Stoolap automatically selects the optimal compression algorithm for each column based on:
- Data type (numeric, string, boolean, datetime, JSON)
- Data distribution and patterns
- Column cardinality (number of unique values)
- Query access patterns
Compression Ratios
Typical compression ratios achieved in Stoolap:
- String columns with low cardinality: 10-20x
- Numeric columns with patterns: 3-10x
- Timestamp columns: 4-8x
- Boolean or flag columns: 8-16x
- JSON documents: 2-5x
Performance Impact
Stoolap’s compression affects performance in several ways:
Benefits
- Reduced I/O - Less data transferred from disk to memory
- Better cache efficiency - More data fits in CPU cache
- Reduced memory usage - More data can be processed in memory
- Some operations can work directly on compressed data
Considerations
- CPU overhead - Compression/decompression requires CPU cycles
- Query selectivity impact - Highly selective queries may not benefit as much
- Write amplification - Writes may require more processing
Implementation Details
Stoolap’s compression is implemented with the following components:
- compression.go - Compression interfaces and factory
- bitpack_compressor.go - Bit-packing implementation
- delta_compressor.go - Delta encoding implementation
- dictionary_compressor.go - Dictionary compression implementation
- runlength_compressor.go - Run-length encoding implementation
- time_compressor.go - Specialized datetime compression
- json_compressor.go - JSON document compression
Best Practices
To get the most out of Stoolap’s compression:
- Choose appropriate column data types - Using the most specific data type improves compression
- Order tables on highly compressible columns - Improves run-length and delta encoding
- Use enum-like values for categorization - Improves dictionary compression
- Structure JSON data consistently - Improves JSON compression
- Normalize data when appropriate - Can improve overall compression ratio
Future Improvements
Stoolap’s compression system is designed to evolve with new algorithms and optimizations:
- Adaptive compression based on workload patterns
- Hybrid compression algorithms for mixed data patterns
- Runtime adjustable compression levels
- SIMD-accelerated compression/decompression