Persistence

Stoolap provides durable storage through a combination of Write-Ahead Logging (WAL) and immutable cold volumes. This architecture ensures data durability while maintaining high performance, even for tables with millions of rows.

Overview

Stoolap’s persistence layer consists of three components:

  1. Write-Ahead Log (WAL): Records all changes before they are applied to memory
  2. Cold Volumes: Immutable columnar storage for sealed historical data
  3. Backup Snapshots: Optional point-in-time backup files for disaster recovery

This approach provides:

  • Durability: Changes are persisted before acknowledgment
  • Fast Recovery: Volumes load in milliseconds instead of seconds
  • Crash Safety: WAL ensures no committed transactions are lost
  • One Invariant: For any row_id, the newest source wins. Hot overrides cold. Newer volumes override older volumes.

Enabling Persistence

To enable persistence, use a file:// connection string:

use stoolap::Database;

// In-memory only (no persistence)
let db = Database::open("memory://")?;

// With disk persistence
let db = Database::open("file:///path/to/database")?;

Command line:

# In-memory
stoolap

# With persistence
stoolap --db "file:///path/to/database"

Write-Ahead Log (WAL)

How WAL Works

  1. When a transaction commits, changes are first written to the WAL file
  2. The WAL is synced to disk (based on sync_mode)
  3. Changes are then applied to the in-memory structures
  4. Transaction is acknowledged to the client

This sequence ensures that committed transactions survive crashes.
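
The user-visible effect is that an acknowledged write survives a restart. A minimal sketch (the path and table are illustrative; with sync_mode=full the commit is fsynced before execute returns):

use stoolap::Database;

let db = Database::open("file:///tmp/wal_demo?sync_mode=full")?;
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, data TEXT)", ())?;
// By the time this returns, the change has been written and synced to the WAL
db.execute("INSERT INTO events (id, data) VALUES (1, 'hello')", ())?;
drop(db);

// Reopen: recovery (volume load plus WAL replay) restores the committed row
let db = Database::open("file:///tmp/wal_demo?sync_mode=full")?;
let rows = db.query("SELECT * FROM events", ())?;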

WAL Operations

The WAL records these operations:

  • INSERT: New row insertions
  • UPDATE: Row modifications
  • DELETE: Row deletions
  • TRUNCATE: Bulk row removal
  • CREATE TABLE: Table creation (DDL)
  • DROP TABLE: Table deletion (DDL)
  • CREATE INDEX: Index creation (DDL)
  • DROP INDEX: Index deletion (DDL)
  • ALTER TABLE: Schema modifications (ADD/DROP/RENAME/MODIFY COLUMN, RENAME TABLE)
  • CREATE VIEW / DROP VIEW: View management (DDL)

WAL Configuration

Configure WAL behavior via the connection string (these settings take effect at database open time and cannot be changed at runtime):

file:///path/to/db?sync_mode=normal&wal_flush_trigger=65536

Read current values with PRAGMA:

PRAGMA sync_mode;           -- returns current mode (0, 1, or 2)
PRAGMA wal_flush_trigger;   -- returns current buffer size

The available sync modes:

  • None (0): No fsync. The WAL is written to the OS buffer cache only. Data becomes durable at the next checkpoint cycle (every 60s by default), when it is sealed into fsynced volume files. Maximum data loss on crash: about one checkpoint interval (~60s).
  • Normal (1): Fsync the WAL at most once per second and on DDL operations. Similar to SQLite WAL mode with synchronous=NORMAL. Maximum data loss on crash: ~1s.
  • Full (2): Fsync on every WAL write. Maximum durability at the cost of write throughput. Maximum data loss on crash: none.
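
For example, to pick a mode at open time and read it back later (a sketch; the path is illustrative, and the PRAGMA only reports the value because WAL settings are fixed at open):

use stoolap::Database;

let db = Database::open("file:///var/lib/myapp/data?sync_mode=full")?;
// sync_mode was fixed when the database was opened; PRAGMA only reads it back
let mode = db.query("PRAGMA sync_mode", ())?;   // expected to report 2 (Full)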

WAL Files

WAL files are stored in the database directory:

/path/to/database/
  wal/
    wal_000001.log
    wal_000002.log
    ...

Old WAL files are automatically cleaned up after successful checkpoints when all hot data has been sealed into volumes.

Cold Volumes (Immutable Segments)

Cold volumes are the primary persistence mechanism. Each table is logically: Hot Delta + Cold Immutable Segments. Hot data lives in the MVCC B-tree (WAL-backed, mutable). Cold data lives in frozen volumes (columnar, zone maps, bloom filters, dictionary encoding, CRC32 integrity).

How Volumes Work

The background checkpoint thread periodically seals hot buffer rows into cold segments:

  1. Seal: When a table accumulates enough committed rows in hot (100K for first seal, 10K incremental), the rows are written to a columnar .vol file in volumes/<table>/
  2. Remove from hot: Sealed rows are removed from the B-tree and hot indexes
  3. Register segment: The new cold segment is registered in the segment manager
  4. Persist manifest: The manifest (volume list, tombstones, checkpoint LSN) is written to disk

On clean shutdown, a force checkpoint seals ALL remaining hot rows regardless of threshold.

When snapshot isolation transactions are active, seal uses a cutoff filter: only rows committed before the earliest snapshot’s begin_seq are sealed. Rows committed after stay in the hot buffer where MVCC visibility handles them correctly. This allows seal and compaction to proceed during long-running snapshot transactions instead of being blocked entirely.
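
A sketch of that cutoff rule (the function and names are illustrative, not the actual internals):

// Illustrative only: a committed hot row is eligible for sealing when no open
// snapshot transaction could still need to see it in the hot buffer.
fn can_seal(row_commit_seq: u64, earliest_snapshot_begin_seq: Option<u64>) -> bool {
    match earliest_snapshot_begin_seq {
        // No snapshot transactions are open: every committed row may be sealed
        None => true,
        // Otherwise, only rows committed before the earliest snapshot began
        Some(begin_seq) => row_commit_seq < begin_seq,
    }
}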

Per-Volume Skip Sets (Deduplication)

For any row_id, the newest source wins:

  1. Hot buffer rows override all cold volumes
  2. Newer cold volumes override older cold volumes
  3. Tombstones (manifest-tracked) mark cold rows deleted by DML

During scans, a cumulative skip set is built:

  1. Start with hot row_ids + committed tombstones
  2. For each volume, newest first: the volume’s scanner receives the current skip set, then that volume’s row_ids are added to the set so they shadow the same rows in older volumes
  3. Results are merged with a chain scanner

No separate deduplication step is needed.
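
A sketch of the cumulative skip set (types and names are illustrative, not the engine's actual structures):

use std::collections::HashSet;

// Illustrative only: visit sources newest first and skip any row_id already
// covered by a newer source, so the newest version of each row wins.
fn scan_with_skip_set(hot_row_ids: &[u64], tombstones: &[u64], volumes_newest_first: &[Vec<u64>]) {
    // Start with hot row_ids plus committed tombstones
    let mut skip: HashSet<u64> = hot_row_ids.iter().chain(tombstones).copied().collect();

    for volume_row_ids in volumes_newest_first {
        // This volume's scanner ignores anything a newer source already produced
        scan_volume(volume_row_ids, &skip);
        // Its own row_ids then shadow the same rows in older volumes
        skip.extend(volume_row_ids.iter().copied());
    }
}

fn scan_volume(_row_ids: &[u64], _skip: &HashSet<u64>) { /* feed into the chain scanner */ }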

Tombstones

When a cold row is deleted or updated, a versioned tombstone is recorded as a (row_id, commit_seq) pair:

  • Per-transaction pending: Tracked on the segment manager, applied at commit with the transaction’s commit sequence
  • Committed tombstones: Stored in the manifest (V6 format), persisted atomically
  • Snapshot isolation: A snapshot transaction at begin_seq=N only sees tombstones with commit_seq <= N. Newer tombstones are invisible, so the original cold row remains visible to the older snapshot. Auto-commit transactions see all tombstones (see the sketch after this list).
  • Cleared after compaction: Rows are physically removed from the merged volume
  • Cleared after seal: When sealed row_ids overlap with existing tombstones, those tombstones are removed (the new volume is authoritative)
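
A sketch of that visibility rule (illustrative names only):

// Illustrative only: whether a versioned tombstone applies to a given reader.
fn tombstone_visible(tombstone_commit_seq: u64, snapshot_begin_seq: Option<u64>) -> bool {
    match snapshot_begin_seq {
        // Auto-commit transactions see all committed tombstones
        None => true,
        // A snapshot at begin_seq = N only sees tombstones committed at or before N
        Some(n) => tombstone_commit_seq <= n,
    }
}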

Compaction

Compaction keeps volumes at a bounded target size (target_volume_rows, default 1M rows). Only volumes that need work are rewritten. At-target volumes with no tombstones are frozen and never touched.

Three categories trigger compaction:

  • Sub-target: Volumes smaller than target_volume_rows (from recent seals). Merged together when count exceeds compact_threshold.
  • Oversized: Volumes larger than 150% of target (legacy or bulk-loaded). Split into target-sized volumes.
  • Dirty: At-target volumes with tombstoned rows (DELETEd or UPDATEd). Rewritten to physically remove dead rows and merge in newer versions.

The compaction process:

  1. Select volumes by category (sub-target + oversized + dirty). Leave clean at-target volumes untouched.
  2. Iterate selected volumes newest-first, collecting live rows (dedup by row_id, newest wins)
  3. Split the output into row-group aligned volumes: the output size is target_volume_rows rounded down to the nearest multiple of 64K rows (see the sketch after this list)
  4. Write each output volume to disk
  5. Atomically register all new volumes and remove old ones (no visibility gap)
  6. Clear tombstones for row_ids in the merged volumes
  7. Persist manifest before deleting old volume files (crash safety)
  8. Cutoff-filtered during snapshot isolation: only rows committed before the earliest snapshot’s begin_seq are compacted. Post-snapshot tombstones are preserved. Volume seal_seq metadata tracks when each volume was created for compaction eligibility
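
The step-3 alignment is a simple calculation (illustrative sketch, not the engine's code):

// Output volumes hold a whole number of 64K row groups, derived from target_volume_rows.
fn output_volume_rows(target_volume_rows: u64) -> u64 {
    const ROW_GROUP: u64 = 65_536;
    (target_volume_rows / ROW_GROUP).max(1) * ROW_GROUP
}

// output_volume_rows(1_048_576) == 1_048_576  (exactly 16 row groups, the default)
// output_volume_rows(1_000_000) ==   983_040  (rounded down to 15 row groups)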

What Volumes Optimize

Frozen volumes store data column by column with multiple query acceleration techniques:

  • Startup: Loading cold data from volumes takes milliseconds instead of seconds
  • Aggregation without WHERE: SELECT SUM(x), MIN(x), MAX(x), AVG(x), COUNT(*) answers from pre-computed per-volume statistics without scanning any row data
  • Filtered aggregation (columnar pushdown): SELECT SUM(x) FROM t WHERE col > 10 evaluates predicates and accumulates aggregates directly on raw i64/f64 arrays, without constructing intermediate Value or Row objects. Dictionary-encoded text equality predicates resolve to u32 dict_id comparisons
  • Grouped aggregation: SELECT exchange, COUNT(*) FROM t GROUP BY exchange uses dictionary-indexed accumulator arrays for text columns (zero hashing in the inner loop) and FxHashMap for integer/timestamp columns
  • DISTINCT: SELECT DISTINCT exchange FROM t extracts unique values from per-volume dictionary metadata without scanning rows
  • ORDER BY PK + LIMIT: SELECT * FROM t ORDER BY id LIMIT 10 uses a k-way merge across sorted volume row_ids and the hot buffer B-tree, stopping after the requested rows
  • IN list pruning: WHERE id IN (1, 2, 3) derives min/max bounds from the IN values for zone-map pruning
  • Zone map pruning: Per-column min/max metadata per volume and per 64K row group. Volumes that cannot match the WHERE clause are skipped entirely
  • Bloom filters: Per-column bloom filters for fast equality rejection on text columns
  • Binary search: Sorted columns (like the primary key) use binary search to narrow scan ranges within a volume
  • Parallel scanning: When the parallel feature is enabled (default), multiple volumes are scanned concurrently using rayon. A threshold guard ensures small queries use the sequential path to avoid scheduling overhead
  • OFFSET skip: Large OFFSET values skip row materialization for discarded rows, only constructing Row objects for rows that will be returned
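
All of these are applied automatically to ordinary SQL; no hints are needed. For example (table and column names are illustrative):

let db = Database::open("file:///path/to/database")?;

// Answered from per-volume statistics and dictionary metadata where possible
let totals = db.query("SELECT COUNT(*), MIN(price), MAX(price) FROM trades", ())?;
let venues = db.query("SELECT DISTINCT exchange FROM trades", ())?;

// Zone maps prune volumes that cannot match the predicate
let recent = db.query("SELECT SUM(price) FROM trades WHERE id IN (1, 2, 3)", ())?;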

Volume File Layout

/path/to/database/
  volumes/
    table_name/
      vol_00064d50e5946141.vol    # Sealed cold segment
      manifest.bin                # Segment metadata (volumes, tombstones, checkpoint LSN)

Volume files include a trailing CRC32 checksum for corruption detection. Bloom filters are serialized alongside the volume data (no rebuild on load).

Checkpoint Cycle

PRAGMA CHECKPOINT (and the periodic background cycle) executes:

  1. Seal: Move hot buffer rows into new immutable .vol files (per-table seal fence ensures DML consistency)
  2. Fence: Brief exclusive check that all hot buffers are empty, advance WAL checkpoint LSN
  3. Re-record DDL: Write DDL entries after checkpoint LSN so they survive WAL truncation
  4. Persist manifests: Write manifest.bin per table (volume list, tombstones, checkpoint LSN) atomically via fsync-before-rename
  5. WAL truncate: Remove WAL entries before checkpoint LSN (only when all manifests are persisted)
  6. Compact: Merge sub-target, oversized, and tombstoned volumes into target-sized outputs. Runs inline for explicit PRAGMA CHECKPOINT, offloaded to a background thread for the periodic automatic cycle.
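
For example, after a bulk load you can run the cycle manually instead of waiting for the periodic interval (sketch):

// Seal all eligible hot rows into cold volumes, persist manifests, truncate the
// WAL, and compact inline, rather than waiting for the background cycle
db.execute("PRAGMA CHECKPOINT", ())?;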

Constraints and DML

Volume-backed tables enforce all constraints:

  • Primary key: Checked against cold segments using zone maps and binary search
  • Unique indexes: Checked against cold segments with zone map pruning, bloom filters, and per-volume hash indexes (built lazily on first lookup per column set, never invalidated since volumes are immutable)
  • UPDATE and DELETE: A tombstone is created for the cold row (deferred to commit). The new version goes to hot. The skip set handles dedup during scans
  • TRUNCATE and DROP TABLE: Cold segments and tombstones are cleared both in memory and on disk
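
From SQL these cases behave exactly like an in-memory table (sketch; table, column, and values are illustrative):

// id = 42 may live in a cold volume; the primary key check still rejects the duplicate
let duplicate = db.execute("INSERT INTO users (id, name) VALUES (42, 'again')", ());
assert!(duplicate.is_err());

// The cold row receives a tombstone at commit; the new version goes to the hot buffer
db.execute("UPDATE users SET name = 'renamed' WHERE id = 42", ())?;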

Schema Changes

ALTER TABLE operations work on tables with frozen volumes. When the schema changes (ADD COLUMN, DROP COLUMN, etc.), volume data is normalized on read:

  • ADD COLUMN: New column returns NULL or the column’s DEFAULT value for volume rows. When rows are sealed or compacted, the default is materialized
  • DROP COLUMN: Dropped column is skipped during projection
  • RENAME COLUMN: Column matching is by name, so the renamed column maps correctly
  • MODIFY COLUMN: Type coercion is applied on read

No volume rebuild is needed. The next seal or compact cycle produces new segments with the updated schema.
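
For example (sketch; the table and column are illustrative):

// Volume files are not rewritten; sealed rows are normalized on read, so the new
// column reports its DEFAULT for historical rows until the next seal or compaction
db.execute("ALTER TABLE users ADD COLUMN country TEXT DEFAULT 'unknown'", ())?;
let rows = db.query("SELECT id, country FROM users", ())?; // old rows read 'unknown'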

Memory Model

Where data lives depends on the mode and table size:

  • memory://: all data stays in the hot arena; no cold segments (no persistence)
  • file:// with small tables: all data stays in the hot arena; no cold segments yet (below the seal threshold)
  • file:// with large tables: only recent rows stay in the hot buffer; sealed historical data lives in columnar cold volumes

For large tables, the hot buffer contains only rows added or modified since the last seal. Historical data stays in cold segments and is merged during scans.

Backup Snapshots

How Snapshots Work

PRAGMA SNAPSHOT creates a full backup of all tables to .bin files. This is separate from the checkpoint cycle and is intended for manual backup and disaster recovery.

Snapshots capture:

  1. All table schemas
  2. All committed data (both hot buffer and cold volume rows)
  3. Point-in-time consistency via commit sequence cutoff
  4. Index and view definitions in per-timestamp ddl-{timestamp}.bin files (BTree, Hash, Bitmap, HNSW indexes and all views)

The DDL file is critical for timestamped restore: snapshot .bin files contain only row data and schema, not index or view definitions. Timestamped restore requires the matching ddl-{timestamp}.bin and fails if it is missing. Non-timestamped restore uses current in-memory definitions as a fallback.

Snapshot Configuration

-- Number of backup snapshots to retain per table (default: 3)
PRAGMA keep_snapshots = 3;

-- Manually create a backup snapshot
PRAGMA SNAPSHOT;

-- Restore from latest backup snapshot
PRAGMA RESTORE;

-- Restore from a specific backup
PRAGMA RESTORE = '20260315-100000.000';

Backup and restore are also available from the CLI:

# Create backup
stoolap -d "file:///path/to/db" --snapshot

# Restore (requires database to open successfully)
stoolap -d "file:///path/to/db" --restore

# Restore when volumes/manifests are corrupted (cleans up first)
stoolap -d "file:///path/to/db" --reset-volumes --restore

Snapshot Files

Snapshots are stored as binary files, organized per table:

/path/to/database/
  snapshots/
    snapshot_meta.bin               # Global snapshot metadata
    ddl-20240101-120000.000.bin    # Index/view definitions for this timestamp
    manifest-20240101-120000.000.json  # Table list for this snapshot batch
    table_name/
      snapshot-20240101-120000.000.bin
    other_table/
      snapshot-20240101-120000.000.bin

Each table has its own subdirectory. Filenames include the timestamp of creation. Old snapshots beyond keep_snapshots are automatically cleaned up.

Lock File

The database lock file (db.lock) uses OS-level file locking (flock) to prevent concurrent access. The lock file is not deleted on shutdown because flock protects the inode, not the pathname. Deleting the file while the lock is held would allow a race where another process creates a new inode and acquires its own lock. The stale file on disk is harmless and is re-locked on the next open.

Recovery Process

When opening a database, Stoolap performs recovery automatically:

  1. Load Manifests + Volumes: For each table, load manifest.bin and cold volumes from volumes/<table>/
  2. Migrate Legacy Snapshots (v0.3.7 and earlier): If snapshots/ exists but volumes/ does not, automatic migration runs (see below)
  3. Replay WAL: Apply operations logged after the checkpoint LSN
    • INSERT with row_id already in a volume: skip (idempotent)
    • INSERT with tombstoned row_id: apply to hot (post-seal UPDATE)
    • UPDATE/DELETE: apply to hot buffer (creates shadow)
    • Only committed transactions are applied
  4. Rebuild Indexes: Populate hot indexes from recovered data
  5. Sync Auto-increment: Ensure counters account for cold segment max row IDs
  6. Post-recovery Seal: If hot buffer has rows from WAL replay, seal them immediately to cold volumes. This prevents query slowness from a large hot buffer. Manifests are persisted so the sealed data survives another crash.
  7. Ready for Operations: Database is now available

Recovery is transparent to the application.

Migration from v0.3.7

Databases created with Stoolap v0.3.7 or earlier used a different persistence format based on snapshot .bin files. When opening such a database for the first time with a newer version:

  1. The engine detects the snapshots/ directory and the absence of volumes/
  2. All legacy snapshot data is loaded into the hot buffer
  3. The data is immediately sealed into immutable cold volumes in volumes/<table>/
  4. The old snapshots/ directory and snapshot_meta.bin are removed
  5. On subsequent opens, only the volume-based format is used

Migration is automatic and one-time. You will see [migration] messages on stderr during this process. No user intervention is required.

Crash Safety

How each crash scenario recovers:

  • Crash during DML: WAL replay recovers the changes. No data loss.
  • Crash during checkpoint (before the manifest write): the old manifest remains and the WAL replays everything. No data loss.
  • Crash during the manifest write: the atomic rename leaves either the old or the new manifest. No data loss.
  • Crash after the manifest write, before WAL truncation: volumes plus redundant WAL entries; replay is idempotent. No data loss.
  • Crash during compaction (before the manifest write): the old volumes remain intact. No data loss.
  • Crash after the compaction manifest: the new volumes are used; the old files are orphaned and cleaned up. No data loss.

One atomic state transition: the manifest write. Everything before is preparation (discardable). Everything after is cleanup (repeatable).

Recovery Example

// Opening automatically triggers recovery if needed
let db = Database::open("file:///path/to/database")?;

// Database is ready with all committed data restored
let results = db.query("SELECT * FROM users", ())?;

Configuration Options

Connection String Parameters

file:///path/to/database?sync_mode=2&checkpoint_interval=60&compact_threshold=4&keep_snapshots=3

Supported parameters:

  • sync_mode: WAL sync mode, none, normal, or full (or 0, 1, 2). Default: normal. DSN only.
  • sync_interval_ms: minimum milliseconds between syncs in normal mode. Default: 1000. DSN only.
  • wal_flush_trigger: buffer size in bytes before a WAL flush. Default: 32768. DSN only.
  • wal_buffer_size: WAL write buffer size in bytes. Default: 65536. DSN only.
  • checkpoint_interval: seconds between checkpoints (0 = disabled). Default: 60. Changeable via PRAGMA.
  • compact_threshold: sub-target volumes per table before merging. Default: 4. Changeable via PRAGMA.
  • target_volume_rows: target rows per cold volume (minimum 65536). Default: 1048576. Changeable via PRAGMA.
  • keep_snapshots: backup snapshots to retain per table. Default: 3. Changeable via PRAGMA.
  • checkpoint_on_close: seal all hot rows on clean shutdown. Default: on. DSN only.

Legacy parameter names are accepted for backward compatibility:

  • snapshot_interval maps to checkpoint_interval
  • snapshot_compression maps to compression

PRAGMA Commands

-- Read current settings
PRAGMA sync_mode;
PRAGMA checkpoint_interval;
PRAGMA compact_threshold;
PRAGMA keep_snapshots;
PRAGMA wal_flush_trigger;

-- Modify runtime-changeable settings (WAL settings such as sync_mode and
-- wal_flush_trigger are DSN-only and can only be read here)
PRAGMA checkpoint_interval = 60;
PRAGMA compact_threshold = 4;
PRAGMA target_volume_rows = 1048576;
PRAGMA keep_snapshots = 5;

-- Create a backup snapshot
PRAGMA snapshot;

-- Run checkpoint cycle manually
PRAGMA checkpoint;

Best Practices

Durability vs Performance

Choose sync_mode based on your requirements:

  • Development / testing: 0 (None). Highest write throughput; no fsync overhead; data becomes durable at the next checkpoint.
  • General use (recommended): 1 (Normal). High write throughput; fsync at most once per second; ~1s durability gap.
  • Financial / critical data: 2 (Full). Lower write throughput; fsync on every WAL write; zero data loss on crash.

Checkpoint Frequency

  • Frequent checkpoints (low interval): Faster recovery, more disk I/O
  • Infrequent checkpoints (high interval): Slower recovery, less disk I/O

For databases with high write rates, consider shorter intervals.

Disk Space

Monitor disk usage:

  • WAL files grow until the next checkpoint
  • Old volumes are compacted when threshold is exceeded
  • Old backup snapshots are retained based on keep_snapshots

Backup Strategy

For backups:

  1. Create a manual snapshot: PRAGMA snapshot;
  2. Copy the entire database directory while the database is idle
  3. For hot backups, use filesystem snapshots (ZFS, LVM)

Directory Structure

A persistent database creates this directory structure:

/path/to/database/
  db.lock                              # Lock file for single-writer
  wal/
    wal_NNNNNN.log                     # WAL segment files
    checkpoint.meta                    # WAL truncation metadata
  volumes/
    table_name/
      vol_XXXXXXXXXXXXXXXX.vol         # Sealed cold segment
      manifest.bin                     # Segment metadata (volumes, tombstones, checkpoint LSN)
  snapshots/                           # Created by PRAGMA SNAPSHOT (backup)
    snapshot_meta.bin                   # Global snapshot metadata
    ddl-TIMESTAMP.bin                  # Index/view definitions per snapshot batch
    manifest-TIMESTAMP.json            # Table list per snapshot batch
    table_name/
      snapshot-TIMESTAMP.bin           # Per-table backup snapshot files

Error Handling

Corrupt WAL

If WAL corruption is detected during recovery:

  • Stoolap attempts to recover up to the last valid entry
  • Corrupted entries at the end are discarded
  • A warning is logged

Disk Full

If disk becomes full:

  • WAL writes will fail
  • Transactions will be rolled back
  • Free disk space before continuing

Lock Contention

Only one process can open a database directory:

  • A lock file (db.lock) prevents concurrent access
  • If a previous process crashed, the OS released its flock when that process exited, so the stale db.lock file does not block the next open

Example: Complete Configuration

use stoolap::Database;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open with custom persistence settings
    let db = Database::open(
        "file:///var/lib/myapp/data?sync_mode=2&checkpoint_interval=120&keep_snapshots=5"
    )?;

    // Fine-tune runtime-changeable settings (WAL settings such as wal_flush_trigger
    // are DSN-only and cannot be changed here)
    db.execute("PRAGMA compact_threshold = 8", ())?;

    // Your application logic...
    db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY AUTO_INCREMENT, data JSON)", ())?;

    // Force a backup snapshot before maintenance
    db.execute("PRAGMA snapshot", ())?;

    Ok(())
}