Persistence

Stoolap provides durable storage through a combination of Write-Ahead Logging (WAL) and immutable cold volumes. This architecture ensures data durability while maintaining high performance, even for tables with millions of rows.

Overview

Stoolap’s persistence layer consists of three components:

  1. Write-Ahead Log (WAL): Records all changes before they are applied to memory
  2. Cold Volumes: Immutable columnar storage for sealed historical data
  3. Backup Snapshots: Optional point-in-time backup files for disaster recovery

This approach provides:

  • Durability: Changes are persisted before acknowledgment
  • Fast Recovery: Volumes load in milliseconds instead of seconds
  • Crash Safety: WAL ensures no committed transactions are lost
  • One Invariant: For any row_id, the newest source wins. Hot overrides cold. Newer volumes override older volumes.

Enabling Persistence

To enable persistence, use a file:// connection string:

use stoolap::Database;

// In-memory only (no persistence)
let db = Database::open("memory://")?;

// With disk persistence
let db = Database::open("file:///path/to/database")?;

Command line:

# In-memory
stoolap

# With persistence
stoolap --db "file:///path/to/database"

Write-Ahead Log (WAL)

How WAL Works

  1. When a transaction commits, changes are first written to the WAL file
  2. The WAL is synced to disk (based on sync_mode)
  3. Changes are then applied to the in-memory structures
  4. Transaction is acknowledged to the client

This sequence ensures that committed transactions survive crashes.
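
The user-visible effect is that an acknowledged write survives a restart. A minimal sketch (the path and table are illustrative; with sync_mode=full the commit is fsynced before execute returns):

use stoolap::Database;

let db = Database::open("file:///tmp/wal_demo?sync_mode=full")?;
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, data TEXT)", ())?;
// By the time this returns, the change has been written and synced to the WAL
db.execute("INSERT INTO events (id, data) VALUES (1, 'hello')", ())?;
drop(db);

// Reopen: recovery (volume load plus WAL replay) restores the committed row
let db = Database::open("file:///tmp/wal_demo?sync_mode=full")?;
let rows = db.query("SELECT * FROM events", ())?;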

WAL Operations

The WAL records these operations:

  • INSERT: New row insertions
  • UPDATE: Row modifications
  • DELETE: Row deletions
  • TRUNCATE: Bulk row removal
  • CREATE TABLE: Table creation (DDL)
  • DROP TABLE: Table deletion (DDL)
  • CREATE INDEX: Index creation (DDL)
  • DROP INDEX: Index deletion (DDL)
  • ALTER TABLE: Schema modifications (ADD/DROP/RENAME/MODIFY COLUMN, RENAME TABLE)
  • CREATE VIEW / DROP VIEW: View management (DDL)

WAL Configuration

Configure WAL behavior via the connection string (these settings take effect at database open time and cannot be changed at runtime):

file:///path/to/db?sync_mode=normal&wal_flush_trigger=65536

Read current values with PRAGMA:

PRAGMA sync_mode;           -- returns current mode (0, 1, or 2)
PRAGMA wal_flush_trigger;   -- returns current buffer size

The available sync modes:

  • None (0): No fsync. The WAL is written to the OS buffer cache only. Data becomes durable at the next checkpoint cycle (every 60s by default), when it is sealed into fsynced volume files. Maximum data loss on crash: about one checkpoint interval (~60s).
  • Normal (1): Fsync the WAL at most once per second and on DDL operations. Similar to SQLite WAL mode with synchronous=NORMAL. Maximum data loss on crash: ~1s.
  • Full (2): Fsync on every WAL write. Maximum durability at the cost of write throughput. Maximum data loss on crash: none.
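
For example, to pick a mode at open time and read it back later (a sketch; the path is illustrative, and the PRAGMA only reports the value because WAL settings are fixed at open):

use stoolap::Database;

let db = Database::open("file:///var/lib/myapp/data?sync_mode=full")?;
// sync_mode was fixed when the database was opened; PRAGMA only reads it back
let mode = db.query("PRAGMA sync_mode", ())?;   // expected to report 2 (Full)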

WAL Files

WAL files are stored in the database directory:

/path/to/database/
  wal/
    wal_000001.log
    wal_000002.log
    ...

Old WAL files are automatically cleaned up after successful checkpoints when all hot data has been sealed into volumes.

Cold Volumes (Immutable Segments)

Cold volumes are the primary persistence mechanism. Each table is logically: Hot Delta + Cold Immutable Segments. Hot data lives in the MVCC B-tree (WAL-backed, mutable). Cold data lives in frozen volumes (columnar, zone maps, bloom filters, dictionary encoding, CRC32 integrity).

How Volumes Work

The background checkpoint thread periodically seals hot buffer rows into cold segments:

  1. Seal: When a table accumulates enough committed rows in hot (100K for first seal, 10K incremental), the rows are written to a columnar .vol file in volumes/<table>/
  2. Remove from hot: Sealed rows are removed from the B-tree and hot indexes
  3. Register segment: The new cold segment is registered in the segment manager
  4. Persist manifest: The manifest (volume list, tombstones, checkpoint LSN) is written to disk

On clean shutdown, a force checkpoint seals ALL remaining hot rows regardless of threshold.

When snapshot isolation transactions are active, seal uses a cutoff filter: only rows committed before the earliest snapshot’s begin_seq are sealed. Rows committed after stay in the hot buffer where MVCC visibility handles them correctly. This allows seal and compaction to proceed during long-running snapshot transactions instead of being blocked entirely.
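
A sketch of that cutoff rule (the function and names are illustrative, not the actual internals):

// Illustrative only: a committed hot row is eligible for sealing when no open
// snapshot transaction could still need to see it in the hot buffer.
fn can_seal(row_commit_seq: u64, earliest_snapshot_begin_seq: Option<u64>) -> bool {
    match earliest_snapshot_begin_seq {
        // No snapshot transactions are open: every committed row may be sealed
        None => true,
        // Otherwise, only rows committed before the earliest snapshot began
        Some(begin_seq) => row_commit_seq < begin_seq,
    }
}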

Per-Volume Skip Sets (Deduplication)

For any row_id, the newest source wins:

  1. Hot buffer rows override all cold volumes
  2. Newer cold volumes override older cold volumes
  3. Tombstones (manifest-tracked) mark cold rows deleted by DML

During scans, a cumulative skip set is built:

  1. Start with hot row_ids + committed tombstones
  2. For each volume, newest first: the volume’s scanner receives the current skip set, then that volume’s row_ids are added to the set so they shadow the same rows in older volumes
  3. Results are merged with a chain scanner

No separate deduplication step is needed.
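
A sketch of the cumulative skip set (types and names are illustrative, not the engine's actual structures):

use std::collections::HashSet;

// Illustrative only: visit sources newest first and skip any row_id already
// covered by a newer source, so the newest version of each row wins.
fn scan_with_skip_set(hot_row_ids: &[u64], tombstones: &[u64], volumes_newest_first: &[Vec<u64>]) {
    // Start with hot row_ids plus committed tombstones
    let mut skip: HashSet<u64> = hot_row_ids.iter().chain(tombstones).copied().collect();

    for volume_row_ids in volumes_newest_first {
        // This volume's scanner ignores anything a newer source already produced
        scan_volume(volume_row_ids, &skip);
        // Its own row_ids then shadow the same rows in older volumes
        skip.extend(volume_row_ids.iter().copied());
    }
}

fn scan_volume(_row_ids: &[u64], _skip: &HashSet<u64>) { /* feed into the chain scanner */ }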

Tombstones

When a cold row is deleted or updated, a versioned tombstone is recorded as a (row_id, commit_seq) pair:

  • Per-transaction pending: Tracked on the segment manager, applied at commit with the transaction’s commit sequence
  • Committed tombstones: Stored in the manifest (V6 format), persisted atomically
  • Snapshot isolation: A snapshot transaction at begin_seq=N only sees tombstones with commit_seq <= N. Newer tombstones are invisible, so the original cold row remains visible to the older snapshot. Auto-commit transactions see all tombstones (see the sketch after this list).
  • Cleared after compaction: Rows are physically removed from the merged volume
  • Cleared after seal: When sealed row_ids overlap with existing tombstones, those tombstones are removed (the new volume is authoritative)
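
A sketch of that visibility rule (illustrative names only):

// Illustrative only: whether a versioned tombstone applies to a given reader.
fn tombstone_visible(tombstone_commit_seq: u64, snapshot_begin_seq: Option<u64>) -> bool {
    match snapshot_begin_seq {
        // Auto-commit transactions see all committed tombstones
        None => true,
        // A snapshot at begin_seq = N only sees tombstones committed at or before N
        Some(n) => tombstone_commit_seq <= n,
    }
}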

Compaction

Compaction keeps volumes at a bounded target size (target_volume_rows, default 1M rows). Only volumes that need work are rewritten. At-target volumes with no tombstones are frozen and never touched.

Three categories trigger compaction:

  • Sub-target: Volumes smaller than target_volume_rows (from recent seals). Merged together when count exceeds compact_threshold.
  • Oversized: Volumes larger than 150% of target (legacy or bulk-loaded). Split into target-sized volumes.
  • Dirty: At-target volumes with tombstoned rows (DELETEd or UPDATEd). Rewritten to physically remove dead rows and merge in newer versions.

The compaction process:

  1. Select volumes by category (sub-target + oversized + dirty). Leave clean at-target volumes untouched.
  2. Iterate selected volumes newest-first, collecting live rows (dedup by row_id, newest wins)
  3. Split the output into row-group aligned volumes: the output size is target_volume_rows rounded down to the nearest multiple of 64K rows (see the sketch after this list)
  4. Write each output volume to disk
  5. Atomically register all new volumes and remove old ones (no visibility gap)
  6. Clear tombstones for row_ids in the merged volumes
  7. Persist manifest before deleting old volume files (crash safety)
  8. Cutoff-filtered during snapshot isolation: only rows committed before the earliest snapshot’s begin_seq are compacted. Post-snapshot tombstones are preserved. Volume seal_seq metadata tracks when each volume was created for compaction eligibility
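
The step-3 alignment is a simple calculation (illustrative sketch, not the engine's code):

// Output volumes hold a whole number of 64K row groups, derived from target_volume_rows.
fn output_volume_rows(target_volume_rows: u64) -> u64 {
    const ROW_GROUP: u64 = 65_536;
    (target_volume_rows / ROW_GROUP).max(1) * ROW_GROUP
}

// output_volume_rows(1_048_576) == 1_048_576  (exactly 16 row groups, the default)
// output_volume_rows(1_000_000) ==   983_040  (rounded down to 15 row groups)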

What Volumes Optimize

Frozen volumes store data column by column with multiple query acceleration techniques:

  • Startup: Loading cold data from volumes takes milliseconds instead of seconds
  • Aggregation without WHERE: SELECT SUM(x), MIN(x), MAX(x), AVG(x), COUNT(*) answers from pre-computed per-volume statistics without scanning any row data
  • Filtered aggregation (columnar pushdown): SELECT SUM(x) FROM t WHERE col > 10 evaluates predicates and accumulates aggregates directly on raw i64/f64 arrays, without constructing intermediate Value or Row objects. Dictionary-encoded text equality predicates resolve to u32 dict_id comparisons
  • Grouped aggregation: SELECT exchange, COUNT(*) FROM t GROUP BY exchange uses dictionary-indexed accumulator arrays for text columns (zero hashing in the inner loop) and FxHashMap for integer/timestamp columns
  • DISTINCT: SELECT DISTINCT exchange FROM t extracts unique values from per-volume dictionary metadata without scanning rows
  • ORDER BY PK + LIMIT: SELECT * FROM t ORDER BY id LIMIT 10 uses a k-way merge across sorted volume row_ids and the hot buffer B-tree, stopping after the requested rows
  • IN list pruning: WHERE id IN (1, 2, 3) derives min/max bounds from the IN values for zone-map pruning
  • Zone map pruning: Per-column min/max metadata per volume and per 64K row group. Volumes that cannot match the WHERE clause are skipped entirely
  • Bloom filters: Per-column bloom filters for fast equality rejection on text columns
  • Binary search: Sorted columns (like the primary key) use binary search to narrow scan ranges within a volume
  • Parallel scanning: When the parallel feature is enabled (default), multiple volumes are scanned concurrently using rayon. A threshold guard ensures small queries use the sequential path to avoid scheduling overhead
  • OFFSET skip: Large OFFSET values skip row materialization for discarded rows, only constructing Row objects for rows that will be returned
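
All of these are applied automatically to ordinary SQL; no hints are needed. For example (table and column names are illustrative):

let db = Database::open("file:///path/to/database")?;

// Answered from per-volume statistics and dictionary metadata where possible
let totals = db.query("SELECT COUNT(*), MIN(price), MAX(price) FROM trades", ())?;
let venues = db.query("SELECT DISTINCT exchange FROM trades", ())?;

// Zone maps prune volumes that cannot match the predicate
let recent = db.query("SELECT SUM(price) FROM trades WHERE id IN (1, 2, 3)", ())?;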

Volume File Layout

/path/to/database/
  volumes/
    table_name/
      vol_00064d50e5946141.vol    # Sealed cold segment
      manifest.bin                # Segment metadata (volumes, tombstones, checkpoint LSN)

Volume files include a trailing CRC32 checksum for corruption detection. Bloom filters are serialized alongside the volume data (no rebuild on load).

Checkpoint Cycle

PRAGMA CHECKPOINT (and the periodic background cycle) executes:

  1. Seal: Move hot buffer rows into new immutable .vol files (per-table seal fence ensures DML consistency)
  2. Fence: Brief exclusive check that all hot buffers are empty, advance WAL checkpoint LSN
  3. Re-record DDL: Write DDL entries after checkpoint LSN so they survive WAL truncation
  4. Persist manifests: Write manifest.bin per table (volume list, tombstones, checkpoint LSN) atomically via fsync-before-rename
  5. WAL truncate: Remove WAL entries before checkpoint LSN (only when all manifests are persisted)
  6. Compact: Merge sub-target, oversized, and tombstoned volumes into target-sized outputs. Runs inline for explicit PRAGMA CHECKPOINT, offloaded to a background thread for the periodic automatic cycle.
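
For example, after a bulk load you can run the cycle manually instead of waiting for the periodic interval (sketch):

// Seal all eligible hot rows into cold volumes, persist manifests, truncate the
// WAL, and compact inline, rather than waiting for the background cycle
db.execute("PRAGMA CHECKPOINT", ())?;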

Constraints and DML

Volume-backed tables enforce all constraints:

  • Primary key: Checked against cold segments using zone maps and binary search
  • Unique indexes: Checked against cold segments with zone map pruning, bloom filters, and per-volume hash indexes (built lazily on first lookup per column set, never invalidated since volumes are immutable)
  • UPDATE and DELETE: A tombstone is created for the cold row (deferred to commit). The new version goes to hot. The skip set handles dedup during scans
  • TRUNCATE and DROP TABLE: Cold segments and tombstones are cleared both in memory and on disk
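
From SQL these cases behave exactly like an in-memory table (sketch; table, column, and values are illustrative):

// id = 42 may live in a cold volume; the primary key check still rejects the duplicate
let duplicate = db.execute("INSERT INTO users (id, name) VALUES (42, 'again')", ());
assert!(duplicate.is_err());

// The cold row receives a tombstone at commit; the new version goes to the hot buffer
db.execute("UPDATE users SET name = 'renamed' WHERE id = 42", ())?;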

Schema Changes

ALTER TABLE operations work on tables with frozen volumes. When the schema changes (ADD COLUMN, DROP COLUMN, etc.), volume data is normalized on read:

  • ADD COLUMN: New column returns NULL or the column’s DEFAULT value for volume rows. When rows are sealed or compacted, the default is materialized
  • DROP COLUMN: Dropped column is skipped during projection
  • RENAME COLUMN: Column matching is by name, so the renamed column maps correctly
  • MODIFY COLUMN: Type coercion is applied on read

No volume rebuild is needed. The next seal or compact cycle produces new segments with the updated schema.
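
For example (sketch; the table and column are illustrative):

// Volume files are not rewritten; sealed rows are normalized on read, so the new
// column reports its DEFAULT for historical rows until the next seal or compaction
db.execute("ALTER TABLE users ADD COLUMN country TEXT DEFAULT 'unknown'", ())?;
let rows = db.query("SELECT id, country FROM users", ())?; // old rows read 'unknown'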

Memory Model

Where data lives depends on the mode and table size:

  • memory://: all data stays in the hot arena; no cold segments (no persistence)
  • file:// with small tables: all data stays in the hot arena; no cold segments yet (below the seal threshold)
  • file:// with large tables: only recent rows stay in the hot buffer; sealed historical data lives in columnar cold volumes

For large tables, the hot buffer contains only rows added or modified since the last seal. Historical data stays in cold segments and is merged during scans.

Backup Snapshots

How Snapshots Work

PRAGMA SNAPSHOT creates a full backup of all tables to .bin files. This is separate from the checkpoint cycle and is intended for manual backup and disaster recovery.

Snapshots capture:

  1. All table schemas
  2. All committed data (both hot buffer and cold volume rows)
  3. Point-in-time consistency via commit sequence cutoff
  4. Index and view definitions in per-timestamp ddl-{timestamp}.bin files (BTree, Hash, Bitmap, HNSW indexes and all views)

The DDL file is critical for timestamped restore: snapshot .bin files contain only row data and schema, not index or view definitions. Timestamped restore requires the matching ddl-{timestamp}.bin and fails if it is missing. Non-timestamped restore uses current in-memory definitions as a fallback.

Snapshot Configuration

-- Number of backup snapshots to retain per table (default: 3)
PRAGMA keep_snapshots = 3;

-- Manually create a backup snapshot
PRAGMA SNAPSHOT;

-- Restore from latest backup snapshot
PRAGMA RESTORE;

-- Restore from a specific backup
PRAGMA RESTORE = '20260315-100000.000';

Backup and restore are also available from the CLI:

# Create backup
stoolap -d "file:///path/to/db" --snapshot

# Restore (requires database to open successfully)
stoolap -d "file:///path/to/db" --restore

# Restore when volumes/manifests are corrupted (cleans up first)
stoolap -d "file:///path/to/db" --reset-volumes --restore

Snapshot Files

Snapshots are stored as binary files, organized per table:

/path/to/database/
  snapshots/
    snapshot_meta.bin               # Global snapshot metadata
    ddl-20240101-120000.000.bin    # Index/view definitions for this timestamp
    manifest-20240101-120000.000.json  # Table list for this snapshot batch
    table_name/
      snapshot-20240101-120000.000.bin
    other_table/
      snapshot-20240101-120000.000.bin

Each table has its own subdirectory. Filenames include the timestamp of creation. Old snapshots beyond keep_snapshots are automatically cleaned up.

Lock File

The database lock file (db.lock) uses OS-level file locking (flock) to prevent concurrent access. The lock file is not deleted on shutdown because flock protects the inode, not the pathname. Deleting the file while the lock is held would allow a race where another process creates a new inode and acquires its own lock. The stale file on disk is harmless and is re-locked on the next open.

Recovery Process

When opening a database, Stoolap performs recovery automatically:

  1. Load Manifests + Volumes: For each table, load manifest.bin and cold volumes from volumes/<table>/
  2. Migrate Legacy Snapshots (v0.3.7 and earlier): If snapshots/ exists but volumes/ does not, automatic migration runs (see below)
  3. Replay WAL: Apply operations logged after the checkpoint LSN
    • INSERT with row_id already in a volume: skip (idempotent)
    • INSERT with tombstoned row_id: apply to hot (post-seal UPDATE)
    • UPDATE/DELETE: apply to hot buffer (creates shadow)
    • Only committed transactions are applied
  4. Rebuild Indexes: Populate hot indexes from recovered data
  5. Sync Auto-increment: Ensure counters account for cold segment max row IDs
  6. Post-recovery Seal: If hot buffer has rows from WAL replay, seal them immediately to cold volumes. This prevents query slowness from a large hot buffer. Manifests are persisted so the sealed data survives another crash.
  7. Ready for Operations: Database is now available

Recovery is transparent to the application.

Migration from v0.3.7

Databases created with Stoolap v0.3.7 or earlier used a different persistence format based on snapshot .bin files. When opening such a database for the first time with a newer version:

  1. The engine detects the snapshots/ directory and the absence of volumes/
  2. All legacy snapshot data is loaded into the hot buffer
  3. The data is immediately sealed into immutable cold volumes in volumes/<table>/
  4. The old snapshots/ directory and snapshot_meta.bin are removed
  5. On subsequent opens, only the volume-based format is used

Migration is automatic and one-time. You will see [migration] messages on stderr during this process. No user intervention is required.

Crash Safety

How each crash scenario recovers:

  • Crash during DML: WAL replay recovers the changes. No data loss.
  • Crash during checkpoint (before the manifest write): the old manifest remains and the WAL replays everything. No data loss.
  • Crash during the manifest write: the atomic rename leaves either the old or the new manifest. No data loss.
  • Crash after the manifest write, before WAL truncation: volumes plus redundant WAL entries; replay is idempotent. No data loss.
  • Crash during compaction (before the manifest write): the old volumes remain intact. No data loss.
  • Crash after the compaction manifest: the new volumes are used; the old files are orphaned and cleaned up. No data loss.

One atomic state transition: the manifest write. Everything before is preparation (discardable). Everything after is cleanup (repeatable).

Recovery Example

// Opening automatically triggers recovery if needed
let db = Database::open("file:///path/to/database")?;

// Database is ready with all committed data restored
let results = db.query("SELECT * FROM users", ())?;

Configuration Options

Connection String Parameters

file:///path/to/database?sync_mode=2&checkpoint_interval=60&compact_threshold=4&keep_snapshots=3

Supported parameters:

  • sync_mode: WAL sync mode, none, normal, or full (or 0, 1, 2). Default: normal. DSN only.
  • sync_interval_ms: minimum milliseconds between syncs in normal mode. Default: 1000. DSN only.
  • wal_flush_trigger: buffer size in bytes before a WAL flush. Default: 32768. DSN only.
  • wal_buffer_size: WAL write buffer size in bytes. Default: 65536. DSN only.
  • checkpoint_interval: seconds between checkpoints (0 = disabled). Default: 60. Changeable via PRAGMA.
  • compact_threshold: sub-target volumes per table before merging. Default: 4. Changeable via PRAGMA.
  • target_volume_rows: target rows per cold volume (minimum 65536). Default: 1048576. Changeable via PRAGMA.
  • keep_snapshots: backup snapshots to retain per table. Default: 3. Changeable via PRAGMA.
  • checkpoint_on_close: seal all hot rows on clean shutdown. Default: on. DSN only.

Legacy parameter names are accepted for backward compatibility:

  • snapshot_interval maps to checkpoint_interval
  • snapshot_compression maps to compression

PRAGMA Commands

-- Read current settings
PRAGMA sync_mode;
PRAGMA checkpoint_interval;
PRAGMA compact_threshold;
PRAGMA keep_snapshots;
PRAGMA wal_flush_trigger;

-- Modify runtime-changeable settings (WAL settings such as sync_mode and
-- wal_flush_trigger are DSN-only and can only be read here)
PRAGMA checkpoint_interval = 60;
PRAGMA compact_threshold = 4;
PRAGMA target_volume_rows = 1048576;
PRAGMA keep_snapshots = 5;

-- Create a backup snapshot
PRAGMA snapshot;

-- Run checkpoint cycle manually
PRAGMA checkpoint;

Best Practices

Durability vs Performance

Choose sync_mode based on your requirements:

  • Development / testing: 0 (None). Highest write throughput; no fsync overhead; data becomes durable at the next checkpoint.
  • General use (recommended): 1 (Normal). High write throughput; fsync at most once per second; ~1s durability gap.
  • Financial / critical data: 2 (Full). Lower write throughput; fsync on every WAL write; zero data loss on crash.

Checkpoint Frequency

  • Frequent checkpoints (low interval): Faster recovery, more disk I/O
  • Infrequent checkpoints (high interval): Slower recovery, less disk I/O

For databases with high write rates, consider shorter intervals.

Disk Space

Monitor disk usage:

  • WAL files grow until the next checkpoint
  • Old volumes are compacted when threshold is exceeded
  • Old backup snapshots are retained based on keep_snapshots

Backup Strategy

For backups:

  1. Create a manual snapshot: PRAGMA snapshot;
  2. Copy the entire database directory while the database is idle
  3. For hot backups, use filesystem snapshots (ZFS, LVM)

Directory Structure

A persistent database creates this directory structure:

/path/to/database/
  db.lock                              # Lock file for single-writer
  wal/
    wal_NNNNNN.log                     # WAL segment files
    checkpoint.meta                    # WAL truncation metadata
  volumes/
    table_name/
      vol_XXXXXXXXXXXXXXXX.vol         # Sealed cold segment
      manifest.bin                     # Segment metadata (volumes, tombstones, checkpoint LSN)
  snapshots/                           # Created by PRAGMA SNAPSHOT (backup)
    snapshot_meta.bin                   # Global snapshot metadata
    ddl-TIMESTAMP.bin                  # Index/view definitions per snapshot batch
    manifest-TIMESTAMP.json            # Table list per snapshot batch
    table_name/
      snapshot-TIMESTAMP.bin           # Per-table backup snapshot files

Error Handling

Corrupt WAL

If WAL corruption is detected during recovery:

  • Stoolap attempts to recover up to the last valid entry
  • Corrupted entries at the end are discarded
  • A warning is logged

Disk Full

If disk becomes full:

  • WAL writes will fail
  • Transactions will be rolled back
  • Free disk space before continuing

Lock Contention

Only one process can open a database directory:

  • A lock file (db.lock) prevents concurrent access
  • If a previous process crashed, the OS released its flock when that process exited, so the stale db.lock file does not block the next open

Example: Complete Configuration

use stoolap::Database;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open with custom persistence settings
    let db = Database::open(
        "file:///var/lib/myapp/data?sync_mode=2&checkpoint_interval=120&keep_snapshots=5"
    )?;

    // Fine-tune runtime-changeable settings (WAL settings such as wal_flush_trigger
    // are DSN-only and cannot be changed here)
    db.execute("PRAGMA compact_threshold = 8", ())?;

    // Your application logic...
    db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY AUTO_INCREMENT, data JSON)", ())?;

    // Force a backup snapshot before maintenance
    db.execute("PRAGMA snapshot", ())?;

    Ok(())
}