Query Optimizer

Stoolap includes a sophisticated cost-based query optimizer that analyzes queries and chooses efficient execution plans. The optimizer uses table statistics, index information, and cost models to make intelligent decisions.

Overview

The query optimizer performs several tasks:

Cost estimation: Calculates the expected cost of different execution strategies
Access method selection: Chooses between table scans, index lookups, and primary key access
Join ordering: Determines the optimal order to join multiple tables
Join algorithm selection: Chooses Hash Join, Merge Join, or Nested Loop Join
Predicate pushdown: Moves filters close to data sources
Expression simplification: Optimizes boolean and arithmetic expressions

Cost Model

Cost Components

The optimizer considers two main cost components:

I/O Cost: Reading data from storage
- Sequential page reads (table scans)
- Random page reads (index access)
CPU Cost: Processing data
- Tuple processing
- Predicate evaluation
- Join operations

Access Methods

The optimizer chooses from these access methods:

Method	Use Case	Cost
Primary Key Lookup	WHERE pk = value	Lowest (O(1))
Index Scan	WHERE indexed_col = value	Low (O(log n))
Index Range Scan	WHERE indexed_col > value	Medium
Sequential Scan	No usable index	Highest (O(n))

Example

-- The optimizer evaluates multiple strategies
SELECT * FROM users WHERE id = 123;

-- Strategy 1: Sequential scan - O(n) rows
-- Strategy 2: PK lookup - O(1)
-- Winner: PK lookup (much lower cost)

Table Statistics

The optimizer uses statistics collected by ANALYZE to make better decisions.

ANALYZE Command

-- Collect statistics for a table
ANALYZE users;

-- Statistics are stored in system tables
SELECT * FROM _sys_table_stats WHERE table_name = 'users';
SELECT * FROM _sys_column_stats WHERE table_name = 'users';

Statistics Collected

Statistic	Description
Row count	Total number of rows in the table
Distinct count	Number of unique values per column
NULL count	Number of NULL values per column
Min/Max values	Range of values per column
Histograms	Distribution of values for selectivity estimation

Selectivity Estimation

Statistics help estimate how selective predicates are:

-- Without statistics: assume 10% selectivity
-- With statistics: use actual data distribution
SELECT * FROM orders WHERE status = 'pending';

Join Optimization

Join Ordering

For queries joining multiple tables, the optimizer uses dynamic programming to find the optimal join order:

-- The optimizer considers all possible join orders
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id;

-- Possible orders:
-- 1. (orders JOIN customers) JOIN products
-- 2. (orders JOIN products) JOIN customers
-- 3. (customers JOIN orders) JOIN products
-- etc.

Join Algorithms

The optimizer selects the best join algorithm:

Algorithm	Best For	Requirements
Hash Join	Large tables, equality joins	Memory for hash table
Merge Join	Pre-sorted inputs	Sorted data
Nested Loop	Small tables, any join condition	None

Semi-Join Optimization

For EXISTS and IN subqueries, the optimizer may use semi-join:

-- Can use semi-join optimization
SELECT * FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

Adaptive Query Execution

Stoolap implements Adaptive Query Execution (AQE), which can adjust the execution plan at runtime based on actual data characteristics.

How AQE Works

Initial Plan: Optimizer creates plan using statistics
Runtime Monitoring: Actual row counts are tracked during execution
Plan Adjustment: If actual differs significantly from estimated, the plan may be adjusted

Cardinality Feedback

The optimizer learns from query execution:

-- First execution: uses estimated cardinality
SELECT * FROM orders WHERE amount > 1000;

-- Subsequent executions: uses feedback from actual row counts
-- This improves future estimates for similar queries

Bloom Filter Propagation

For join-heavy queries, the optimizer can use Bloom filters to reduce data movement:

-- Bloom filter can pre-filter orders before join
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'US';

The filter on customers.region creates a Bloom filter that filters orders early.

Expression Simplification

The optimizer simplifies expressions when possible:

Constant Folding

-- Before: SELECT * FROM t WHERE x > 1 + 1
-- After:  SELECT * FROM t WHERE x > 2

Boolean Optimization

-- Before: SELECT * FROM t WHERE true AND x > 5
-- After:  SELECT * FROM t WHERE x > 5

-- Before: SELECT * FROM t WHERE false OR x > 5
-- After:  SELECT * FROM t WHERE x > 5

Predicate Merging

-- Before: WHERE x > 5 AND x > 10
-- After:  WHERE x > 10

-- Before: WHERE x = 5 AND x = 5
-- After:  WHERE x = 5

Viewing Query Plans

EXPLAIN

Shows the planned execution strategy:

EXPLAIN SELECT * FROM users WHERE id = 1;

Output:

plan
----
SELECT
  Columns: *
  -> PK Lookup on users
       id = 1

EXPLAIN ANALYZE

Shows the plan with actual runtime statistics:

EXPLAIN ANALYZE SELECT * FROM users WHERE status = 'active';

Output:

plan
----
SELECT (actual time=2.5ms, rows=1500)
  Columns: *
  -> Seq Scan on users (actual rows=1500)
       Filter: (status = 'active')

Best Practices

Keep Statistics Updated

Run ANALYZE after significant data changes:

-- After bulk insert
INSERT INTO orders SELECT * FROM staging_orders;
ANALYZE orders;

Create Appropriate Indexes

The optimizer can only use indexes that exist:

-- Create index for common query patterns
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);

Use EXPLAIN to Understand Plans

Before optimizing, understand current behavior:

-- Check what plan is being used
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 100;

Consider Query Patterns

Design indexes based on how queries filter and join:

-- If queries often filter by status and date together
CREATE INDEX idx_orders_status_date ON orders(status, order_date);

Cost Constants

The optimizer uses these tunable constants:

Constant	Description	Default
cpu_tuple_cost	Cost to process one row	0.01
cpu_operator_cost	Cost to evaluate one predicate	0.0025
seq_page_cost	Cost for sequential I/O	1.0
random_page_cost	Cost for random I/O	2.0
pk_lookup_cost	Cost for primary key lookup	1.0

These values are tuned for in-memory operation and may differ from disk-based databases.