HNSW vector search performance on public ANN-Benchmarks datasets
ANN Benchmarks
This page presents Stoolap’s HNSW vector search performance on a public dataset from the ANN-Benchmarks project. We use the same dataset, the same HNSW parameters, and compute exact ground truth independently, measuring the full SQL query path (parsing, planning, index lookup, result assembly) on a single CPU core. See Methodology for details.
Dataset
We use Fashion-MNIST, the same data behind fashion-mnist-784-euclidean on ANN-Benchmarks:
| Property | Value |
|---|---|
| Base vectors | 60,000 |
| Dimensions | 784 |
| Distance metric | L2 (Euclidean) |
| Query vectors | 10,000 (full test set) |
| k (neighbors returned) | 10 |
| Ground truth | Exact brute-force KNN computed independently in Rust |
Each benchmark query executes a full SQL path:
SELECT id, VEC_DISTANCE_L2(embedding, ?) AS dist
FROM vectors ORDER BY dist LIMIT 10
Methodology
Metrics
- Recall@k: Fraction of true k-nearest neighbors found (ID-set intersection). 99% recall means 9.9 out of 10 true neighbors are returned.
- QPS: Queries per second, computed as
query_count / best_run_total_time. Higher is better. - p95 / p99 latency: 95th and 99th percentile query time in milliseconds (from the best run). Lower is better.
- Speedup: HNSW throughput divided by brute-force throughput.
Differences from ann-benchmarks.com
| Aspect | ann-benchmarks.com | Stoolap benchmark |
|---|---|---|
| What is timed | Raw index query | Full SQL path (parse, plan, index lookup, result assembly) |
| Threading | Single CPU | Single CPU (RAYON_NUM_THREADS=1) |
| Runs | Best of 5 | HNSW: best of 5, brute-force: single run |
| Query count | All 10,000 | All 10,000 |
| Recall method | Distance-threshold | ID-set intersection |
| Ground truth | Pre-computed from dataset | Computed independently via exact brute-force L2 |
| Warmup | Implicit (best-of-N) | Explicit 10-query warmup + best-of-N |
| HNSW parameters | Identical | Identical (ef_construction=500, same m and ef_search grid) |
Our QPS numbers include SQL overhead (parsing, planning, result assembly) and are therefore not directly comparable to ann-benchmarks.com results, which measure raw index operations only.
Results
Scorecard
Best configuration for each recall target (single-core, 10K queries, HNSW best-of-5, brute-force single run):
| Recall target | Config | Measured recall | QPS | p95 | p99 | Speedup |
|---|---|---|---|---|---|---|
| >= 95.0% | m=8, ef=20 | 95.02% | 10,410 | 0.12 ms | 0.15 ms | 733x |
| >= 99.0% | m=24, ef=10 | 99.31% | 6,700 | 0.19 ms | 0.22 ms | 472x |
| >= 99.5% | m=16, ef=40 | 99.69% | 5,762 | 0.22 ms | 0.25 ms | 406x |
| >= 99.9% | m=36, ef=40 | 99.91% | 4,159 | 0.33 ms | 0.38 ms | 293x |
| >= 99.99% | m=64, ef=120 | 99.99% | 1,962 | 0.77 ms | 0.90 ms | 138x |
| 100% | m=48, ef=600 | 100.00% | 913 | 1.59 ms | 1.84 ms | 64x |
Brute-force baseline: 14.2 QPS, p95 = 70.96 ms (single-core, full SQL path).
Recommended Configurations
| Use case | Config | Recall | QPS | Latency (p95) |
|---|---|---|---|---|
| Low latency | m=12, ef_search=10 | 98.0% | 8,819 | 0.14 ms |
| Balanced | m=36, ef_search=40 | 99.9% | 4,159 | 0.33 ms |
| High accuracy | m=48, ef_search=200 | ~100% | 1,676 | 0.89 ms |
For most workloads, m=12 with ef_search between 10 and 40 provides excellent throughput with over 98% recall. Increase m to 36 and ef_search to 40 when 99.9% recall is needed. For perfect 100% recall, m=48 with ef_search=600 achieves 913 QPS.
Parameter Exploration
How each m value behaves across the ef_search sweep:
| m | Build time | Best QPS | Best QPS with recall >= 99.9% | 100% recall |
|---|---|---|---|---|
| 4 | 18.5 s | 13,566 QPS (ef=10, 82.4%) | n/a | n/a |
| 8 | 25.8 s | 10,410 QPS (ef=20, 95.0%) | 2,100 QPS (ef=400) | n/a |
| 12 | 31.0 s | 8,819 QPS (ef=10, 98.0%) | 3,595 QPS (ef=120) | n/a |
| 16 | 34.3 s | 7,733 QPS (ef=10, 98.8%) | 3,214 QPS (ef=120) | n/a |
| 24 | 38.1 s | 6,700 QPS (ef=10, 99.3%) | 3,478 QPS (ef=80) | n/a |
| 36 | 41.7 s | 5,621 QPS (ef=10, 99.7%) | 4,159 QPS (ef=40) | n/a |
| 48 | 45.2 s | 4,927 QPS (ef=10, 99.8%) | 3,690 QPS (ef=40) | 913 QPS (ef=600) |
| 64 | 49.2 s | 4,301 QPS (ef=10, 99.8%) | 3,244 QPS (ef=40) | 838 QPS (ef=600) |
| 96 | 56.0 s | 3,455 QPS (ef=10, 99.9%) | 2,615 QPS (ef=40) | 725 QPS (ef=600) |
Lower m values (4, 8) give the fastest queries with lower recall. The sweet spot for 99.9% recall throughput is m=36 with ef=40, achieving 4,159 QPS. For 100% recall, m=48 is optimal (913 QPS at ef=600), beating m=64 and m=96 despite lower graph connectivity.
Note: The query executor uses max(2*k, ef_search) as the effective search beam width. With k=10, ef_search values below 20 all produce an effective ef of 20. This is why ef=10 and ef=20 yield nearly the same recall for each m value.
Environment
| Item | Value |
|---|---|
| Date | 2026-02-28 |
| OS | Darwin 25.1.0 arm64 |
| CPU | Apple M4 (single core, RAYON_NUM_THREADS=1) |
| RAM | 16 GiB |
| Engine | stoolap v0.3.3 |
| Rust | rustc 1.92.0 |
| Sweep m | 4, 8, 12, 16, 24, 36, 48, 64, 96 |
| Sweep ef_search | 10, 20, 40, 80, 120, 200, 400, 600, 800 |
| ef_construction | 500 (matches ann-benchmarks.com) |
| Queries per configuration | 10,000 |
| Runs per configuration | HNSW: 5 (best-of-5), brute-force: 1 |
| Measured configurations | 81 |
Limitations
- Not directly comparable to ann-benchmarks.com: Our benchmark measures full SQL query time (including parsing and planning), not raw index operations. Our QPS numbers include overhead that pure index benchmarks avoid.
- Single machine: All measurements are from a single Apple M4 machine. Results will differ on other hardware.
- No cross-engine comparison: This report shows Stoolap’s absolute performance on a standard dataset. We have not run other engines with the same harness on identical hardware.
Reproduce
# Self-contained: downloads data, computes ground truth, runs benchmark
RAYON_NUM_THREADS=1 cargo run --release --example ann_benchmark \
--features ann-benchmark -- --sweep --runs 5 --max-queries 10000 \
--csv sweep-results.csv
See examples/ann_benchmark.rs for the full source.
References
- ANN-Benchmarks by Erik Bernhardsson et al.
- Dataset: Fashion-MNIST (same data as fashion-mnist-784-euclidean on ANN-Benchmarks)
- Stoolap benchmark source:
examples/ann_benchmark.rs - Blog post: Vector and Semantic Search in SQL