SQLRite ships with a curated benchmark suite that pits the engine against SQLite (the reference) and optionally DuckDB (analytical-slice comparator) on a fixed set of OLTP, SQL-feature-scaling, and AI-era workloads. Numbers are produced on demand with make bench; raw JSON envelopes get committed under benchmarks/results/ so trends can be diffed mechanically.
The point isn't to "win" — SQLite has 25+ years of optimization behind it, and the ratios shipped here range from healthy single-digit gaps to known-broken multi-million-x outliers that surfaced honest engine TODOs (SQLR-18 / SQLR-19 / SQLR-20 / SQLR-21). The point is to (a) baseline so future engine work has a number to move, (b) prove the differentiator workloads (HNSW, BM25, hybrid retrieval) actually deliver, and (c) ground roadmap conversations about LSM / columnar / JIT directions with evidence rather than vibes.
For the design rationale (resolved Q1–Q8) see docs/benchmarks-plan.md. For the workload-by-workload narrative numbers see benchmarks/README.md.
- Quick start
- What the suite measures
- Reading the numbers
- Headline numbers
- Engineering debts surfaced
- Reproducing a run
- Comparing two runs
- What's NOT measured (and why)
- See also
# Lean profile — SQLRite + SQLite (rusqlite-bundled, WAL+NORMAL).
# ~5 min on an M-series MBP.
make bench
# Full profile — adds the optional DuckDB driver on Group B (W7/W8/W9).
# ~30 min on the same host (DuckDB's per-row INSERT path is heavy on
# the 1M-row Group B setup).
make bench-duckdbBoth commands run cargo bench -p sqlrite-benchmarks and then aggregate criterion's per-bench JSON into a single envelope under benchmarks/results/. Open target/criterion/report/index.html afterwards for the full criterion HTML report (per-bench p50/p95/p99 distributions, regression plots, etc.).
Twelve workloads, three groups. Each one has a fixed input dataset (deterministic seed), a correctness gate that runs once before timing starts, and a fixed criterion configuration. Workloads are versioned (W1.v1, W1.v2, …) so a future shape change is an explicit, traceable bump (Q8).
| ID | Name | Shape |
|---|---|---|
| W1 | Read-by-PK | 100k-row table, 10k random WHERE id = ? probes |
| W2 | Range scan | WHERE indexed_col >= ? AND <= ?, three width buckets (100 / 1k / 10k rows) |
| W3 | Bulk insert | 100k rows in one transaction |
| W4 | Single-row insert | 1k auto-committed INSERTs against a 1k-row preloaded table |
| W5 | Mixed OLTP | 50/50 SELECT-by-PK + UPDATE-by-PK on a 100k-row table |
| W6 | Index lookup | 10k probes by secondary = ? on a unique non-PK index |
| ID | Name | Shape |
|---|---|---|
| W7 | Aggregate | SELECT SUM(v) FROM big, 1M rows |
| W8 | GROUP BY | SELECT k, COUNT(*) FROM big GROUP BY k, three cardinalities (10 / 1k / 100k) |
| W9 | INNER JOIN | per-PK probe on customers ⨝ orders, 10k×10k tables (plan target was 100k×100k — see SQLR-20) |
| ID | Name | Shape |
|---|---|---|
| W10 | Vector top-10 | 10k 384-dim vectors, cosine top-10. Brute-force + HNSW variants. |
| W11 | BM25 top-10 | 1k-doc corpus, top-10 BM25. SQLite FTS5 as comparator. |
| W12 | Hybrid retrieval | 50/50 BM25 + cosine fusion on a 1k-doc corpus. SQLRite-only. |
W11 / W12 cap at 1k docs because of an engine constraint (SQLR-21); plan target was 10k.
A few methodology notes that change how you read the table.
Q3 — SQLite is run with the tuned profile. journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY, cache_size=-65536 (64 MB). Rationale: SQLRite's WAL is mandatory + always-on, so SQLite-default (journal_mode=DELETE, synchronous=FULL, on-commit fsync) is not apples-to-apples. WAL+NORMAL matches SQLRite's commit fsync semantics. A SQLite-default secondary column is a future opt-in.
Q3 — DuckDB runs at defaults. No equivalent to SQLite's WAL+NORMAL knob; DuckDB's MVCC + commit semantics are uniform.
Parser tax — historical, addressed in SQLR-23. Pre-SQLR-23, the engine parsed SQL on every Connection::execute / Connection::prepare call. The bench driver's SQLRite path inlined ? placeholders into the SQL string, so every iteration also walked the full sqlparser AST. Several workloads' headline numbers were dominated by this overhead — W1 and W6 (sub-µs paths where parser cost was most of the per-iter time), W10 (the 384-dim bracket-array literal in ORDER BY was ~4 KB of SQL the parser walked every iteration; brute-force vs HNSW looked indistinguishable as a result).
SQLR-23 shipped:
Connection::prepare_cached— small per-connection LRU of parsed plans (default cap 16, matches rusqlite).Statement::query_with_params(&[Value])/Statement::execute_with_params(&[Value])— bind?placeholders at execute time without re-running sqlparser.Value::Vector(Vec<f32>)as a first-class bind type — the 4 KB query vector for W10 is now bound directly instead of being re-lexed every iteration. The HNSW probe optimizer still recognizes the bound shape, so the algorithmic shortcut keeps firing.
The bench harness Driver::query_one / query_all paths route through prepare_cached + the bound API. Every workload's WorkloadId.version was bumped v1 → v2 in lockstep — old JSON envelopes keep the v1 tag and stay readable, but cross-version comparisons require an explicit acknowledgment in the comparison script. The headline table below carries the v2 numbers from the post-SQLR-23 republished run (SQLR-25); the retired v1 baseline lives in the historical section underneath.
Where DuckDB is misleading. Per-PK-probe single-row OLTP queries (W9) are SQLite's home turf, not DuckDB's. The plan flags this as "apples-to-oranges"; we still publish the number because the directional comparison is informative.
Workload-shape capacity caps. Two workloads ship at smaller-than-plan dataset sizes because of engine constraints, both tracked as separate follow-ups (SQLR-20, SQLR-21). The deviations are documented inline in benchmarks/src/workloads/.
Median latency from the post-SQLR-23 pinned-host run — benchmarks/results/2026-05-08-apple-ac84d560.json, Apple M1 Pro / macOS 23.5.0, criterion defaults (3 s warm-up, 5 s measurement, 100 samples on light workloads / 10 samples on heavy ones — see the JSON envelope's per-sample samples field). Only medians here; the JSON carries 95 % CIs, mean, std-dev, ops/s.
| Workload | SQLRite | SQLite (WAL+NORMAL) | DuckDB | Notes |
|---|---|---|---|---|
| W1 read-by-PK | 3.92 µs | 2.09 µs | — | ~1.9× — gap closed by SQLR-23 (was ~4.8× in v1) |
| W2 range-100 | 24.27 ms | 66.62 µs | — | ~364× — full-scan vs index range probe |
| W2 range-1k | 26.64 ms | 649.30 µs | — | ~41× |
| W2 range-10k | 30.73 ms | 7.01 ms | — | ~4.4× — converges as scan dominates |
| W3 bulk insert (100k/txn) | 606.20 ms | 183.96 ms | — | ~3.3× — 100k INSERT plan parsed once, not per-row (was ~6.2× in v1) |
| W4 single-row insert | 6.57 ms | 11.35 µs | — | ~579× |
| W5 mixed OLTP | 58.00 ms | 9.65 µs | — | ~6,010× |
| W6 index lookup | 4.04 µs | 2.56 µs | — | ~1.6× — gap closed by SQLR-23 (was ~4.2× in v1) |
| W7 SUM (1M rows) | 103.62 ms | 31.57 ms | 478.78 µs | DuckDB ~66× faster than SQLite |
| W8 GROUP BY card-10 | 197.32 ms | 366.52 ms | 949.75 µs | DuckDB ~386× faster than SQLite |
| W8 GROUP BY card-1k | 1.380 s | 240.64 ms | 1.039 ms | DuckDB ~232× faster than SQLite |
| W8 GROUP BY card-100k | skipped | 239.72 ms | 22.93 ms | SQLRite skipped |
| W9 INNER JOIN (10k×10k) | 30.30 s | 2.16 µs | 484.97 µs | ~14M× |
| W10 vector top-10 (brute-force, 10k×384) | 120.88 ms | — | — | compute-bound; modest ~13% drop vs v1 |
| W10 vector top-10 (HNSW) | 2.40 ms | — | — | ~53× faster than v1 ⭐ — SQLR-23 + SQLR-28 unmasked the index; HNSW now ~50× faster than brute-force |
| W11 BM25 top-10 (1k docs) | 501.63 µs | 23.65 µs | — | ~21× — fts_match / bm25_score no longer re-parsed (was ~43× in v1) |
| W12 hybrid (1k docs) | 607.90 µs | — | — | RAG headline (~15% faster than v1) |
The canonical v2 run is
benchmarks/results/2026-05-08-apple-ac84d560.json. It supersedes the v1 baseline (table below) end-to-end: every workload was rerun on the same canonical Apple M1 Pro host after SQLR-23 bumpedWorkloadId.versionfromv1 → v2in lockstep (W10 →v3after SQLR-28 widened the HNSW probe to cosine + dot). Thedirty=trueflag reflects the working-tree state at run time (this doc update + the new envelope itself uncommitted); the measurements themselves only depend on the bench binary, which was built from the cleanac84d560tip. Subsequent official runs land alongside this file with their own date / host / commit.
The pre-SQLR-23 baseline from benchmarks/results/2026-05-07-apple-9ffd55a5.json, retained so the methodology shift is visible. The v1→v2 jump is not an algorithmic improvement — it's the bench-driver methodology change (per-iter inline_params → prepare_cached + bound ? parameters; Value::Vector for HNSW-eligible KNN). Cross-version comparisons (W1.v1 vs W1.v2) are flagged in the comparison script per Q8; the compare.py v1↔v2 report walks each one.
| Workload | SQLRite (v1) | SQLite (v1) | DuckDB (v1) |
|---|---|---|---|
| W1 read-by-PK | 9.87 µs | 2.05 µs | — |
| W2 range-100 | 23.99 ms | 60.50 µs | — |
| W2 range-1k | 24.92 ms | 585.21 µs | — |
| W2 range-10k | 30.15 ms | 6.24 ms | — |
| W3 bulk insert (100k/txn) | 1.029 s | 166.43 ms | — |
| W4 single-row insert | 6.76 ms | 9.78 µs | — |
| W5 mixed OLTP | 55.63 ms | 9.96 µs | — |
| W6 index lookup | 10.45 µs | 2.50 µs | — |
| W7 SUM (1M rows) | 109.47 ms | 31.14 ms | 468.74 µs |
| W8 GROUP BY card-10 | 201.80 ms | 438.09 ms | 761.40 µs |
| W8 GROUP BY card-1k | 1.372 s | 251.13 ms | 871.80 µs |
| W8 GROUP BY card-100k | skipped | 238.96 ms | 19.58 ms |
| W9 INNER JOIN (10k×10k) | 34.25 s | 2.23 µs | 699.23 µs |
| W10 brute-force | 138.66 ms | — | — |
| W10 HNSW | 126.81 ms | — | — |
| W11 BM25 top-10 (1k docs) | 1.079 ms | 25.03 µs | — |
| W12 hybrid (1k docs) | 713.53 µs | — | — |
Every bench-suite-found gap that exceeds the plan's "informational, not a gate" heuristic gets its own task. As of v1 of the suite:
| Task | Workload | Symptom | Likely root cause |
|---|---|---|---|
| SQLR-17 | (CI infra) | desktop-build apt-get hung 39 min on the 9.1 PR |
Azure-side runner / mirror flake — only act if recurs |
| SQLR-18 | W4 single-row INSERT | ~673× vs SQLite | Bottom-up B-tree rebuild on every COMMIT (CLAUDE.md "B-tree commit strategy") |
| SQLR-19 | W8 GROUP BY 100k-cardinality | ~245 s/iter (skipped by default) | Suspected Vec-backed group store — should be HashMap |
| SQLR-20 | W9 INNER JOIN | ~14M× at 10k×10k; intractable at the 100k×100k plan target | Nested-loop driver doesn't push ON predicate to inner-side index probe |
| SQLR-21 | W11 / W12 corpus cap | FTS doc-lengths sidecar must fit in one 4 KiB page (~1,360-doc cap) | Phase 8.1 — overflow chaining for posting + sidecar cells |
All five are "investigation, not a release gate" — the suite ships with the gap measured + the workaround documented inline + the task linked. Each task carries a reproducer and a sketch of the fix.
# Default — lean (SQLRite + SQLite), criterion defaults
# (3s warmup, 5s measurement, 100 samples), heavy workloads
# capped at 10 samples.
make bench
# Full — adds DuckDB on Group B
make bench-duckdb
# Sharpen estimates on heavy workloads (override the per-group
# sample_size cap):
cargo bench -p sqlrite-benchmarks --bench suite -- \
--measurement-time 30 --sample-size 30 'W3'
# Force-include SQLRite × W8/card-100k (default-skipped — SQLR-19):
SQLRITE_BENCH_W8_CARD_100K_SQLRITE=1 make benchThe aggregator picks an output filename based on host + commit short-SHA: benchmarks/results/<YYYY-MM-DD>-<host_token>-<short_sha>.json. Override with OUTPUT=path/to.json scripts/run.sh.
Local results are gitignored (benchmarks/results/.gitignore); only the pinned-host "official" runs get committed.
benchmarks/scripts/compare.py \
benchmarks/results/2026-05-07-applem1pro-aaaaaaaa.json \
benchmarks/results/2026-06-01-applem1pro-bbbbbbbb.json \
--md /tmp/diff.mdReads two JSON envelopes, matches samples by (workload, driver), computes per-workload percent change + ratio, and prints a Markdown table to stdout (or --md OUT.md). Same-version-only by Q8 — cross-version pairs land in their own "ignored" section. Cross-host pairs get a header warning; the script still runs but the numbers shouldn't be trusted as a true delta.
Pure stdlib Python — no third-party deps.
- CPU%. Noisy on a shared machine; redundant with wall-clock for single-threaded workloads.
- Concurrency curves. Engine is single-writer by design (Phase 4e). No concurrent-writer workload is meaningful until that changes.
- Network I/O. All targets are in-process. Cloudflare D1 / rqlite are explicitly out of scope per the plan's viability section — they're network-dominated and would measure latency, not engine throughput.
- libSQL. Deferred — its embedded SQL surface tracks SQLite within a few percent, so a third row-oriented OLTP comparator would mostly add noise. Worth revisiting alongside a vector-only benchmark page when sqlite-vec / libSQL native vector indexes become a useful comparison axis.
- fsync count on macOS.
/proc/self/ioexists on Linux only. On macOS the equivalent would need dtrace; out of scope for v1.
docs/benchmarks-plan.md— design rationale + the resolved Q1–Q8 decisions.benchmarks/README.md— workload-by-workload narrative.benchmarks/src/workloads/— one file per workload; methodology notes inline.benchmarks/results/— committed pinned-host runs.docs/architecture.md— engine layer map; benchmarks bind to the publicConnectionsurface only.docs/roadmap.md— broader project roadmap.