Benchmarks

SQLRite ships with a curated benchmark suite that pits the engine against SQLite (the reference) and optionally DuckDB (analytical-slice comparator) on a fixed set of OLTP, SQL-feature-scaling, and AI-era workloads. Numbers are produced on demand with make bench; raw JSON envelopes get committed under benchmarks/results/ so trends can be diffed mechanically.

The point isn't to "win" — SQLite has 25+ years of optimization behind it, and the ratios shipped here range from healthy single-digit gaps to known-broken multi-million-x outliers that surfaced honest engine TODOs (SQLR-18 / SQLR-19 / SQLR-20 / SQLR-21). The point is to (a) baseline so future engine work has a number to move, (b) prove the differentiator workloads (HNSW, BM25, hybrid retrieval) actually deliver, and (c) ground roadmap conversations about LSM / columnar / JIT directions with evidence rather than vibes.

For the design rationale (resolved Q1–Q8) see docs/benchmarks-plan.md. For the workload-by-workload narrative numbers see benchmarks/README.md.

Quick start
What the suite measures
Reading the numbers
Headline numbers
Engineering debts surfaced
Reproducing a run
Comparing two runs
What's NOT measured (and why)
See also

Quick start

# Lean profile — SQLRite + SQLite (rusqlite-bundled, WAL+NORMAL).
# ~5 min on an M-series MBP.
make bench

# Full profile — adds the optional DuckDB driver on Group B (W7/W8/W9).
# ~30 min on the same host (DuckDB's per-row INSERT path is heavy on
# the 1M-row Group B setup).
make bench-duckdb

Both commands run cargo bench -p sqlrite-benchmarks and then aggregate criterion's per-bench JSON into a single envelope under benchmarks/results/. Open target/criterion/report/index.html afterwards for the full criterion HTML report (per-bench p50/p95/p99 distributions, regression plots, etc.).

What the suite measures

Twelve workloads, three groups. Each one has a fixed input dataset (deterministic seed), a correctness gate that runs once before timing starts, and a fixed criterion configuration. Workloads are versioned (W1.v1, W1.v2, …) so a future shape change is an explicit, traceable bump (Q8).

Group A — OLTP baseline

ID	Name	Shape
W1	Read-by-PK	100k-row table, 10k random `WHERE id = ?` probes
W2	Range scan	`WHERE indexed_col >= ? AND <= ?`, three width buckets (100 / 1k / 10k rows)
W3	Bulk insert	100k rows in one transaction
W4	Single-row insert	1k auto-committed INSERTs against a 1k-row preloaded table
W5	Mixed OLTP	50/50 SELECT-by-PK + UPDATE-by-PK on a 100k-row table
W6	Index lookup	10k probes by `secondary = ?` on a unique non-PK index

Group B — SQL-feature scaling

ID	Name	Shape
W7	Aggregate	`SELECT SUM(v) FROM big`, 1M rows
W8	GROUP BY	`SELECT k, COUNT(*) FROM big GROUP BY k`, three cardinalities (10 / 1k / 100k)
W9	INNER JOIN	per-PK probe on `customers ⨝ orders`, 10k×10k tables (plan target was 100k×100k — see SQLR-20)

Group C — Differentiators

ID	Name	Shape
W10	Vector top-10	10k 384-dim vectors, cosine top-10. Brute-force + HNSW variants.
W11	BM25 top-10	1k-doc corpus, top-10 BM25. SQLite FTS5 as comparator.
W12	Hybrid retrieval	50/50 BM25 + cosine fusion on a 1k-doc corpus. SQLRite-only.

W11 / W12 cap at 1k docs because of an engine constraint (SQLR-21); plan target was 10k.

Reading the numbers

A few methodology notes that change how you read the table.

Q3 — SQLite is run with the tuned profile. journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY, cache_size=-65536 (64 MB). Rationale: SQLRite's WAL is mandatory + always-on, so SQLite-default (journal_mode=DELETE, synchronous=FULL, on-commit fsync) is not apples-to-apples. WAL+NORMAL matches SQLRite's commit fsync semantics. A SQLite-default secondary column is a future opt-in.

Q3 — DuckDB runs at defaults. No equivalent to SQLite's WAL+NORMAL knob; DuckDB's MVCC + commit semantics are uniform.

Parser tax — historical, addressed in SQLR-23. Pre-SQLR-23, the engine parsed SQL on every Connection::execute / Connection::prepare call. The bench driver's SQLRite path inlined ? placeholders into the SQL string, so every iteration also walked the full sqlparser AST. Several workloads' headline numbers were dominated by this overhead — W1 and W6 (sub-µs paths where parser cost was most of the per-iter time), W10 (the 384-dim bracket-array literal in ORDER BY was ~4 KB of SQL the parser walked every iteration; brute-force vs HNSW looked indistinguishable as a result).

SQLR-23 shipped:

Connection::prepare_cached — small per-connection LRU of parsed plans (default cap 16, matches rusqlite).
Statement::query_with_params(&[Value]) / Statement::execute_with_params(&[Value]) — bind ? placeholders at execute time without re-running sqlparser.
Value::Vector(Vec<f32>) as a first-class bind type — the 4 KB query vector for W10 is now bound directly instead of being re-lexed every iteration. The HNSW probe optimizer still recognizes the bound shape, so the algorithmic shortcut keeps firing.

The bench harness Driver::query_one / query_all paths route through prepare_cached + the bound API. Every workload's WorkloadId.version was bumped v1 → v2 in lockstep — old JSON envelopes keep the v1 tag and stay readable, but cross-version comparisons require an explicit acknowledgment in the comparison script. The headline table below carries the v2 numbers from the post-SQLR-23 republished run (SQLR-25); the retired v1 baseline lives in the historical section underneath.

Where DuckDB is misleading. Per-PK-probe single-row OLTP queries (W9) are SQLite's home turf, not DuckDB's. The plan flags this as "apples-to-oranges"; we still publish the number because the directional comparison is informative.

Workload-shape capacity caps. Two workloads ship at smaller-than-plan dataset sizes because of engine constraints, both tracked as separate follow-ups (SQLR-20, SQLR-21). The deviations are documented inline in benchmarks/src/workloads/.

Headline numbers

Median latency from the post-SQLR-23 pinned-host run — benchmarks/results/2026-05-08-apple-ac84d560.json, Apple M1 Pro / macOS 23.5.0, criterion defaults (3 s warm-up, 5 s measurement, 100 samples on light workloads / 10 samples on heavy ones — see the JSON envelope's per-sample samples field). Only medians here; the JSON carries 95 % CIs, mean, std-dev, ops/s.

Workload	SQLRite	SQLite (WAL+NORMAL)	DuckDB	Notes
W1 read-by-PK	3.92 µs	2.09 µs	—	~1.9× — gap closed by SQLR-23 (was ~4.8× in v1)
W2 range-100	24.27 ms	66.62 µs	—	~364× — full-scan vs index range probe
W2 range-1k	26.64 ms	649.30 µs	—	~41×
W2 range-10k	30.73 ms	7.01 ms	—	~4.4× — converges as scan dominates
W3 bulk insert (100k/txn)	606.20 ms	183.96 ms	—	~3.3× — 100k INSERT plan parsed once, not per-row (was ~6.2× in v1)
W4 single-row insert	6.57 ms	11.35 µs	—	~579× ⚠️ SQLR-18
W5 mixed OLTP	58.00 ms	9.65 µs	—	~6,010× ⚠️ SQLR-18
W6 index lookup	4.04 µs	2.56 µs	—	~1.6× — gap closed by SQLR-23 (was ~4.2× in v1)
W7 SUM (1M rows)	103.62 ms	31.57 ms	478.78 µs	DuckDB ~66× faster than SQLite
W8 GROUP BY card-10	197.32 ms	366.52 ms	949.75 µs	DuckDB ~386× faster than SQLite
W8 GROUP BY card-1k	1.380 s	240.64 ms	1.039 ms	DuckDB ~232× faster than SQLite
W8 GROUP BY card-100k	skipped	239.72 ms	22.93 ms	SQLRite skipped ⚠️ SQLR-19; DuckDB ~10× faster than SQLite
W9 INNER JOIN (10k×10k)	30.30 s	2.16 µs	484.97 µs	~14M× ⚠️ SQLR-20; DuckDB ~225× slower than SQLite (analytical-engine OLTP weakness)
W10 vector top-10 (brute-force, 10k×384)	120.88 ms	—	—	compute-bound; modest ~13% drop vs v1
W10 vector top-10 (HNSW)	2.40 ms	—	—	~53× faster than v1 ⭐ — SQLR-23 + SQLR-28 unmasked the index; HNSW now ~50× faster than brute-force
W11 BM25 top-10 (1k docs)	501.63 µs	23.65 µs	—	~21× — `fts_match` / `bm25_score` no longer re-parsed (was ~43× in v1)
W12 hybrid (1k docs)	607.90 µs	—	—	RAG headline (~15% faster than v1)

The canonical v2 run is benchmarks/results/2026-05-08-apple-ac84d560.json. It supersedes the v1 baseline (table below) end-to-end: every workload was rerun on the same canonical Apple M1 Pro host after SQLR-23 bumped WorkloadId.version from v1 → v2 in lockstep (W10 → v3 after SQLR-28 widened the HNSW probe to cosine + dot). The dirty=true flag reflects the working-tree state at run time (this doc update + the new envelope itself uncommitted); the measurements themselves only depend on the bench binary, which was built from the clean ac84d560 tip. Subsequent official runs land alongside this file with their own date / host / commit.

Historical (v1, retired)

The pre-SQLR-23 baseline from benchmarks/results/2026-05-07-apple-9ffd55a5.json, retained so the methodology shift is visible. The v1→v2 jump is not an algorithmic improvement — it's the bench-driver methodology change (per-iter inline_params → prepare_cached + bound ? parameters; Value::Vector for HNSW-eligible KNN). Cross-version comparisons (W1.v1 vs W1.v2) are flagged in the comparison script per Q8; the compare.py v1↔v2 report walks each one.

Workload	SQLRite (v1)	SQLite (v1)	DuckDB (v1)
W1 read-by-PK	9.87 µs	2.05 µs	—
W2 range-100	23.99 ms	60.50 µs	—
W2 range-1k	24.92 ms	585.21 µs	—
W2 range-10k	30.15 ms	6.24 ms	—
W3 bulk insert (100k/txn)	1.029 s	166.43 ms	—
W4 single-row insert	6.76 ms	9.78 µs	—
W5 mixed OLTP	55.63 ms	9.96 µs	—
W6 index lookup	10.45 µs	2.50 µs	—
W7 SUM (1M rows)	109.47 ms	31.14 ms	468.74 µs
W8 GROUP BY card-10	201.80 ms	438.09 ms	761.40 µs
W8 GROUP BY card-1k	1.372 s	251.13 ms	871.80 µs
W8 GROUP BY card-100k	skipped	238.96 ms	19.58 ms
W9 INNER JOIN (10k×10k)	34.25 s	2.23 µs	699.23 µs
W10 brute-force	138.66 ms	—	—
W10 HNSW	126.81 ms	—	—
W11 BM25 top-10 (1k docs)	1.079 ms	25.03 µs	—
W12 hybrid (1k docs)	713.53 µs	—	—

Engineering debts surfaced

Every bench-suite-found gap that exceeds the plan's "informational, not a gate" heuristic gets its own task. As of v1 of the suite:

Task	Workload	Symptom	Likely root cause
SQLR-17	(CI infra)	`desktop-build` apt-get hung 39 min on the 9.1 PR	Azure-side runner / mirror flake — only act if recurs
SQLR-18	W4 single-row INSERT	~673× vs SQLite	Bottom-up B-tree rebuild on every COMMIT (CLAUDE.md "B-tree commit strategy")
SQLR-19	W8 GROUP BY 100k-cardinality	~245 s/iter (skipped by default)	Suspected `Vec`-backed group store — should be `HashMap`
SQLR-20	W9 INNER JOIN	~14M× at 10k×10k; intractable at the 100k×100k plan target	Nested-loop driver doesn't push ON predicate to inner-side index probe
SQLR-21	W11 / W12 corpus cap	FTS doc-lengths sidecar must fit in one 4 KiB page (~1,360-doc cap)	Phase 8.1 — overflow chaining for posting + sidecar cells

All five are "investigation, not a release gate" — the suite ships with the gap measured + the workaround documented inline + the task linked. Each task carries a reproducer and a sketch of the fix.

Reproducing a run

# Default — lean (SQLRite + SQLite), criterion defaults
# (3s warmup, 5s measurement, 100 samples), heavy workloads
# capped at 10 samples.
make bench

# Full — adds DuckDB on Group B
make bench-duckdb

# Sharpen estimates on heavy workloads (override the per-group
# sample_size cap):
cargo bench -p sqlrite-benchmarks --bench suite -- \
    --measurement-time 30 --sample-size 30 'W3'

# Force-include SQLRite × W8/card-100k (default-skipped — SQLR-19):
SQLRITE_BENCH_W8_CARD_100K_SQLRITE=1 make bench

The aggregator picks an output filename based on host + commit short-SHA: benchmarks/results/<YYYY-MM-DD>-<host_token>-<short_sha>.json. Override with OUTPUT=path/to.json scripts/run.sh.

Local results are gitignored (benchmarks/results/.gitignore); only the pinned-host "official" runs get committed.

Comparing two runs

benchmarks/scripts/compare.py \
    benchmarks/results/2026-05-07-applem1pro-aaaaaaaa.json \
    benchmarks/results/2026-06-01-applem1pro-bbbbbbbb.json \
    --md /tmp/diff.md

Reads two JSON envelopes, matches samples by (workload, driver), computes per-workload percent change + ratio, and prints a Markdown table to stdout (or --md OUT.md). Same-version-only by Q8 — cross-version pairs land in their own "ignored" section. Cross-host pairs get a header warning; the script still runs but the numbers shouldn't be trusted as a true delta.

Pure stdlib Python — no third-party deps.

What's NOT measured (and why)

CPU%. Noisy on a shared machine; redundant with wall-clock for single-threaded workloads.
Concurrency curves. Engine is single-writer by design (Phase 4e). No concurrent-writer workload is meaningful until that changes.
Network I/O. All targets are in-process. Cloudflare D1 / rqlite are explicitly out of scope per the plan's viability section — they're network-dominated and would measure latency, not engine throughput.
libSQL. Deferred — its embedded SQL surface tracks SQLite within a few percent, so a third row-oriented OLTP comparator would mostly add noise. Worth revisiting alongside a vector-only benchmark page when sqlite-vec / libSQL native vector indexes become a useful comparison axis.
fsync count on macOS. /proc/self/io exists on Linux only. On macOS the equivalent would need dtrace; out of scope for v1.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmarks

Table of contents

Quick start

What the suite measures

Group A — OLTP baseline

Group B — SQL-feature scaling

Group C — Differentiators

Reading the numbers

Headline numbers

Historical (v1, retired)

Engineering debts surfaced

Reproducing a run

Comparing two runs

What's NOT measured (and why)

See also

FilesExpand file tree

benchmarks.md

Latest commit

History

benchmarks.md

File metadata and controls

Benchmarks

Table of contents

Quick start

What the suite measures

Group A — OLTP baseline

Group B — SQL-feature scaling

Group C — Differentiators

Reading the numbers

Headline numbers

Historical (v1, retired)

Engineering debts surfaced

Reproducing a run

Comparing two runs

What's NOT measured (and why)

See also