Skip to content

Latest commit

 

History

History
227 lines (163 loc) · 15.4 KB

File metadata and controls

227 lines (163 loc) · 15.4 KB

Benchmarks

SQLRite ships with a curated benchmark suite that pits the engine against SQLite (the reference) and optionally DuckDB (analytical-slice comparator) on a fixed set of OLTP, SQL-feature-scaling, and AI-era workloads. Numbers are produced on demand with make bench; raw JSON envelopes get committed under benchmarks/results/ so trends can be diffed mechanically.

The point isn't to "win" — SQLite has 25+ years of optimization behind it, and the ratios shipped here range from healthy single-digit gaps to known-broken multi-million-x outliers that surfaced honest engine TODOs (SQLR-18 / SQLR-19 / SQLR-20 / SQLR-21). The point is to (a) baseline so future engine work has a number to move, (b) prove the differentiator workloads (HNSW, BM25, hybrid retrieval) actually deliver, and (c) ground roadmap conversations about LSM / columnar / JIT directions with evidence rather than vibes.

For the design rationale (resolved Q1–Q8) see docs/benchmarks-plan.md. For the workload-by-workload narrative numbers see benchmarks/README.md.


Table of contents


Quick start

# Lean profile — SQLRite + SQLite (rusqlite-bundled, WAL+NORMAL).
# ~5 min on an M-series MBP.
make bench

# Full profile — adds the optional DuckDB driver on Group B (W7/W8/W9).
# ~30 min on the same host (DuckDB's per-row INSERT path is heavy on
# the 1M-row Group B setup).
make bench-duckdb

Both commands run cargo bench -p sqlrite-benchmarks and then aggregate criterion's per-bench JSON into a single envelope under benchmarks/results/. Open target/criterion/report/index.html afterwards for the full criterion HTML report (per-bench p50/p95/p99 distributions, regression plots, etc.).


What the suite measures

Twelve workloads, three groups. Each one has a fixed input dataset (deterministic seed), a correctness gate that runs once before timing starts, and a fixed criterion configuration. Workloads are versioned (W1.v1, W1.v2, …) so a future shape change is an explicit, traceable bump (Q8).

Group A — OLTP baseline

ID Name Shape
W1 Read-by-PK 100k-row table, 10k random WHERE id = ? probes
W2 Range scan WHERE indexed_col >= ? AND <= ?, three width buckets (100 / 1k / 10k rows)
W3 Bulk insert 100k rows in one transaction
W4 Single-row insert 1k auto-committed INSERTs against a 1k-row preloaded table
W5 Mixed OLTP 50/50 SELECT-by-PK + UPDATE-by-PK on a 100k-row table
W6 Index lookup 10k probes by secondary = ? on a unique non-PK index

Group B — SQL-feature scaling

ID Name Shape
W7 Aggregate SELECT SUM(v) FROM big, 1M rows
W8 GROUP BY SELECT k, COUNT(*) FROM big GROUP BY k, three cardinalities (10 / 1k / 100k)
W9 INNER JOIN per-PK probe on customers ⨝ orders, 10k×10k tables (plan target was 100k×100k — see SQLR-20)

Group C — Differentiators

ID Name Shape
W10 Vector top-10 10k 384-dim vectors, cosine top-10. Brute-force + HNSW variants.
W11 BM25 top-10 1k-doc corpus, top-10 BM25. SQLite FTS5 as comparator.
W12 Hybrid retrieval 50/50 BM25 + cosine fusion on a 1k-doc corpus. SQLRite-only.

W11 / W12 cap at 1k docs because of an engine constraint (SQLR-21); plan target was 10k.


Reading the numbers

A few methodology notes that change how you read the table.

Q3 — SQLite is run with the tuned profile. journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY, cache_size=-65536 (64 MB). Rationale: SQLRite's WAL is mandatory + always-on, so SQLite-default (journal_mode=DELETE, synchronous=FULL, on-commit fsync) is not apples-to-apples. WAL+NORMAL matches SQLRite's commit fsync semantics. A SQLite-default secondary column is a future opt-in.

Q3 — DuckDB runs at defaults. No equivalent to SQLite's WAL+NORMAL knob; DuckDB's MVCC + commit semantics are uniform.

Parser tax — historical, addressed in SQLR-23. Pre-SQLR-23, the engine parsed SQL on every Connection::execute / Connection::prepare call. The bench driver's SQLRite path inlined ? placeholders into the SQL string, so every iteration also walked the full sqlparser AST. Several workloads' headline numbers were dominated by this overhead — W1 and W6 (sub-µs paths where parser cost was most of the per-iter time), W10 (the 384-dim bracket-array literal in ORDER BY was ~4 KB of SQL the parser walked every iteration; brute-force vs HNSW looked indistinguishable as a result).

SQLR-23 shipped:

  • Connection::prepare_cached — small per-connection LRU of parsed plans (default cap 16, matches rusqlite).
  • Statement::query_with_params(&[Value]) / Statement::execute_with_params(&[Value]) — bind ? placeholders at execute time without re-running sqlparser.
  • Value::Vector(Vec<f32>) as a first-class bind type — the 4 KB query vector for W10 is now bound directly instead of being re-lexed every iteration. The HNSW probe optimizer still recognizes the bound shape, so the algorithmic shortcut keeps firing.

The bench harness Driver::query_one / query_all paths route through prepare_cached + the bound API. Every workload's WorkloadId.version was bumped v1 → v2 in lockstep — old JSON envelopes keep the v1 tag and stay readable, but cross-version comparisons require an explicit acknowledgment in the comparison script. The headline table below carries the v2 numbers from the post-SQLR-23 republished run (SQLR-25); the retired v1 baseline lives in the historical section underneath.

Where DuckDB is misleading. Per-PK-probe single-row OLTP queries (W9) are SQLite's home turf, not DuckDB's. The plan flags this as "apples-to-oranges"; we still publish the number because the directional comparison is informative.

Workload-shape capacity caps. Two workloads ship at smaller-than-plan dataset sizes because of engine constraints, both tracked as separate follow-ups (SQLR-20, SQLR-21). The deviations are documented inline in benchmarks/src/workloads/.


Headline numbers

Median latency from the post-SQLR-23 pinned-host run — benchmarks/results/2026-05-08-apple-ac84d560.json, Apple M1 Pro / macOS 23.5.0, criterion defaults (3 s warm-up, 5 s measurement, 100 samples on light workloads / 10 samples on heavy ones — see the JSON envelope's per-sample samples field). Only medians here; the JSON carries 95 % CIs, mean, std-dev, ops/s.

Workload SQLRite SQLite (WAL+NORMAL) DuckDB Notes
W1 read-by-PK 3.92 µs 2.09 µs ~1.9× — gap closed by SQLR-23 (was ~4.8× in v1)
W2 range-100 24.27 ms 66.62 µs ~364× — full-scan vs index range probe
W2 range-1k 26.64 ms 649.30 µs ~41×
W2 range-10k 30.73 ms 7.01 ms ~4.4× — converges as scan dominates
W3 bulk insert (100k/txn) 606.20 ms 183.96 ms ~3.3× — 100k INSERT plan parsed once, not per-row (was ~6.2× in v1)
W4 single-row insert 6.57 ms 11.35 µs ~579× ⚠️ SQLR-18
W5 mixed OLTP 58.00 ms 9.65 µs ~6,010× ⚠️ SQLR-18
W6 index lookup 4.04 µs 2.56 µs ~1.6× — gap closed by SQLR-23 (was ~4.2× in v1)
W7 SUM (1M rows) 103.62 ms 31.57 ms 478.78 µs DuckDB ~66× faster than SQLite
W8 GROUP BY card-10 197.32 ms 366.52 ms 949.75 µs DuckDB ~386× faster than SQLite
W8 GROUP BY card-1k 1.380 s 240.64 ms 1.039 ms DuckDB ~232× faster than SQLite
W8 GROUP BY card-100k skipped 239.72 ms 22.93 ms SQLRite skipped ⚠️ SQLR-19; DuckDB ~10× faster than SQLite
W9 INNER JOIN (10k×10k) 30.30 s 2.16 µs 484.97 µs ~14M× ⚠️ SQLR-20; DuckDB ~225× slower than SQLite (analytical-engine OLTP weakness)
W10 vector top-10 (brute-force, 10k×384) 120.88 ms compute-bound; modest ~13% drop vs v1
W10 vector top-10 (HNSW) 2.40 ms ~53× faster than v1 ⭐ — SQLR-23 + SQLR-28 unmasked the index; HNSW now ~50× faster than brute-force
W11 BM25 top-10 (1k docs) 501.63 µs 23.65 µs ~21× — fts_match / bm25_score no longer re-parsed (was ~43× in v1)
W12 hybrid (1k docs) 607.90 µs RAG headline (~15% faster than v1)

The canonical v2 run is benchmarks/results/2026-05-08-apple-ac84d560.json. It supersedes the v1 baseline (table below) end-to-end: every workload was rerun on the same canonical Apple M1 Pro host after SQLR-23 bumped WorkloadId.version from v1 → v2 in lockstep (W10 → v3 after SQLR-28 widened the HNSW probe to cosine + dot). The dirty=true flag reflects the working-tree state at run time (this doc update + the new envelope itself uncommitted); the measurements themselves only depend on the bench binary, which was built from the clean ac84d560 tip. Subsequent official runs land alongside this file with their own date / host / commit.

Historical (v1, retired)

The pre-SQLR-23 baseline from benchmarks/results/2026-05-07-apple-9ffd55a5.json, retained so the methodology shift is visible. The v1→v2 jump is not an algorithmic improvement — it's the bench-driver methodology change (per-iter inline_paramsprepare_cached + bound ? parameters; Value::Vector for HNSW-eligible KNN). Cross-version comparisons (W1.v1 vs W1.v2) are flagged in the comparison script per Q8; the compare.py v1↔v2 report walks each one.

Workload SQLRite (v1) SQLite (v1) DuckDB (v1)
W1 read-by-PK 9.87 µs 2.05 µs
W2 range-100 23.99 ms 60.50 µs
W2 range-1k 24.92 ms 585.21 µs
W2 range-10k 30.15 ms 6.24 ms
W3 bulk insert (100k/txn) 1.029 s 166.43 ms
W4 single-row insert 6.76 ms 9.78 µs
W5 mixed OLTP 55.63 ms 9.96 µs
W6 index lookup 10.45 µs 2.50 µs
W7 SUM (1M rows) 109.47 ms 31.14 ms 468.74 µs
W8 GROUP BY card-10 201.80 ms 438.09 ms 761.40 µs
W8 GROUP BY card-1k 1.372 s 251.13 ms 871.80 µs
W8 GROUP BY card-100k skipped 238.96 ms 19.58 ms
W9 INNER JOIN (10k×10k) 34.25 s 2.23 µs 699.23 µs
W10 brute-force 138.66 ms
W10 HNSW 126.81 ms
W11 BM25 top-10 (1k docs) 1.079 ms 25.03 µs
W12 hybrid (1k docs) 713.53 µs

Engineering debts surfaced

Every bench-suite-found gap that exceeds the plan's "informational, not a gate" heuristic gets its own task. As of v1 of the suite:

Task Workload Symptom Likely root cause
SQLR-17 (CI infra) desktop-build apt-get hung 39 min on the 9.1 PR Azure-side runner / mirror flake — only act if recurs
SQLR-18 W4 single-row INSERT ~673× vs SQLite Bottom-up B-tree rebuild on every COMMIT (CLAUDE.md "B-tree commit strategy")
SQLR-19 W8 GROUP BY 100k-cardinality ~245 s/iter (skipped by default) Suspected Vec-backed group store — should be HashMap
SQLR-20 W9 INNER JOIN ~14M× at 10k×10k; intractable at the 100k×100k plan target Nested-loop driver doesn't push ON predicate to inner-side index probe
SQLR-21 W11 / W12 corpus cap FTS doc-lengths sidecar must fit in one 4 KiB page (~1,360-doc cap) Phase 8.1 — overflow chaining for posting + sidecar cells

All five are "investigation, not a release gate" — the suite ships with the gap measured + the workaround documented inline + the task linked. Each task carries a reproducer and a sketch of the fix.


Reproducing a run

# Default — lean (SQLRite + SQLite), criterion defaults
# (3s warmup, 5s measurement, 100 samples), heavy workloads
# capped at 10 samples.
make bench

# Full — adds DuckDB on Group B
make bench-duckdb

# Sharpen estimates on heavy workloads (override the per-group
# sample_size cap):
cargo bench -p sqlrite-benchmarks --bench suite -- \
    --measurement-time 30 --sample-size 30 'W3'

# Force-include SQLRite × W8/card-100k (default-skipped — SQLR-19):
SQLRITE_BENCH_W8_CARD_100K_SQLRITE=1 make bench

The aggregator picks an output filename based on host + commit short-SHA: benchmarks/results/<YYYY-MM-DD>-<host_token>-<short_sha>.json. Override with OUTPUT=path/to.json scripts/run.sh.

Local results are gitignored (benchmarks/results/.gitignore); only the pinned-host "official" runs get committed.


Comparing two runs

benchmarks/scripts/compare.py \
    benchmarks/results/2026-05-07-applem1pro-aaaaaaaa.json \
    benchmarks/results/2026-06-01-applem1pro-bbbbbbbb.json \
    --md /tmp/diff.md

Reads two JSON envelopes, matches samples by (workload, driver), computes per-workload percent change + ratio, and prints a Markdown table to stdout (or --md OUT.md). Same-version-only by Q8 — cross-version pairs land in their own "ignored" section. Cross-host pairs get a header warning; the script still runs but the numbers shouldn't be trusted as a true delta.

Pure stdlib Python — no third-party deps.


What's NOT measured (and why)

  • CPU%. Noisy on a shared machine; redundant with wall-clock for single-threaded workloads.
  • Concurrency curves. Engine is single-writer by design (Phase 4e). No concurrent-writer workload is meaningful until that changes.
  • Network I/O. All targets are in-process. Cloudflare D1 / rqlite are explicitly out of scope per the plan's viability section — they're network-dominated and would measure latency, not engine throughput.
  • libSQL. Deferred — its embedded SQL surface tracks SQLite within a few percent, so a third row-oriented OLTP comparator would mostly add noise. Worth revisiting alongside a vector-only benchmark page when sqlite-vec / libSQL native vector indexes become a useful comparison axis.
  • fsync count on macOS. /proc/self/io exists on Linux only. On macOS the equivalent would need dtrace; out of scope for v1.

See also