
Commit 012ed22

unamedkr and claude committed
docs: throughput benchmark vs llama.cpp + session perf summary
Adds `bench/results/2026-04-15_throughput_vs_llamacpp.md` with a head-to-head table for 5 models (Phi-3.5 Q4/Q8, Qwen3.5-4B, Llama 3.2 1B/3B). Honest framing: we're 3-6× behind llama.cpp Metal but at 23-71% of llama.cpp pure-CPU on the same hardware. Documents the +58% to +141% session throughput gains and the six compounding changes that produced them (Q4_K int8 dot, Q5_K int8 dot, vdotq_s32, prefetch, 2-row ILP, P-core default). The README v3.1 update block links to the benchmark doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d0df1f5 commit 012ed22

2 files changed

Lines changed: 89 additions & 0 deletions

File tree

README.md
bench/results/2026-04-15_throughput_vs_llamacpp.md

Lines changed: 2 additions & 0 deletions
@@ -163,6 +163,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3 update — Crossing the Cliff with RLV (2026-04-14):** If the cliff is real, the fix is to stop asking one LLM call to hold a full document in working memory. **RLV (Read-Locate-Verify)** is a 5-stage pipeline — gist → locate → lookup → verify → research — where each stage stays below the ~1K-token cliff while the *document* can be arbitrarily long. On 12K-token wikitext (≈10× the cliff for Llama 3.2 3B Q4), **RLV scores 10/10** vs. 8/10 for verify-only and 1/10 for long-context-only. Key trick: BM25 + Reciprocal Rank Fusion does the locating; the LLM is only a tiebreaker. Runs on the same 16GB Mac as the 3B model — no RAG index, no embeddings. [`bench/rlv/`](bench/rlv/) · [`docs/phase3_rlv_challenge.md`](docs/phase3_rlv_challenge.md)

---

## More Features
bench/results/2026-04-15_throughput_vs_llamacpp.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# Generation throughput — quant.cpp vs llama.cpp (2026-04-15)

**Hardware**: Apple M1 Pro, 16GB, 8 P-cores + 2 E-cores
**Test**: `tg64` (generate 64 tokens at T=0), 8 threads, default CMake build
**Reproduce**:

```bash
# quant.cpp (3-run median)
./build/quant <model> -p "Once upon a time" -n 64 -T 0

# llama.cpp Metal
llama-bench -m <model> -p 0 -n 64 -t 8 -ngl 99

# llama.cpp CPU only
llama-bench -m <model> -p 0 -n 64 -t 8 -ngl 0
```

## Results

| Model | quant.cpp | llama.cpp Metal | llama.cpp CPU | vs Metal | vs CPU |
|---|---:|---:|---:|---:|---:|
| Llama-3.2-1B Q8_0 | **35.5** | 89.0 | 68.1 | 40% | **52%** |
| Phi-3.5-mini Q8_0 | **13.0** | 36.8 | 18.3 | 35% | **71%** |
| Llama-3.2-3B Q8_0 | **13.4** | 43.3 | 26.3 | 31% | 51% |
| Phi-3.5-mini Q4_K_M | **6.9** | 41.6 | 30.1 | 17% | 23% |
| Qwen3.5-4B Q4_K_M | **5.6** | 30.7 | 22.1 | 18% | 25% |

(All numbers are tokens/sec; quant.cpp is the 3-run median, llama.cpp a single run.)

## Honest reading

- **vs Metal**: llama.cpp wins decisively (3-6×). Their Metal kernels are mature; ours fall back to CPU for several model families. This is the gap to close in the v1.x roadmap.
- **vs CPU (apples-to-apples)**: we're at **23-71%** of llama.cpp's pure-CPU speed, depending on the model. Phi-3.5 Q8_0 at 71% is competitive.
- **Smaller models close the gap**: 1B Q8 at 52% vs. 3B/4B Q4_K_M at 23-25% suggests our Q4_K dispatch (raw GGUF path) is the largest remaining gap. The Q4-converted path (3B Llama, 1B Llama) is more competitive.

## Session improvements (2026-04-15)

Compared to the same hardware before this session:

| Model | Before | After | Δ |
|---|---:|---:|---:|
| Phi-3.5-mini Q4_K_M | 3.2 | 6.9 | **+115%** |
| Phi-3.5-mini Q8_0 | 5.4 | 13.0 | **+141%** |
| Qwen3.5-4B Q4_K_M | 3.5 | 5.6 | **+60%** |
| Llama-3.2-3B Q8_0 | 8.5 | 13.4 | **+58%** |

The wins came from six compounding changes:

1. **Q4_K int8 fused dot path** (`src/engine/tq_gguf_quants.c`). Previously ran
   `vfmaq_f32` over float-converted nibbles. Now quantizes the activation to int8
   once per matmul and runs `vdotq_s32` over nibbles unpacked to int8.
   Pre-computes per-block int sums for the dmin·min correction.
2. **Q5_K int8 fused dot path**. Same approach, with the 5th bit unpacked
   from the Q5_K `qh` array via `vceqq_u8` and merged in with `vorrq`.
3. **ARMv8.2 `vdotq_s32`** wherever an int8 dot is needed (Q8_0, Q4_K, Q5_K
   workers). Previously used `vmull_s8` + `vpadalq_s16` (8 MACs/op);
   `vdotq_s32` does 16 MACs/op. Gated on `__ARM_FEATURE_DOTPROD`.
4. **Weight-row prefetching** with `__builtin_prefetch`. The M1 hardware
   prefetcher does not always pick up the row-stride pattern across matmul
   iterations; explicitly prefetching the next row's first 4 cache lines
   hides the load latency.
5. **2-row inner-loop ILP** in the Q4_K worker. Two output rows share the
   same activation; pairing their dot products lets the out-of-order engine
   overlap weight loads with activation broadcasts.
6. **P-core thread default**. The M1 Pro is 8P+2E. Mixing P and E threads at
   the same priority makes the slow E threads stragglers, and total throughput
   drops. Detect the P-core count via `sysctlbyname("hw.perflevel0.physicalcpu")`.

## Other 2026-04-15 fixes

- `f0091fc` — Qwen3.5-4B DeltaNet layers were mis-detected as self-attention
  in the split-source build; the fix probes for `ssm_a` before the Phi-3
  fused-QKV path. Output went from whitespace garbage to coherent.
- `30dca7a` — Phi-3.5 Q4_K_M produced garbage under the default Metal
  build because `tq_matmul_gguf_cpu` hard-reset the force-CPU flag,
  clobbering `tq_forward`'s invariant. The fix saves and restores the flag.
- `8f5784a` — DeltaNet attn_qkv/attn_gate were dequantized Q5_K → FP32 at
  load (3GB of extra bandwidth per token). Generation verified identical
  with Q5_K kept; the default was flipped.

## Quality regression guards

```
scripts/test_models.sh — 11/11 PASS (STRICT + COHERENT + Metal-ON)
scripts/test_long_seq.sh — 6/6 PASS (500 tokens at T=0, 100% printable)
scripts/check_sync.sh — 8 sections PASS (catches future split-source drift)
scripts/check_stale.sh — binary mtime guard (catches stale-build confusion)
```
