
Commit 012ed22

unamedkr and claude committed
docs: throughput benchmark vs llama.cpp + session perf summary
Adds `bench/results/2026-04-15_throughput_vs_llamacpp.md` with a head-to-head table for 5 models (Phi-3.5 Q4/Q8, Qwen3.5-4B, Llama 3.2 1B/3B). Honest framing: we're 3-6× behind llama.cpp Metal but at 23-71% of llama.cpp pure-CPU on the same hardware. Documents the +58% to +141% session throughput gains and the six compounding changes that produced them (Q4_K int8 dot, Q5_K int8 dot, vdotq_s32, prefetch, 2-row ILP, P-core default). The README v3.1 update block links to the benchmark doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d0df1f5 commit 012ed22

2 files changed

Lines changed: 89 additions & 0 deletions

File tree

README.md
bench/results/2026-04-15_throughput_vs_llamacpp.md

Lines changed: 2 additions & 0 deletions
@@ -163,6 +163,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3 update — Crossing the Cliff with RLV (2026-04-14):** If the cliff is real, the fix is to stop asking one LLM call to hold a full document in working memory. **RLV (Read-Locate-Verify)** is a 5-stage pipeline — gist → locate → lookup → verify → research — where each stage stays below the ~1K-token cliff while the *document* can be arbitrarily long. On 12K-token wikitext (≈10× the cliff for Llama 3.2 3B Q4), **RLV scores 10/10** vs. 8/10 for verify-only and 1/10 for long-context-only. Key trick: BM25 + Reciprocal Rank Fusion does the locating; the LLM is only a tiebreaker. Runs on the same 16GB Mac as the 3B model — no RAG index, no embeddings. [`bench/rlv/`](bench/rlv/) · [`docs/phase3_rlv_challenge.md`](docs/phase3_rlv_challenge.md)

---

## More Features
bench/results/2026-04-15_throughput_vs_llamacpp.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# Generation throughput — quant.cpp vs llama.cpp (2026-04-15)

**Hardware**: Apple M1 Pro, 16GB, 8 P-cores + 2 E-cores
**Test**: `tg64` (generate 64 tokens at T=0), 8 threads, default CMake build
**Reproduce**:

```bash
# quant.cpp (3-run median)
./build/quant <model> -p "Once upon a time" -n 64 -T 0

# llama.cpp Metal
llama-bench -m <model> -p 0 -n 64 -t 8 -ngl 99

# llama.cpp CPU only
llama-bench -m <model> -p 0 -n 64 -t 8 -ngl 0
```

## Results

| Model | quant.cpp | llama.cpp Metal | llama.cpp CPU | vs Metal | vs CPU |
|---|---:|---:|---:|---:|---:|
| Llama-3.2-1B Q8_0 | **35.5** | 89.0 | 68.1 | 40% | **52%** |
| Phi-3.5-mini Q8_0 | **13.0** | 36.8 | 18.3 | 35% | **71%** |
| Llama-3.2-3B Q8_0 | **13.4** | 43.3 | 26.3 | 31% | 51% |
| Phi-3.5-mini Q4_K_M | **6.9** | 41.6 | 30.1 | 17% | 23% |
| Qwen3.5-4B Q4_K_M | **5.6** | 30.7 | 22.1 | 18% | 25% |

(All numbers are tokens/sec; quant.cpp is the 3-run median, llama.cpp a single run.)

## Honest reading

- **vs Metal**: llama.cpp wins decisively (3-6×). Their Metal kernels are mature; ours fall back to CPU for several model families. This is the gap to close in the v1.x roadmap.
- **vs CPU (apples-to-apples)**: we're at **23-71%** of llama.cpp's pure-CPU speed, depending on the model. Phi-3.5 Q8_0 at 71% is competitive.
- **Smaller models close the gap**: 1B Q8 at 52% vs. 3B/4B Q4_K_M at 23-25% suggests our Q4_K dispatch (raw GGUF path) is the largest remaining gap. The Q4-converted path (3B Llama, 1B Llama) is more competitive.

## Session improvements (2026-04-15)

Compared to the same hardware before this session:

| Model | Before | After | Δ |
|---|---:|---:|---:|
| Phi-3.5-mini Q4_K_M | 3.2 | 6.9 | **+115%** |
| Phi-3.5-mini Q8_0 | 5.4 | 13.0 | **+141%** |
| Qwen3.5-4B Q4_K_M | 3.5 | 5.6 | **+60%** |
| Llama-3.2-3B Q8_0 | 8.5 | 13.4 | **+58%** |

The wins came from six compounding changes:

1. **Q4_K int8 fused dot path** (`src/engine/tq_gguf_quants.c`). Previously ran
   `vfmaq_f32` over float-converted nibbles. Now quantizes the activation to int8
   once per matmul and runs `vdotq_s32` over nibbles unpacked to int8.
   Pre-computes per-block int sums for the dmin·min correction.
2. **Q5_K int8 fused dot path**. Same approach, with the 5th bit unpacked
   from the Q5_K `qh` array via `vceqq_u8` and merged in with `vorrq`.
3. **ARMv8.2 `vdotq_s32`** wherever an int8 dot is needed (Q8_0, Q4_K, Q5_K
   workers). Previously used `vmull_s8` + `vpadalq_s16` (8 MACs/op);
   `vdotq_s32` does 16 MACs/op. Gated on `__ARM_FEATURE_DOTPROD`.
4. **Weight-row prefetching** with `__builtin_prefetch`. The M1 hardware
   prefetcher does not always pick up the row-stride pattern across matmul
   iterations; explicitly prefetching the next row's first 4 cache lines
   hides the load latency.
5. **2-row inner-loop ILP** in the Q4_K worker. Two output rows share the
   same activation; pairing their dot products lets the out-of-order engine
   overlap weight loads with activation broadcasts.
6. **P-core thread default**. The M1 Pro is 8P+2E. Mixing P and E threads at
   the same priority makes the slow E threads stragglers, and total throughput
   drops. Detect the P-core count via `sysctlbyname("hw.perflevel0.physicalcpu")`.

## Other 2026-04-15 fixes

- `f0091fc` — Qwen3.5-4B DeltaNet layers were mis-detected as self-attention
  in the split-source build; the fix probes for `ssm_a` before the Phi-3
  fused-QKV path. Output went from whitespace garbage to coherent.
- `30dca7a` — Phi-3.5 Q4_K_M produced garbage under the default Metal
  build because `tq_matmul_gguf_cpu` hard-reset the force-CPU flag,
  clobbering `tq_forward`'s invariant. The fix saves and restores the flag.
- `8f5784a` — DeltaNet attn_qkv/attn_gate were dequantized Q5_K → FP32 at
  load (3GB of extra bandwidth per token). Generation verified identical
  with Q5_K kept; the default was flipped.

## Quality regression guards

```
scripts/test_models.sh — 11/11 PASS (STRICT + COHERENT + Metal-ON)
scripts/test_long_seq.sh — 6/6 PASS (500 tokens at T=0, 100% printable)
scripts/check_sync.sh — 8 sections PASS (catches future split-source drift)
scripts/check_stale.sh — binary mtime guard (catches stale-build confusion)
```
