# Generation throughput — quant.cpp vs llama.cpp (2026-04-15)

**Hardware**: Apple M1 Pro, 16GB, 8 P-cores + 2 E-cores
**Test**: `tg64` (generate 64 tokens at T=0), 8 threads, default CMake build
**Reproduce**:
```bash
# quant.cpp (3-run median)
./build/quant <model> -p "Once upon a time" -n 64 -T 0

# llama.cpp Metal
llama-bench -m <model> -p 0 -n 64 -t 8 -ngl 99

# llama.cpp CPU only
llama-bench -m <model> -p 0 -n 64 -t 8 -ngl 0
```

## Results

| Model | quant.cpp | llama.cpp Metal | llama.cpp CPU | vs Metal | vs CPU |
|---|---:|---:|---:|---:|---:|
| Llama-3.2-1B Q8_0 | **35.5** | 89.0 | 68.1 | 40% | **52%** |
| Phi-3.5-mini Q8_0 | **13.0** | 36.8 | 18.3 | 35% | **71%** ⭐ |
| Llama-3.2-3B Q8_0 | **13.4** | 43.3 | 26.3 | 31% | 51% |
| Phi-3.5-mini Q4_K_M | **6.9** | 41.6 | 30.1 | 17% | 23% |
| Qwen3.5-4B Q4_K_M | **5.6** | 30.7 | 22.1 | 18% | 25% |

(All numbers are tokens/sec; quant.cpp is the 3-run median, llama.cpp a single run.)

## Honest reading

- **vs Metal**: llama.cpp wins decisively (2.5-6×). Their Metal kernels are mature; ours fall back to CPU for several model families. This is the gap to close in the v1.x roadmap.
- **vs CPU (apples-to-apples)**: we're at **23-71%** of llama.cpp's pure-CPU speed depending on the model. Phi-3.5 Q8_0 at 71% is competitive.
- **The Q4_K path lags**: the Q8_0 models at 51-71% vs the Q4_K_M models at 23-25% suggest that our Q4_K dispatch (raw GGUF path) is the largest remaining gap. The Q4-converted path (3B Llama, 1B Llama) is more competitive.

## Session improvements (2026-04-15)

Compared to the same hardware before this session (tokens/sec):

| Model | Before | After | Δ |
|---|---:|---:|---:|
| Phi-3.5-mini Q4_K_M | 3.2 | 6.9 | **+115%** |
| Phi-3.5-mini Q8_0 | 5.4 | 13.0 | **+141%** |
| Qwen3.5-4B Q4_K_M | 3.5 | 5.6 | **+60%** |
| Llama-3.2-3B Q8_0 | 8.5 | 13.4 | **+58%** |

Wins came from six compounding changes:

1. **Q4_K int8 fused dot path** (`src/engine/tq_gguf_quants.c`). Was doing
   `vfmaq_f32` over float-converted nibbles. Now quantizes the activation to
   int8 once per matmul and runs `vdotq_s32` over nibbles unpacked to int8.
   Pre-computes per-block integer sums for the `dmin` × mins correction.
2. **Q5_K int8 fused dot path**. Same approach, with the 5th bit unpacked
   from the Q5_K `qh` array via `vceqq_u8` → `vorrq` to merge.
3. **ARMv8.2 `vdotq_s32`** wherever an int8 dot is needed (Q8_0, Q4_K, Q5_K
   workers). Previously used `vmull_s8` + `vpadalq_s16` (8 MACs/op);
   `vdotq_s32` does 16 MACs/op. Gated on `__ARM_FEATURE_DOTPROD`.
4. **Weight-row prefetching** with `__builtin_prefetch`. The M1 hardware
   prefetcher does not always pick up the row-stride pattern across matmul
   iterations. Explicitly prefetching the next row's first 4 cache lines
   hides the load latency.
5. **2-row inner-loop ILP** in the Q4_K worker. Two output rows share the
   same activation; pairing their dot products lets the out-of-order engine
   overlap weight loads with activation broadcasts.
6. **P-core thread default**. The M1 Pro is 8P+2E. Mixing P and E cores at
   the same priority makes the slow E threads stragglers and total
   throughput drops. Detect via `sysctlbyname("hw.perflevel0.physicalcpu")`.
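
The quantize-once-plus-integer-dot idea behind change 1 can be sketched in portable scalar C. This is a simplified illustration, not quant.cpp's code: the block layout (`block_q4_demo`, one scale and one min per 32 weights) and all names are hypothetical, whereas real Q4_K packs 256 weights with 6-bit sub-scales.

```c
#include <stdint.h>

/* Hypothetical simplified block: one 4-bit code per weight, plus a scale
 * applied to the codes and a second scale applied to the block minimum. */
typedef struct {
    float d;        /* scale for the 4-bit codes          */
    float dmin;     /* scale for the block-minimum code   */
    uint8_t mins;   /* per-block minimum code             */
    uint8_t q[16];  /* 32 weights, two 4-bit codes/byte   */
} block_q4_demo;

/* Quantize the activation to int8 once per matmul pass: x ~ scale * x8. */
static float quantize_act_i8(const float *x, int n, int8_t *x8) {
    float amax = 0.f;
    for (int i = 0; i < n; i++) {
        float a = x[i] < 0.f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    float scale = amax / 127.f;
    float inv = scale > 0.f ? 1.f / scale : 0.f;
    for (int i = 0; i < n; i++) {
        float v = x[i] * inv;
        x8[i] = (int8_t)(v + (v >= 0.f ? 0.5f : -0.5f)); /* round to nearest */
    }
    return scale;
}

/* Fused dot: integer MACs directly over the nibbles, then one float fixup.
 * With w = d*q - dmin*mins, the dot factors into
 *   scale * (d * sum(q_i * x8_i) - dmin * mins * sum(x8_i)),
 * so the activation sum gives the dmin*mins correction for free. */
static float dot_q4_demo(const block_q4_demo *b, int nblocks,
                         const int8_t *x8, float act_scale) {
    float acc = 0.f;
    for (int ib = 0; ib < nblocks; ib++, x8 += 32) {
        int32_t isum = 0, xsum = 0;
        for (int i = 0; i < 16; i++) {
            isum += (b[ib].q[i] & 0x0F) * (int32_t)x8[i];
            isum += (b[ib].q[i] >> 4)  * (int32_t)x8[i + 16];
        }
        for (int i = 0; i < 32; i++) xsum += x8[i];
        acc += act_scale * (b[ib].d * isum - b[ib].dmin * b[ib].mins * xsum);
    }
    return acc;
}
```

The inner `isum` loop is where the real kernel substitutes `vdotq_s32` over int8-unpacked nibbles.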
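
The `__ARM_FEATURE_DOTPROD` gating from change 3 looks roughly like this sketch (the helper name `dot_i8x16` is hypothetical). One `vdotq_s32` issue covers 16 MACs where the old `vmull_s8` + `vpadalq_s16` pair covered 8; the scalar branch keeps the sketch compilable off-ARM.

```c
#include <stdint.h>
#if defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
#endif

/* Signed dot product of two 16-byte vectors. */
static int32_t dot_i8x16(const int8_t *a, const int8_t *b) {
#if defined(__ARM_FEATURE_DOTPROD)
    int8x16_t va = vld1q_s8(a);
    int8x16_t vb = vld1q_s8(b);
    /* vdotq_s32: four lanes, each accumulating a 4-element int8 dot. */
    int32x4_t acc = vdotq_s32(vdupq_n_s32(0), va, vb);
    return vaddvq_s32(acc);
#else
    int32_t acc = 0;
    for (int i = 0; i < 16; i++) acc += (int32_t)a[i] * b[i];
    return acc;
#endif
}
```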
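
Changes 4 and 5 combine naturally in the worker loop. A minimal sketch, assuming an int8 weight matrix, an even row count, and a hypothetical `matvec_2row` helper (none of these are quant.cpp symbols):

```c
#include <stddef.h>
#include <stdint.h>

/* Two output rows per pass: both dot products reuse the same activation,
 * and two independent accumulator chains give the OoO engine work to
 * overlap with the weight loads. The next row pair's first 4 cache lines
 * (256 B) are prefetched up front; prefetch of a past-the-end address on
 * the last pair is a harmless hint. Assumes nrows is even. */
static void matvec_2row(const int8_t *w, size_t stride, size_t nrows,
                        const int8_t *x, size_t n, int32_t *out) {
    for (size_t r = 0; r + 1 < nrows; r += 2) {
        const int8_t *row0 = w + r * stride;
        const int8_t *row1 = row0 + stride;
        for (int l = 0; l < 4; l++)
            __builtin_prefetch(row1 + stride + 64 * l, /*read*/0, /*no-reuse*/0);
        int32_t acc0 = 0, acc1 = 0;
        for (size_t i = 0; i < n; i++) {
            acc0 += (int32_t)row0[i] * x[i];
            acc1 += (int32_t)row1[i] * x[i];
        }
        out[r]     = acc0;
        out[r + 1] = acc1;
    }
}
```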
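
For change 6, the detection can be sketched as follows; the function name and the non-Apple fallback value are assumptions, only the `sysctlbyname` key is from the notes above.

```c
#include <stddef.h>
#if defined(__APPLE__)
#include <sys/sysctl.h>
#endif

/* Default the worker pool to P-cores only: on an 8P+2E M1 Pro this returns
 * 8, keeping slow E-core stragglers out of the matmul barrier. */
static int default_nthreads(void) {
#if defined(__APPLE__)
    int ncpu = 0;
    size_t len = sizeof(ncpu);
    /* perflevel0 = performance cores on Apple silicon */
    if (sysctlbyname("hw.perflevel0.physicalcpu", &ncpu, &len, NULL, 0) == 0
        && ncpu > 0)
        return ncpu;
#endif
    return 4; /* conservative fallback when the query is unavailable */
}
```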

## Other 2026-04-15 fixes

- `f0091fc` — Qwen3.5-4B DeltaNet layers were mis-detected as self-attention
  in the split-source build; the fix probes for `ssm_a` before the Phi-3
  fused-QKV path. Output went from whitespace garbage to coherent.
- `30dca7a` — Phi-3.5 Q4_K_M produced garbage under the default Metal build
  because `tq_matmul_gguf_cpu` hard-reset the force-CPU flag, clobbering
  `tq_forward`'s invariant. Now saves and restores the flag.
- `8f5784a` — DeltaNet attn_qkv/attn_gate were dequantized Q5_K → FP32 at
  load (≈3 GB of extra bandwidth per token). Generation verified identical
  with Q5_K kept; the default is now flipped.
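
The `30dca7a` fix is the classic save-and-restore pattern in miniature. A sketch with hypothetical names standing in for the engine's flag and functions:

```c
#include <stdbool.h>

/* Hypothetical global standing in for the engine's force-CPU flag. */
static bool g_force_cpu = false;

static void matmul_cpu_path(void) { /* ... CPU kernel dispatch ... */ }

/* Save and restore instead of hard-resetting, so a caller that set the
 * flag (e.g. a forward pass pinning a layer to CPU) keeps its invariant. */
static void matmul_force_cpu(void) {
    bool saved = g_force_cpu;
    g_force_cpu = true;
    matmul_cpu_path();
    g_force_cpu = saved;  /* the bug was an unconditional `= false` here */
}
```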

## Quality regression guards

```
scripts/test_models.sh — 11/11 PASS (STRICT + COHERENT + Metal-ON)
scripts/test_long_seq.sh — 6/6 PASS (500 tokens at T=0, 100% printable)
scripts/check_sync.sh — 8 sections PASS (catches future split-source drift)
scripts/check_stale.sh — binary mtime guard (catches stale-build confusion)
```