Commit 5b525c6
perf(matmul): enable NEON int8×int8 fast path for Q8_0 (2.7-4.1x speedup)
Previously, tq_matmul_gguf's NEON int8 path required callers to
pre-quantize activations and call tq_set_preq(). But tq_set_preq was
NEVER called anywhere in the codebase, making the fast path dead code.
All Q8_0 matmuls therefore fell back to the slow float fused_dot path.
The auto-quantize path already existed but was disabled behind #if 0
with the false claim "per-call overhead > benefit". Enabled it with larger
stack buffers (16KB for in_dim up to 16384) to support modern FFN
dimensions.
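Since the diff body is not rendered below, here is a rough sketch of what
an auto-quantize fast path of this shape looks like. It is illustrative
only: Q8_0 actually stores weights in blocks of 32 with per-block scales,
and every name below (matvec_int8_autoquant, SKETCH_MAX_IN_DIM, the
per-row weight scale) is an assumption, not the actual tq_matmul_gguf code.

  /* Sketch only: single per-row weight scale instead of Q8_0's per-block
   * scales, to keep the idea visible. Assumes AArch64 NEON and
   * in_dim % 8 == 0. */
  #include <arm_neon.h>
  #include <stdint.h>
  #include <math.h>

  #define SKETCH_MAX_IN_DIM 16384          /* 16 KB int8 activation buffer */

  static void matvec_int8_autoquant(
          const int8_t *w,                 /* [out_dim][in_dim] int8 weights */
          const float  *w_scale,           /* hypothetical per-row weight scale */
          const float  *x,                 /* float activations [in_dim] */
          float *y, int in_dim, int out_dim)
  {
      int8_t xq[SKETCH_MAX_IN_DIM];        /* on-stack quantized activations */

      /* Per-call activation quantization: O(in_dim) work. */
      float amax = 0.0f;
      for (int i = 0; i < in_dim; i++) amax = fmaxf(amax, fabsf(x[i]));
      const float x_scale = amax / 127.0f;
      const float inv = x_scale > 0.0f ? 1.0f / x_scale : 0.0f;
      for (int i = 0; i < in_dim; i++) xq[i] = (int8_t)lrintf(x[i] * inv);

      /* int8×int8 NEON dot per output row: O(in_dim * out_dim) work. */
      for (int r = 0; r < out_dim; r++) {
          const int8_t *wr = w + (size_t)r * in_dim;
          int32x4_t acc = vdupq_n_s32(0);
          for (int i = 0; i < in_dim; i += 8) {
              int16x8_t prod = vmull_s8(vld1_s8(wr + i), vld1_s8(xq + i));
              acc = vpadalq_s16(acc, prod); /* widen and accumulate to s32 */
          }
          y[r] = (float)vaddvq_s32(acc) * w_scale[r] * x_scale;
      }
  }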
Phi-3.5 Q8_0 benchmark (50 tokens, 4 threads, no Metal):
- Before: 3.5 tok/s (vs llama.cpp 9.4) — 37% of baseline
- After: 9.4 tok/s (vs llama.cpp 9.4) — 100% of baseline
Gemma 4 E2B Q8_0: 5.2 → 19.7 tok/s (3.8x)
Llama 3.2 1B Q8_0: ~12 → 43.4 tok/s (3.6x)
Llama 3.2 3B Q8_0: ~4 → 16.5 tok/s (4.1x)
All 35 unit tests pass. All 7 model regression tests pass.
The "per-call overhead" concern was unfounded: O(in_dim) quantization
cost is trivially amortized by O(in_dim × out_dim) matmul work when
out_dim >= 256 (true for all practical layers).
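To make the amortization concrete (dimensions chosen for illustration,
not taken from any profiled layer), consider a hypothetical
in_dim = 4096, out_dim = 4096 matmul:

  quantization:  4096 scale-and-round ops
  matmul:        4096 × 4096 ≈ 16.8M int8 multiply-accumulates
  overhead:      4096 / (4096 × 4096) = 1/4096 ≈ 0.02%

Even at the out_dim >= 256 floor, the per-call quantization cost is
bounded by 1/256 ≈ 0.4% of the matmul work.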
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 file changed: 17 additions & 7 deletions
(Diff body not rendered; file lines 2306-2312 were replaced by new lines 2306-2322.)