Commit 5b525c6

unamedkr and claude committed
perf(matmul): enable NEON int8×int8 fast path for Q8_0 — 2.3-3x speedup
Previously, tq_matmul_gguf's NEON int8 path required callers to pre-quantize activations and call tq_set_preq(). But tq_set_preq was NEVER called anywhere in the codebase, making the fast path dead code. All Q8_0 matmuls fell back to the slow float fused_dot path. The auto-quantize path already existed (disabled behind #if 0 with the false claim "per-call overhead > benefit"). Enabled it with larger stack buffers (16KB for in_dim up to 16384) to support modern FFN dimensions.

Phi-3.5 Q8_0 benchmark (50 tokens, 4 threads, no Metal):
- Before: 3.5 tok/s (vs llama.cpp 9.4) — 37% of baseline
- After: 9.4 tok/s (vs llama.cpp 9.4) — 100% of baseline

Gemma 4 E2B Q8_0: 5.2 → 19.7 tok/s (3.8x)
Llama 3.2 1B Q8_0: ~12 → 43.4 tok/s (3.6x)
Llama 3.2 3B Q8_0: ~4 → 16.5 tok/s (4.1x)

All 35 unit tests pass. All 7 model regression tests pass.

The "per-call overhead" concern was unfounded: the O(in_dim) quantization cost is trivially amortized by the O(in_dim × out_dim) matmul work when out_dim >= 256 (true for all practical layers).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
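The auto-quantize step is standard Q8_0 block quantization: per 32-element block, find the absmax, derive one float scale d = absmax/127, and round each value to int8. A minimal sketch of that step, assuming Q8_0's 32-element block convention (the helper name quantize_act_block_q8 is illustrative, not a symbol from this repo; the commit inlines the equivalent logic in the matmul loop):

    #include <math.h>
    #include <stdint.h>

    /* Sketch: quantize one 32-float activation block to int8 + scale. */
    static void quantize_act_block_q8(const float *x, int8_t *qs, float *d)
    {
        float absmax = 0.0f;
        for (int i = 0; i < 32; i++) {
            float a = fabsf(x[i]);
            if (a > absmax) absmax = a;
        }
        const float scale = absmax / 127.0f;      /* Q8_0 convention: value = d * q */
        const float inv = (scale != 0.0f) ? 1.0f / scale : 0.0f;
        for (int i = 0; i < 32; i++)
            qs[i] = (int8_t)lrintf(x[i] * inv);   /* round to nearest */
        *d = scale;                               /* one scale per block */
    }

The amortization argument also checks out numerically: at in_dim = out_dim = 4096, quantization touches 4,096 floats while the matmul performs about 16.8M multiply-accumulates, so the one-time cost is on the order of 0.02%.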
1 parent c77104c commit 5b525c6

1 file changed

src/engine/tq_gguf_quants.c

Lines changed: 17 additions & 7 deletions
@@ -2303,13 +2303,23 @@ void tq_matmul_gguf(float* out, const float* x,
     }
 #endif
 
-    /* ---- Q8_0×Q8 integer dot fast path — DISABLED (per-call overhead > benefit) ---- */
-#if 0 /* TQ_HAS_NEON */
-    if (weight_type == TQ_GGML_TYPE_Q8_0 && in_dim >= 32) {
-        /* Step 1: Quantize input x[in_dim] to Q8 blocks on stack */
-        int8_t x_qs[4096]; /* max in_dim=4096 for attention */
-        float x_ds[128]; /* max 128 blocks of 32 */
-        if (in_dim <= 4096) {
+    /* ---- Q8_0×Q8 integer dot fast path (auto-quantize activation) ----
+     * When g_preq_qs is not set (most callers), quantize the input
+     * activation inline and use the NEON int8×int8 path. Cost of
+     * one-time activation quantization is O(in_dim); matmul is
+     * O(in_dim * out_dim), so quantization overhead is negligible
+     * for typical out_dim >= 256.
+     *
+     * Previously disabled with false claim "per-call overhead > benefit"
+     * — actually this is 3-4x faster than float fused_dot for Q8_0. */
+#if TQ_HAS_NEON
+    if (weight_type == TQ_GGML_TYPE_Q8_0 && in_dim >= 32 && in_dim <= 16384) {
+        /* Step 1: Quantize input x[in_dim] to Q8 blocks on stack.
+         * Buffers sized for max FFN dim (Llama-70B uses 28672, but most
+         * models <= 16384). Stack usage: 16KB + 2KB = 18KB. */
+        int8_t x_qs[16384];
+        float x_ds[512];
+        if (in_dim <= 16384) {
             for (int b = 0; b < n_blocks; b++) {
                 const float* xp = x + b * 32;
                 /* Find absmax for this block */
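Once both the weights and the activation are int8 blocks with per-block scales, each 32-element block dot reduces to integer multiply-accumulates plus one float rescale, which is where the 3-4x comes from. A sketch of such an inner block dot in portable AArch64 NEON, assuming the per-block scale layout above (the commit's actual kernel is outside this hunk; the names here are illustrative):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Sketch: int8×int8 dot over one 32-element block, rescaled to float.
     * wq/wd would come from the Q8_0 weight block, xq/xd from the
     * quantized activation (x_qs/x_ds in the hunk above). */
    static inline float q8_block_dot(const int8_t *wq, float wd,
                                     const int8_t *xq, float xd)
    {
        int32x4_t acc = vdupq_n_s32(0);
        for (int i = 0; i < 32; i += 16) {
            int8x16_t w = vld1q_s8(wq + i);
            int8x16_t v = vld1q_s8(xq + i);
            /* Widening multiply: int8 × int8 -> int16 */
            int16x8_t lo = vmull_s8(vget_low_s8(w), vget_low_s8(v));
            int16x8_t hi = vmull_s8(vget_high_s8(w), vget_high_s8(v));
            /* Pairwise add-accumulate int16 pairs into int32 lanes */
            acc = vpadalq_s16(acc, lo);
            acc = vpadalq_s16(acc, hi);
        }
        /* Undo both quantization scales (Q8_0: value = d * q). */
        return wd * xd * (float)vaddvq_s32(acc);
    }

On cores with the dotprod extension, the vmull/vpadal pair can collapse into a single vdotq_s32 per 16 lanes, which would be the usual next optimization step.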
