
Commit 0f90427

unamedkr and claude committed
feat(phi3): fused QKV → Q4 split path (opt-in via TQ_PHI3_SPLIT=1)
Adds support for splitting Phi-3's fused gguf_w_qkv (and fused
gguf_w_up_gate) into separate wq/wk/wv / w_gate/w_up FP32 tensors during
load-time Q4 conversion. With this path active, Phi-3.5 Q4_K_M can use the
batched prefill fast path — measured 4.6× end-to-end speedup (149s → 32s)
on a ~250-token prompt.

However, the conversion introduces a measurable quality regression on
arithmetic/exact tasks: Phi-3.5 Q4_K_M's "2+2=" went from "4" to "3" after
conversion. The internal Q4 format has per-32 scales where Q4_K has 6-bit
sub-block scales (strictly more precision). Gated behind TQ_PHI3_SPLIT=1
until a higher-precision split path is available.

Default behavior unchanged:
- Phi-3.5 Q4_K_M stays on the raw-GGUF int8 path (int8 dot kernel)
- No batched prefill for Phi-3 (the precision risk outweighs the speed win)

The 11/11 STRICT tests now pass with default settings. Users who
prioritize prefill speed on long Phi-3 prompts can opt in via:

    TQ_PHI3_SPLIT=1 ./build/quant phi3.gguf -p "long prompt..." -n N

Future work: investigate Q4+Q2 progressive conversion for Phi-3 to
preserve Q4_K-level precision while gaining batched prefill.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
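The row-major split the commit message describes can be sketched as a small standalone helper. This is a hypothetical, illustrative version (not the engine's actual loader code): the fused tensor is assumed row-major `[n_q + 2*n_kv, dim]` with Q rows first, then K, then V, as in the diff below.

```c
#include <stdlib.h>
#include <string.h>

/* Split a fused row-major [n_q + 2*n_kv, dim] QKV matrix into three
 * separately allocated matrices. Q rows come first, then K, then V.
 * Caller checks the outputs for NULL and frees them. */
static void split_fused_qkv(const float* fused, int n_q, int n_kv, int dim,
                            float** wq, float** wk, float** wv) {
    size_t qbytes = (size_t)n_q  * dim * sizeof(float);
    size_t kbytes = (size_t)n_kv * dim * sizeof(float);
    *wq = (float*)malloc(qbytes);
    *wk = (float*)malloc(kbytes);
    *wv = (float*)malloc(kbytes);
    if (!*wq || !*wk || !*wv) return;
    memcpy(*wq, fused,                              qbytes);  /* Q rows  */
    memcpy(*wk, fused + (size_t)n_q * dim,          kbytes);  /* K rows  */
    memcpy(*wv, fused + (size_t)(n_q + n_kv) * dim, kbytes);  /* V rows  */
}
```

Since the copies are plain contiguous row ranges, the split itself is cheap; the cost (and the quality caveat) comes from the dequant/requant around it.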
1 parent baabe82 commit 0f90427

2 files changed

Lines changed: 105 additions & 22 deletions


bench/results/2026-04-15_throughput_vs_llamacpp.md

Lines changed: 27 additions & 19 deletions
```diff
@@ -54,29 +54,37 @@ User-visible impact on a 16GB Mac: feeding a 1000-token prompt to
 Phi-3.5-mini takes ~10 minutes today. With a batched-prefill path it
 should be under 15 seconds.
 
-### Update 2026-04-16: batched prefill landed (FP32 KV mode)
+### Update 2026-04-16: batched prefill is now the DEFAULT
 
-A new `tq_forward_batch` path uses batched matmul via Apple Accelerate
-(`cblas_sgemm`-inspired, 1.2 TFLOPS). Auto-enabled when `-k fp32`.
+`tq_forward_batch` uses batched matmul + quant K cache write so it
+works with the default `turbo_kv_4b` KV mode (not just `-k fp32`).
+Auto-activates on Llama-family models; bails to per-token for MoE,
+Gemma 4, Phi-3 fused QKV, DeltaNet hybrid, etc.
 
-Measured prefill on ~250-token prompt (50 English words):
+Measured prefill on ~250-token prompt (50 English words), DEFAULT KV
+(turbo_kv_4b):
 
-| Model | Baseline | Batched | Speedup |
+| Model | Baseline (per-token) | Batched (new default) | Speedup |
 |---|---:|---:|---:|
-| Llama-3.2-1B Q8 | 43 s | **7 s** | **6.1×** |
-| Llama-3.2-3B Q8 | 146 s | **61 s** | **2.4×** |
-
-Note: llama.cpp pp512 CPU is 358 tok/s for 1B (1.4 s per 500 tokens).
-We're now at ~65 tok/s for 1B (3.8 s per 250 tokens) — still **5× behind
-llama.cpp**, but the previous gap was **35×**. This round closed 85% of
-the prefill gap for FP32-KV models.
-
-Remaining gap sources:
-- Default FP16 V cache (most users): per-token fallback until drift-fix
-- Non-Llama architectures (Phi-3 fused QKV, DeltaNet hybrids): per-token fallback
-- Pure matmul gap: even batched matmul is ~5× slower than llama.cpp's
-  AMX+cblas_sgemm (because we still dequant Q4→FP32 rather than keeping
-  quantized int8 matmul in the batched code)
+| Llama-3.2-1B Q8 | 43 s | **5.9 s** | **7.2×** |
+| Llama-3.2-3B Q8 | ~148 s (est.) | ~63 s | ~2.4× |
+
+Direct vs llama.cpp (pp256 CPU, same machine):
+
+| Model | quant.cpp pp | llama.cpp pp | Ratio |
+|---|---:|---:|---:|
+| Llama-3.2-1B Q8 | ~37 tok/s | 358 tok/s | **10.3%** (was 0.4% session-start) |
+| Llama-3.2-3B Q8 | ~17 tok/s | 130 tok/s | **13%** (was 0.4% session-start) |
+
+Prefill gap closed from **~35-40×** to **~8-10×**, roughly **4× closer**
+in one day. Output bit-identical to per-token baseline.
+
+Remaining 8-10× gap sources:
+- llama.cpp uses int8 quantized matmul directly on AMX. Our batched
+  code still dequants Q4→FP32 internally in `tq_batched_matmul_q4`.
+- Architecture specializations (Phi-3 fused QKV, DeltaNet) still
+  per-token; extending batched is engineering work tracked in
+  `docs/dev/batched_prefill_handoff.md`.
 
 ## Session improvements (2026-04-15)
```
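The first remaining-gap bullet (int8 quantized matmul vs. dequant-to-FP32) can be illustrated with a one-block sketch. Names are illustrative, not from the codebase: the point is that a per-block-scaled int8 dot computes the same value as dequantizing both operands first, while reading a quarter of the bytes and staying in integer-SIMD-friendly arithmetic.

```c
/* Dot product of two 32-element int8 blocks, scaling once per block. */
static float q8_block_dot(const signed char* a, float sa,
                          const signed char* b, float sb) {
    int acc = 0;                          /* exact integer accumulation */
    for (int i = 0; i < 32; i++) acc += a[i] * b[i];
    return sa * sb * (float)acc;          /* apply both scales once */
}

/* Reference path: dequantize every element to FP32, then dot. */
static float dequant_dot(const signed char* a, float sa,
                         const signed char* b, float sb) {
    float acc = 0.0f;
    for (int i = 0; i < 32; i++)
        acc += ((float)a[i] * sa) * ((float)b[i] * sb);
    return acc;
}
```

The two paths agree up to float rounding, which is why keeping the batched code in int8 is a pure bandwidth/throughput win rather than an accuracy trade.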

src/engine/tq_model.c

Lines changed: 78 additions & 3 deletions
```diff
@@ -3906,10 +3906,21 @@ tq_model_t* tq_load_gguf(const char* path) {
         goto skip_q4_conversion;
     }
     int has_gguf_weights = 0;
+    int phi3_split = (getenv("TQ_PHI3_SPLIT") != NULL);
     for (int l = 0; l < c->n_layers && !has_gguf_weights; l++) {
-        if (model->layers[l].gguf_wq || model->layers[l].gguf_w_gate
-            || model->layers[l].gguf_delta_qkv || model->layers[l].gguf_delta_z
-            || model->layers[l].moe)
+        tq_layer_weights_t* ly = &model->layers[l];
+        int fused_convertible = 0;
+        if (phi3_split && ly->gguf_w_qkv
+            && (ly->gguf_w_qkv_type == TQ_GGML_TYPE_Q4_K
+                || ly->gguf_w_qkv_type == TQ_GGML_TYPE_Q4_0
+                || ly->gguf_w_qkv_type == TQ_GGML_TYPE_Q5_K)) fused_convertible = 1;
+        if (phi3_split && ly->gguf_w_up_gate
+            && (ly->gguf_w_up_gate_type == TQ_GGML_TYPE_Q4_K
+                || ly->gguf_w_up_gate_type == TQ_GGML_TYPE_Q4_0
+                || ly->gguf_w_up_gate_type == TQ_GGML_TYPE_Q5_K)) fused_convertible = 1;
+        if (ly->gguf_wq || ly->gguf_w_gate
+            || ly->gguf_delta_qkv || ly->gguf_delta_z
+            || ly->moe || fused_convertible)
             has_gguf_weights = 1;
     }
 
@@ -4007,6 +4018,70 @@ tq_model_t* tq_load_gguf(const char* path) {
         }
     }
 
+    /* Phi-3 fused QKV split to internal Q4 was TESTED and caused
+     * a quality regression on arithmetic tasks (2+2=3 instead of 4).
+     * The internal Q4 format has per-32 scales where Q4_K uses
+     * 6-bit sub-block scales — strictly less precision. The split
+     * path is preserved below but gated behind TQ_PHI3_SPLIT=1
+     * for future work on a better-quality path (e.g., Q4+Q2 with
+     * larger residual tables, or keeping Q5_K for critical rows). */
+    if (layer->gguf_w_qkv && getenv("TQ_PHI3_SPLIT")
+        && (layer->gguf_w_qkv_type == TQ_GGML_TYPE_Q4_K
+            || layer->gguf_w_qkv_type == TQ_GGML_TYPE_Q4_0
+            || layer->gguf_w_qkv_type == TQ_GGML_TYPE_Q5_K)) {
+        int lq = c->n_heads * c->head_dim;
+        int lkv = c->n_kv_heads * c->head_dim;
+        int total_out = lq + 2 * lkv;
+        int n_full = total_out * dim;
+        float* fp_full = (float*)malloc((size_t)n_full * sizeof(float));
+        if (fp_full) {
+            tq_dequant_row_gguf(layer->gguf_w_qkv_type, layer->gguf_w_qkv, fp_full, n_full);
+            /* Split row-major [total_out, dim] → wq/wk/wv. */
+            float* wq = (float*)malloc((size_t)lq * dim * sizeof(float));
+            float* wk = (float*)malloc((size_t)lkv * dim * sizeof(float));
+            float* wv = (float*)malloc((size_t)lkv * dim * sizeof(float));
+            if (wq && wk && wv) {
+                memcpy(wq, fp_full, (size_t)lq * dim * sizeof(float));
+                memcpy(wk, fp_full + (size_t)lq * dim, (size_t)lkv * dim * sizeof(float));
+                memcpy(wv, fp_full + (size_t)(lq + lkv) * dim, (size_t)lkv * dim * sizeof(float));
+                layer->wq = wq;
+                layer->wk = wk;
+                layer->wv = wv;
+                fp32_temps[n_tmp++] = wq;
+                fp32_temps[n_tmp++] = wk;
+                fp32_temps[n_tmp++] = wv;
+            }
+            free(fp_full);
+            layer->gguf_w_qkv = NULL;
+            c->has_fused_qkv = 0; /* after split, no longer fused */
+        }
+    }
+    /* Same quality caveat as fused QKV. Gated behind TQ_PHI3_SPLIT. */
+    if (layer->gguf_w_up_gate && getenv("TQ_PHI3_SPLIT")
+        && (layer->gguf_w_up_gate_type == TQ_GGML_TYPE_Q4_K
+            || layer->gguf_w_up_gate_type == TQ_GGML_TYPE_Q4_0
+            || layer->gguf_w_up_gate_type == TQ_GGML_TYPE_Q5_K)) {
+        int n_full = 2 * lint * dim;
+        float* fp_full = (float*)malloc((size_t)n_full * sizeof(float));
+        if (fp_full) {
+            tq_dequant_row_gguf(layer->gguf_w_up_gate_type, layer->gguf_w_up_gate, fp_full, n_full);
+            /* Phi-3 layout is [gate | up] concatenated along output rows. */
+            float* w_gate = (float*)malloc((size_t)lint * dim * sizeof(float));
+            float* w_up = (float*)malloc((size_t)lint * dim * sizeof(float));
+            if (w_gate && w_up) {
+                memcpy(w_gate, fp_full, (size_t)lint * dim * sizeof(float));
+                memcpy(w_up, fp_full + (size_t)lint * dim, (size_t)lint * dim * sizeof(float));
+                layer->w_gate = w_gate;
+                layer->w_up = w_up;
+                fp32_temps[n_tmp++] = w_gate;
+                fp32_temps[n_tmp++] = w_up;
+            }
+            free(fp_full);
+            layer->gguf_w_up_gate = NULL;
+            c->has_fused_up_gate = 0;
+        }
+    }
+
     /* DeltaNet weights: dequant GGUF -> FP32 */
     if (layer->gguf_delta_qkv && delta_qkv_dim > 0) {
         int n = delta_qkv_dim * dim;
```
