fix(metal): Phi-3.5 Q4_K_M garbage output under default build

unamedkr · claude · unamedkr · commit 30dca7a9f8f0 · 2026-04-15T09:08:27.000+09:00
Root cause: the tq_matmul_gguf_cpu() helper unconditionally reset
tq_matmul_force_cpu to 0 after its own matmul, clobbering the
Phi-3 forward-pass invariant that tq_forward sets. The invariant:
for has_fused_qkv models, ALL matmuls in a forward pass must stay
on CPU, because the Metal kernels produce garbage for Phi-3.5's
dimensions. The helper ran once for fused QKV/FFN (which it
correctly forced to CPU), then reset the flag to 0, so the
subsequent wo/w_down/lm_head matmuls dispatched to Metal and
corrupted intermediate state. Output under the default build:
"etandideti hypothesis Rot Rothrivial...". Phi-3.5 Q8_0 was spared
because the Q8 Metal kernel happens to match CPU output; the Q4_K
kernel does not.
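
The clobbering described above can be reproduced in isolation. The following is a minimal sketch, not the engine's code: force_cpu stands in for tq_matmul_force_cpu, and matmul_dispatch, matmul_cpu_hard_reset, matmul_cpu_save_restore and forward_gpu_dispatches are hypothetical names that model only the flag check, not any real matmul or Metal call.

```c
#include <assert.h>

/* Hypothetical stand-in for tq_matmul_force_cpu. */
static int force_cpu = 0;
/* Counts simulated matmuls that would have been sent to the GPU. */
static int gpu_dispatches = 0;

/* Models only the dispatch decision: GPU unless the flag is set. */
void matmul_dispatch(void) {
    if (!force_cpu) gpu_dispatches++;
}

/* Old helper behavior: force CPU, then hard-reset the flag to 0. */
void matmul_cpu_hard_reset(void) {
    force_cpu = 1;
    matmul_dispatch();
    force_cpu = 0;
}

/* Fixed helper behavior: force CPU, then restore the caller's value. */
void matmul_cpu_save_restore(void) {
    int prev = force_cpu;
    force_cpu = 1;
    matmul_dispatch();
    force_cpu = prev;
}

/* Simulated forward pass for a model that must stay on CPU.
 * Returns how many matmuls leaked to the GPU. */
int forward_gpu_dispatches(void (*cpu_helper)(void)) {
    assert(force_cpu == 0);   /* no forward pass already in flight */
    gpu_dispatches = 0;
    force_cpu = 1;            /* forward-pass invariant */
    cpu_helper();             /* fused QKV via the CPU helper */
    matmul_dispatch();        /* stands in for wo */
    matmul_dispatch();        /* stands in for w_down */
    force_cpu = 0;            /* restored at function exit */
    return gpu_dispatches;
}
```

Under these assumptions, passing matmul_cpu_hard_reset lets both later matmuls reach the GPU, while matmul_cpu_save_restore keeps all of them on CPU.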

Fix:
  1. tq_matmul_gguf_cpu: save prev flag, restore (not hard-reset).
  2. tq_matmul_force_cpu: drop _Thread_local. Worker threads in
     the matmul thread pool must see the flag value set by the
     main thread; with _Thread_local, they saw 0 (default) and
     independently dispatched to Metal.
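
The thread-visibility half of the fix can be illustrated the same way. This is a sketch under stated assumptions: it uses POSIX threads rather than the engine's thread pool, and tl_flag, global_flag, struct seen and observe_from_worker are hypothetical names introduced only for the demonstration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Thread-local flag: every thread gets its own copy, initialized to 0. */
static _Thread_local int tl_flag = 0;
/* Plain global flag: one copy shared by all threads. */
static int global_flag = 0;

/* What a worker thread observed for each flag. */
struct seen { int tl; int global; };

static void *worker(void *arg) {
    struct seen *s = arg;
    s->tl = tl_flag;          /* the worker's own copy, still 0 */
    s->global = global_flag;  /* the value written by the main thread */
    return NULL;
}

/* Sets both flags on the calling thread, then reports what a freshly
 * created worker thread sees. */
struct seen observe_from_worker(void) {
    struct seen s = { -1, -1 };
    pthread_t t;
    tl_flag = 1;
    global_flag = 1;
    int rc = pthread_create(&t, NULL, worker, &s);
    assert(rc == 0);
    pthread_join(t, NULL);
    tl_flag = 0;
    global_flag = 0;
    return s;
}
```

The worker reads 0 from the thread-local flag and 1 from the global one. Note that a plain (non-atomic) global is only safe here because the write happens before pthread_create and the read is followed by pthread_join; the sketch assumes a real thread pool provides comparable synchronization between the main thread setting the flag and the workers reading it.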

Verified:
  - Phi-3.5 Q4_K_M + Metal → "in the land of Eldoria, there lived
    an old sage named Alaric" (was: "etandideti hypothesis Rot...")
  - Phi-3.5 Q8_0, Qwen3.5-4B Q4_K_M + Metal: no regression

Test hardening: scripts/test_models.sh gains a Metal-ON tier
that runs without TQ_NO_METAL=1. The existing tests set that env
var universally, which hid this bug for an unknown duration.
11/11 PASS with default build.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/scripts/test_models.sh b/scripts/test_models.sh
@@ -92,6 +92,16 @@ run_test "Qwen2.5-0.5B-Instruct-Q4_K_M.gguf"   "Hello" ""  COHERENT "TQ_NO_METAL
 # Uses --chat so the ChatML template wrapping is tested end-to-end.
 run_test "Qwen3.5-4B-Q4_K_M.gguf"              "Hi" "Hello" STRICT "TQ_NO_METAL=1" "--chat"
 
+echo ""
+echo "--- Metal-ON tier (default build must also produce coherent output) ---"
+# Regression guard (f0091fc follow-up, 2026-04-15): tq_matmul_gguf_cpu used
+# to hard-reset tq_matmul_force_cpu=0, breaking the Phi-3 force-CPU invariant
+# mid-forward. Metal dispatches for wo/w_down then produced garbage.
+# These tests run WITHOUT TQ_NO_METAL so the default build path is exercised.
+run_test "Phi-3.5-mini-instruct-Q4_K_M.gguf"   "Once upon a time" ""  COHERENT ""
+run_test "Qwen3.5-4B-Q4_K_M.gguf"              "Once upon a time" ""  COHERENT ""
+run_test "Llama-3.2-3B-Instruct-Q8_0.gguf"     "Hello" ""  COHERENT ""
+
 echo ""
 echo "--- Summary ---"
 echo "  PASS: $PASS"
diff --git a/src/engine/tq_gguf_quants.c b/src/engine/tq_gguf_quants.c
@@ -2364,7 +2364,11 @@ void tq_matmul_gguf_cpu(float* out, const float* x,
  * before calling tq_matmul_gguf to skip Metal dispatch. Used for
  * Phi-3 fused QKV/FFN matmuls where Metal has a buffer sizing bug
  * with the unusually large output dimensions. */
-_Thread_local int tq_matmul_force_cpu = 0;
+/* Global flag (NOT thread-local) — worker threads in the matmul thread pool
+ * must see the same value set by the main thread. Prior _Thread_local version
+ * silently allowed Metal dispatch from worker threads despite the main thread
+ * setting the flag to 1 (observed as Phi-3.5 Q4_K_M garbage output under Metal). */
+int tq_matmul_force_cpu = 0;
 
 void tq_matmul_gguf(float* out, const float* x,
                     const void* weight, tq_ggml_dtype weight_type,
@@ -2663,9 +2667,14 @@ void tq_matmul_gguf_cpu(float* out, const float* x,
                          const void* weight, tq_ggml_dtype weight_type,
                          int out_dim, int in_dim)
 {
+    /* Save-and-restore, not hard-reset. A caller (e.g., tq_forward for
+     * Phi-3) may have already set the flag to 1 as an invariant for the
+     * entire forward pass; hard-resetting to 0 here would let subsequent
+     * matmuls in the same forward dispatch to Metal and produce garbage. */
+    int prev = tq_matmul_force_cpu;
     tq_matmul_force_cpu = 1;
     tq_matmul_gguf(out, x, weight, weight_type, out_dim, in_dim);
-    tq_matmul_force_cpu = 0;
+    tq_matmul_force_cpu = prev;
 }
 
 /* ============================================================
diff --git a/src/engine/tq_transformer.c b/src/engine/tq_transformer.c
@@ -2281,7 +2281,7 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
      * on Apple Silicon. Restored at function exit. */
     int _phi3_force_cpu = c->has_fused_qkv;
     if (_phi3_force_cpu) {
-        extern _Thread_local int tq_matmul_force_cpu;
+        extern int tq_matmul_force_cpu;
         tq_matmul_force_cpu = 1;
     }
 
@@ -3008,7 +3008,7 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
 
     /* Restore Metal dispatch for non-Phi3 models */
     if (_phi3_force_cpu) {
-        extern _Thread_local int tq_matmul_force_cpu;
+        extern int tq_matmul_force_cpu;
         tq_matmul_force_cpu = 0;
     }
     return s->logits;