Skip to content

Commit 30dca7a

Browse files
unamedkrclaude
andcommitted
fix(metal): Phi-3.5 Q4_K_M garbage output under default build
Root cause: tq_matmul_gguf_cpu() helper unconditionally reset tq_matmul_force_cpu to 0 after its own matmul, clobbering the Phi-3 forward-pass invariant that tq_forward sets. The invariant says: for has_fused_qkv models, ALL matmuls in this forward must stay on CPU because Metal kernels produce garbage for Phi-3.5's dims. The helper ran once for fused QKV/FFN (which it correctly forced to CPU), then reset the flag to 0, so subsequent wo/w_down/ lm_head matmuls dispatched to Metal and corrupted intermediate state. Output under default build: "etandideti hypothesis Rot Rothrivial...". Phi-3.5 Q8_0 was spared because Q8 Metal kernel happens to match CPU output; Q4_K does not. Fix: 1. tq_matmul_gguf_cpu: save prev flag, restore (not hard-reset). 2. tq_matmul_force_cpu: drop _Thread_local. Worker threads in the matmul thread pool must see the flag value set by the main thread; with _Thread_local, they saw 0 (default) and independently dispatched to Metal. Verified: - Phi-3.5 Q4_K_M + Metal → "in the land of Eldoria, there lived an old sage named Alaric" (was: "etandideti hypothesis Rot...") - Phi-3.5 Q8_0, Qwen3.5-4B Q4_K_M + Metal: no regression Test hardening: scripts/test_models.sh gains a Metal-ON tier that runs without TQ_NO_METAL=1. The existing tests set that env var universally, which hid this bug for an unknown duration. 11/11 PASS with default build. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 361c40f commit 30dca7a

3 files changed

Lines changed: 23 additions & 4 deletions

File tree

scripts/test_models.sh

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -92,6 +92,16 @@ run_test "Qwen2.5-0.5B-Instruct-Q4_K_M.gguf" "Hello" "" COHERENT "TQ_NO_METAL
9292
# Uses --chat so the ChatML template wrapping is tested end-to-end.
9393
run_test "Qwen3.5-4B-Q4_K_M.gguf" "Hi" "Hello" STRICT "TQ_NO_METAL=1" "--chat"
9494

95+
echo ""
96+
echo "--- Metal-ON tier (default build must also produce coherent output) ---"
97+
# Regression guard (f0091fc follow-up, 2026-04-15): tq_matmul_gguf_cpu used
98+
# to hard-reset tq_matmul_force_cpu=0, breaking the Phi-3 force-CPU invariant
99+
# mid-forward. Metal dispatches for wo/w_down then produced garbage.
100+
# These tests run WITHOUT TQ_NO_METAL so the default build path is exercised.
101+
run_test "Phi-3.5-mini-instruct-Q4_K_M.gguf" "Once upon a time" "" COHERENT ""
102+
run_test "Qwen3.5-4B-Q4_K_M.gguf" "Once upon a time" "" COHERENT ""
103+
run_test "Llama-3.2-3B-Instruct-Q8_0.gguf" "Hello" "" COHERENT ""
104+
95105
echo ""
96106
echo "--- Summary ---"
97107
echo " PASS: $PASS"

src/engine/tq_gguf_quants.c

Lines changed: 11 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2364,7 +2364,11 @@ void tq_matmul_gguf_cpu(float* out, const float* x,
23642364
* before calling tq_matmul_gguf to skip Metal dispatch. Used for
23652365
* Phi-3 fused QKV/FFN matmuls where Metal has a buffer sizing bug
23662366
* with the unusually large output dimensions. */
2367-
_Thread_local int tq_matmul_force_cpu = 0;
2367+
/* Global flag (NOT thread-local) — worker threads in the matmul thread pool
2368+
* must see the same value set by the main thread. Prior _Thread_local version
2369+
* silently allowed Metal dispatch from worker threads despite the main thread
2370+
* setting the flag to 1 (observed as Phi-3.5 Q4_K_M garbage output under Metal). */
2371+
int tq_matmul_force_cpu = 0;
23682372

23692373
void tq_matmul_gguf(float* out, const float* x,
23702374
const void* weight, tq_ggml_dtype weight_type,
@@ -2663,9 +2667,14 @@ void tq_matmul_gguf_cpu(float* out, const float* x,
26632667
const void* weight, tq_ggml_dtype weight_type,
26642668
int out_dim, int in_dim)
26652669
{
2670+
/* Save-and-restore, not hard-reset. A caller (e.g., tq_forward for
2671+
* Phi-3) may have already set the flag to 1 as an invariant for the
2672+
* entire forward pass; hard-resetting to 0 here would let subsequent
2673+
* matmuls in the same forward dispatch to Metal and produce garbage. */
2674+
int prev = tq_matmul_force_cpu;
26662675
tq_matmul_force_cpu = 1;
26672676
tq_matmul_gguf(out, x, weight, weight_type, out_dim, in_dim);
2668-
tq_matmul_force_cpu = 0;
2677+
tq_matmul_force_cpu = prev;
26692678
}
26702679

26712680
/* ============================================================

src/engine/tq_transformer.c

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2281,7 +2281,7 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
22812281
* on Apple Silicon. Restored at function exit. */
22822282
int _phi3_force_cpu = c->has_fused_qkv;
22832283
if (_phi3_force_cpu) {
2284-
extern _Thread_local int tq_matmul_force_cpu;
2284+
extern int tq_matmul_force_cpu;
22852285
tq_matmul_force_cpu = 1;
22862286
}
22872287

@@ -3008,7 +3008,7 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
30083008

30093009
/* Restore Metal dispatch for non-Phi3 models */
30103010
if (_phi3_force_cpu) {
3011-
extern _Thread_local int tq_matmul_force_cpu;
3011+
extern int tq_matmul_force_cpu;
30123012
tq_matmul_force_cpu = 0;
30133013
}
30143014
return s->logits;

0 commit comments

Comments
 (0)