debug: root-cause FP16 V garbage — it's actually quant K cache mismatch

unamedkr · claude · unamedkr · commit f4934e9abd7b · 2026-04-16T00:28:01.000+09:00
Further investigation showed V_fp16 cache bytes are BIT-IDENTICAL between
batched and baseline at all layers 0-3 (XOR hash match). So V was never
the issue.
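
The hashing code itself is not part of this commit. As a rough sketch of
that kind of check, assuming a hypothetical xor_hash_bytes helper and a
follow-up memcmp (an XOR fold alone can collide, so a hash match is a
strong hint, not proof):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the repo): XOR-fold a buffer into 64 bits,
 * consuming 8 bytes at a time and zero-padding the tail. */
static uint64_t xor_hash_bytes(const void* buf, size_t nbytes) {
    const unsigned char* p = (const unsigned char*)buf;
    uint64_t h = 0;
    size_t i = 0;
    for (; i + 8 <= nbytes; i += 8) {
        uint64_t w;
        memcpy(&w, p + i, 8);   /* memcpy avoids unaligned loads */
        h ^= w;
    }
    if (i < nbytes) {
        uint64_t w = 0;
        memcpy(&w, p + i, nbytes - i);
        h ^= w;
    }
    return h;
}

/* Returns 1 when two FP16 cache slices of n elements are byte-identical.
 * The hash is a cheap early-out; memcmp is the definitive comparison. */
static int fp16_slices_identical(const uint16_t* a, const uint16_t* b, size_t n) {
    size_t nbytes = n * sizeof(uint16_t);
    if (xor_hash_bytes(a, nbytes) != xor_hash_bytes(b, nbytes)) return 0;
    return memcmp(a, b, nbytes) == 0;
}
```

When both buffers live in the same process, memcmp alone is enough; the
hash is mainly useful for comparing dumps across separate runs.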

The real root cause: with KV quantization enabled (the default,
turbo_kv_4b), baseline stores K in quant_key_cache rather than in the
FP32 s->key_cache. Its attention reads K from quant_key_cache and
dequantizes on the fly.

My tq_forward_batch unconditionally writes FP32 K to s->key_cache.
When the final tq_forward(pos=last) runs after the batched pass, it
reads history positions from quant_key_cache, which is still all zeros
because the batched pass never populated it. Attention sees zero K for
pos 0, and the output breaks.
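
A toy reproduction of that mismatch, as a sketch only: demo_state_t, the
demo_* functions, and the 8-bit scale-per-position quantizer are all
hypothetical stand-ins, not the real tq_state_t layout or the
turbo_kv_4b format.

```c
#include <stdint.h>
#include <string.h>

enum { DEMO_POS = 4, DEMO_DIM = 8 };

/* Hypothetical state holding both K representations side by side. */
typedef struct {
    float  key_cache[DEMO_POS][DEMO_DIM];        /* FP32 K */
    int8_t quant_key_cache[DEMO_POS][DEMO_DIM];  /* quantized K codes */
    float  quant_scale[DEMO_POS];                /* one scale per position */
} demo_state_t;

/* What the batched path does in this toy: write FP32 K only. */
static void demo_write_k_fp32(demo_state_t* s, int pos, const float* k) {
    memcpy(s->key_cache[pos], k, sizeof(float) * DEMO_DIM);
}

/* What the quantized path needs: fill the quantized cache too. */
static void demo_write_k_quant(demo_state_t* s, int pos, const float* k) {
    float amax = 0.0f;
    for (int i = 0; i < DEMO_DIM; i++) {
        float a = k[i] < 0.0f ? -k[i] : k[i];
        if (a > amax) amax = a;
    }
    float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    s->quant_scale[pos] = scale;
    for (int i = 0; i < DEMO_DIM; i++) {
        float v = k[i] / scale;
        int q = (int)(v + (v >= 0.0f ? 0.5f : -0.5f));  /* round to nearest */
        if (q > 127) q = 127;
        if (q < -127) q = -127;
        s->quant_key_cache[pos][i] = (int8_t)q;
    }
}

/* Attention score q.k, reading K from the quantized cache only. */
static float demo_score_quant(const demo_state_t* s, int pos, const float* q) {
    float dot = 0.0f;
    for (int i = 0; i < DEMO_DIM; i++)
        dot += q[i] * (float)s->quant_key_cache[pos][i] * s->quant_scale[pos];
    return dot;
}
```

After demo_write_k_fp32 alone on a zeroed state, demo_score_quant returns
0 for that position no matter what K was; it only becomes meaningful once
demo_write_k_quant has run for the same position.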

This is correctly guarded by the existing kv_is_fp32 gate: batched
only runs when KV cache is FP32, so the quant_key_cache mismatch
doesn't occur. Removed the diagnostic dumps now that the analysis
is documented.

To enable batched for default (quant K) mode later: batched needs to
write to quant_key_cache via traits->quantize per head per block.
That's ~50 LOC but touches multiple code paths. Deferred.
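
A minimal sketch of what that per-head, per-block write could look like.
Everything here is an assumption for illustration: toy_traits_t, the
quantize callback signature, the 32-element block, and the toy 4-bit
layout are hypothetical and do not reflect the real traits struct or the
turbo_kv_4b encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TOY_BLOCK = 32 };

/* Hypothetical 4-bit block: one FP32 scale plus 32 packed nibbles. */
typedef struct {
    float   scale;
    uint8_t codes[TOY_BLOCK / 2];
} toy_block_4b;

/* Hypothetical traits table; the real signature may differ. */
typedef struct {
    int    block_size;
    size_t block_bytes;
    void (*quantize)(const float* src, void* dst, int n);
} toy_traits_t;

/* Symmetric 4-bit quantization: codes 1..15 represent -7..7. */
static void toy_quantize_4b(const float* src, void* dst, int n) {
    assert(n <= TOY_BLOCK);
    toy_block_4b blk;
    memset(&blk, 0, sizeof blk);
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = src[i] < 0.0f ? -src[i] : src[i];
        if (a > amax) amax = a;
    }
    blk.scale = amax > 0.0f ? amax / 7.0f : 1.0f;
    for (int i = 0; i < n; i++) {
        float v = src[i] / blk.scale;
        int q = (int)(v + (v >= 0.0f ? 0.5f : -0.5f)) + 8;
        if (q < 0) q = 0;
        if (q > 15) q = 15;
        blk.codes[i / 2] |= (uint8_t)(q << ((i & 1) * 4));
    }
    memcpy(dst, &blk, sizeof blk);
}

/* Inverse of toy_quantize_4b, used to check the stored blocks. */
static void toy_dequant_4b(const void* src, float* out, int n) {
    toy_block_4b blk;
    memcpy(&blk, src, sizeof blk);
    for (int i = 0; i < n; i++) {
        int q = (blk.codes[i / 2] >> ((i & 1) * 4)) & 0xF;
        out[i] = (float)(q - 8) * blk.scale;
    }
}

/* Quantize one position's K vector into the cache, head by head and
 * block by block. Assumes head_dim is a multiple of block_size. */
static void toy_store_k_quant(const toy_traits_t* t, uint8_t* quant_key_cache,
                              const float* k, int pos,
                              int n_kv_heads, int head_dim) {
    int blocks_per_head = head_dim / t->block_size;
    size_t pos_bytes = (size_t)n_kv_heads * blocks_per_head * t->block_bytes;
    uint8_t* row = quant_key_cache + (size_t)pos * pos_bytes;
    for (int h = 0; h < n_kv_heads; h++) {
        for (int b = 0; b < blocks_per_head; b++) {
            const float* src = k + (size_t)h * head_dim
                                 + (size_t)b * t->block_size;
            uint8_t* dst = row + ((size_t)h * blocks_per_head + b)
                                 * t->block_bytes;
            t->quantize(src, dst, t->block_size);
        }
    }
}
```

In the batched path, a call like toy_store_k_quant would run once per
(layer, position) next to, or instead of, the FP32 K write, so that a
later single-token forward pass finds populated history blocks.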

11/11 STRICT tests pass. Default behavior unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/src/engine/tq_transformer.c b/src/engine/tq_transformer.c
@@ -3298,9 +3298,6 @@ int tq_forward_batch(tq_model_t* model, tq_state_t* s,
                 memcpy(s->value_cache + (size_t)l * kv_layer_stride + (size_t)pos * kv_dim,
                        VB + (size_t)n * kv_dim, (size_t)kv_dim * sizeof(float));
             } else if (s->value_cache_fp16) {
-                /* Match tq_forward exactly: hardware FP16 conversion via NEON
-                 * vcvt_f16_f32. Inline manual conversion gave subtly different
-                 * rounding which propagated through attention and broke output. */
                 uint16_t* dst = s->value_cache_fp16
                               + (size_t)l * kv_layer_stride + (size_t)pos * kv_dim;
                 f32_to_fp16_vec(VB + (size_t)n * kv_dim, dst, kv_dim);