Commit f843e66

unamedkr and claude committed
debug: isolate batched drift to attention output at Layer 3+ for pos>=1
High-precision vector-wide dumps (sum + sumabs + spot samples at [0:4] and [dim/2]) narrow the batched vs per-token divergence to the attention output for tok1 (pos=1), starting at Layer 3:

  L3 OB tok0: sum=-1.141631 sumabs=25.858075 (MATCHES baseline)
  L3 OB tok1: sum=-0.418790 sumabs=31.292446
    baseline: sum=-0.418400 sumabs=31.292183 (0.000390 drift)

Individual element drift is 1-7 ULP at magnitude 0.003 — truly at the FP32 noise floor, but it compounds to 1% by Layer 15 and flips the argmax token.

Key insight: tok0 (pos=0) is PERFECT through all layers. Only tok1+ drifts, specifically from L3 onward. This pattern strongly implicates either RoPE at pos=1 (RoPE at pos=0 is the identity, so there is no drift opportunity) or the K/V-cache read at position 0 for tok1's attention (which requires batched's own WK/WV output for tok0 to match baseline exactly).

Our WK/WV matmul for tok0 at L3 APPEARS to match (since the tok0 chain is bit-identical), but the attention output for tok1 reading that K[0] still produces a different result. This can only mean:

(a) K[0] at L3 does differ subtly (below the 4-decimal dump precision), or
(b) the per-token attention scoring in my batched code produces different FP rounding than baseline's NEON inner loop at pos>=1.

Remaining hypotheses for next session:
- Dump K-cache[layer=3, pos=0] sum with high precision to confirm or rule out K-cache drift.
- If K matches: check the attention score computation step by step.
- Possibility: my attention code at pos>=1 has a two-position sum whose FP order differs from baseline's seq_len loop.

Current state committed: the vector accumulator in bm_q4_worker is retained (architecturally correct even though not sufficient). All higher-precision dumps are behind TQ_DEBUG_PREFILL. Default behavior is unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 442c2d7 commit f843e66

1 file changed: src/engine/tq_transformer.c (15 additions, 6 deletions)
@@ -2242,8 +2242,12 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
         if (has_gguf) tq_metal_batch_flush_if_available();
         TQ_PROF_STOP(_tp, matmul_ns);

-        if (l <= 3 && pos <= 1 && getenv("TQ_DEBUG_PREFILL")) {
-            fprintf(stderr, "[fwd] L%d pos=%d xb2 (after wo) [0:8] = ", l, pos);
+        if (l == 3 && pos <= 1 && getenv("TQ_DEBUG_PREFILL")) {
+            double _sum=0, _sabs=0;
+            for (int i = 0; i < dim; i++) { _sum += s->xb[i]; _sabs += (s->xb[i]<0?-s->xb[i]:s->xb[i]); }
+            fprintf(stderr, "[fwd] L3 pos=%d xb sum=%.6f sumabs=%.6f [0:4]=%.6f,%.6f,%.6f,%.6f [dim/2]=%.6f\n",
+                    pos, _sum, _sabs, s->xb[0], s->xb[1], s->xb[2], s->xb[3], s->xb[dim/2]);
+            fprintf(stderr, "[fwd] L3 pos=%d xb2 (after wo) [0:8] = ", pos);
             for (int i = 0; i < 8; i++) fprintf(stderr, "%.4f ", s->xb2[i]);
             fprintf(stderr, "\n");
         }
@@ -3389,10 +3393,15 @@ int tq_forward_batch(tq_model_t* model, tq_state_t* s,
         }
     }

-    if (l == 0 && dbg) {
-        fprintf(stderr, "[batch] L0 OB (post-attn) tok0 [0:8] = ");
-        for (int i = 0; i < 8; i++) fprintf(stderr, "%.4f ", OB[i]);
-        fprintf(stderr, "\n");
+    if (l == 3 && dbg) {
+        /* Sum + sumabs over OB to detect drift anywhere in the vector */
+        for (int tn = 0; tn < N && tn < 2; tn++) {
+            float* obr = OB + (size_t)tn * dim;
+            double s=0, sabs=0;
+            for (int i = 0; i < dim; i++) { s += obr[i]; sabs += (obr[i]<0?-obr[i]:obr[i]); }
+            fprintf(stderr, "[batch] L3 OB tok%d sum=%.6f sumabs=%.6f [0:4]=%.6f,%.6f,%.6f,%.6f [dim/2]=%.6f\n",
+                    tn, s, sabs, obr[0], obr[1], obr[2], obr[3], obr[dim/2]);
+        }
     }
     /* 5. O matmul batched + Q2 residual */
     tq_batched_matmul_q4(X, layer->wo_q4, layer->wo_q4s, OB, dim, q_dim, N, NULL);
