Commit f843e66

unamedkr and claude committed
debug: isolate batched drift to attention output at Layer 3+ for pos>=1
High-precision vector-wide dumps (sum + sumabs + spot samples at [0:4] and [dim/2]) narrow the batched vs per-token divergence to the attention output for tok1 (pos=1), starting at Layer 3:

  L3 OB tok0: sum=-1.141631 sumabs=25.858075 (MATCHES baseline)
  L3 OB tok1: sum=-0.418790 sumabs=31.292446
    baseline: sum=-0.418400 sumabs=31.292183 (0.000390 drift)

Individual element drift is 1-7 ULP at magnitude 0.003 — truly at the FP32 noise floor, but it compounds to 1% by Layer 15 and flips the argmax token.

Key insight: tok0 (pos=0) is PERFECT through all layers. Only tok1+ drifts, specifically from L3 onward. This pattern strongly implicates either RoPE at pos=1 (RoPE at pos=0 is the identity, so there is no drift opportunity) or the K/V-cache read at position 0 for tok1's attention (which requires batched's own WK/WV output for tok0 to match baseline exactly).

Our WK/WV matmul for tok0 at L3 APPEARS to match (since the tok0 chain is bit-identical), but the attention output for tok1 reading that K[0] still produces a different result. This can only mean:

(a) K[0] at L3 does differ subtly (below the 4-decimal dump precision), or
(b) the per-token attention scoring in my batched code produces different FP rounding than baseline's NEON inner loop at pos>=1.

Remaining hypotheses for next session:
- Dump K-cache[layer=3, pos=0] sum with high precision to confirm or rule out K-cache drift.
- If K matches: check the attention score computation step by step.
- Possibility: my attention code at pos>=1 has a two-position sum whose FP order differs from baseline's seq_len loop.

Current state committed: the vector accumulator in bm_q4_worker is retained (architecturally correct even though not sufficient). All higher-precision dumps are behind TQ_DEBUG_PREFILL. Default behavior is unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 442c2d7 commit f843e66

1 file changed: src/engine/tq_transformer.c (15 additions, 6 deletions)
@@ -2242,8 +2242,12 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
         if (has_gguf) tq_metal_batch_flush_if_available();
         TQ_PROF_STOP(_tp, matmul_ns);

-        if (l <= 3 && pos <= 1 && getenv("TQ_DEBUG_PREFILL")) {
-            fprintf(stderr, "[fwd] L%d pos=%d xb2 (after wo) [0:8] = ", l, pos);
+        if (l == 3 && pos <= 1 && getenv("TQ_DEBUG_PREFILL")) {
+            double _sum=0, _sabs=0;
+            for (int i = 0; i < dim; i++) { _sum += s->xb[i]; _sabs += (s->xb[i]<0?-s->xb[i]:s->xb[i]); }
+            fprintf(stderr, "[fwd] L3 pos=%d xb sum=%.6f sumabs=%.6f [0:4]=%.6f,%.6f,%.6f,%.6f [dim/2]=%.6f\n",
+                    pos, _sum, _sabs, s->xb[0], s->xb[1], s->xb[2], s->xb[3], s->xb[dim/2]);
+            fprintf(stderr, "[fwd] L3 pos=%d xb2 (after wo) [0:8] = ", pos);
             for (int i = 0; i < 8; i++) fprintf(stderr, "%.4f ", s->xb2[i]);
             fprintf(stderr, "\n");
         }
@@ -3389,10 +3393,15 @@ int tq_forward_batch(tq_model_t* model, tq_state_t* s,
         }
     }

-    if (l == 0 && dbg) {
-        fprintf(stderr, "[batch] L0 OB (post-attn) tok0 [0:8] = ");
-        for (int i = 0; i < 8; i++) fprintf(stderr, "%.4f ", OB[i]);
-        fprintf(stderr, "\n");
+    if (l == 3 && dbg) {
+        /* Sum + sumabs over OB to detect drift anywhere in the vector */
+        for (int tn = 0; tn < N && tn < 2; tn++) {
+            float* obr = OB + (size_t)tn * dim;
+            double s=0, sabs=0;
+            for (int i = 0; i < dim; i++) { s += obr[i]; sabs += (obr[i]<0?-obr[i]:obr[i]); }
+            fprintf(stderr, "[batch] L3 OB tok%d sum=%.6f sumabs=%.6f [0:4]=%.6f,%.6f,%.6f,%.6f [dim/2]=%.6f\n",
+                    tn, s, sabs, obr[0], obr[1], obr[2], obr[3], obr[dim/2]);
+        }
     }
     /* 5. O matmul batched + Q2 residual */
     tq_batched_matmul_q4(X, layer->wo_q4, layer->wo_q4s, OB, dim, q_dim, N, NULL);
