Commit f843e66
debug: isolate batched drift to attention output at Layer 3+ for pos>=1
High-precision vector-wide dumps (sum + sumabs + spot samples at
[0:4] and [dim/2]) narrow the batched vs per-token divergence to
attention output for tok1 (pos=1) starting at Layer 3:
L3 OB tok0: sum=-1.141631 sumabs=25.858075 (MATCHES baseline)
L3 OB tok1: sum=-0.418790 sumabs=31.292446
  baseline: sum=-0.418400 sumabs=31.292183 (0.000390 drift)
Individual element drift is 1-7 ULP at magnitude 0.003, truly at the
FP32 noise floor, but it compounds to 1% by Layer 15 and flips the
argmax token.
Key insight: tok0 (pos=0) is PERFECT through all layers. Only tok1+
drifts, and only from L3 onward. This pattern strongly implicates
RoPE at pos=1 (RoPE at pos=0 is the identity, so there is no drift
opportunity) or the K/V-cache read at position 0 for tok1's attention
(which requires batched's own WK/WV output for tok0 to match baseline
exactly).
Our WK/WV matmul for tok0 at L3 APPEARS to match (the tok0 chain is
bit-identical), yet tok1's attention output, which reads that K[0],
still differs. This leaves only two possibilities:
(a) K[0] at L3 does differ subtly (below the 4-decimal precision of the dumps)
(b) the per-token attention scoring in my batched code produces
different FP rounding than baseline's NEON inner loop at pos>=1
Remaining hypotheses for next session:
- Dump K-cache[layer=3, pos=0] sum with high precision to confirm/
rule out K-cache drift
- If K matches: check attention score computation step-by-step
- Possibility: my attention code at pos>=1 performs a two-position sum
  whose FP accumulation order differs from baseline's seq_len loop
Current state committed: vector accumulator in bm_q4_worker is retained
(architecturally correct even though not sufficient). All higher-
precision dumps are behind TQ_DEBUG_PREFILL. Default behavior unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 file changed: 15 additions, 6 deletions