Commit bd063e0
debug: pinpoint batched-prefill drift to wo_matmul FP accumulation order
Extensive layer-by-layer diff between batched prefill and per-token
forward reveals the exact divergence point:
L0 tok0/tok1 Xres: bit-identical
L1 tok0/tok1 Xres: bit-identical
L2 tok0/tok1 Xres: bit-identical
L3 tok0 Xres: bit-identical
L3 tok1 Xres: 1-ULP drift at specific elements after wo matmul
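The "bit-identical" versus "1-ULP drift" distinction above can be checked mechanically. The following is a minimal sketch of such a comparison, not the instrumentation used in this commit; `ordered_bits`, `ulp_distance`, and `first_divergence` are hypothetical helper names, and NaN inputs are not handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern onto an integer line that is monotonic in the
 * float's value, so that adjacent floats differ by exactly 1. */
static int64_t ordered_bits(float f) {
    int32_t i;
    memcpy(&i, &f, sizeof i);              /* well-defined type pun */
    /* Negative floats sort in reverse bit order; mirror them below zero. */
    return (i < 0) ? (int64_t)INT32_MIN - (int64_t)i : (int64_t)i;
}

/* Number of representable floats between a and b (0 means equal values;
 * note that +0.0 and -0.0 also report 0). */
static int64_t ulp_distance(float a, float b) {
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}

/* Compare two activation rows. Returns the index of the first element that
 * differs (or -1 if none) and stores the largest ULP distance seen. */
static long first_divergence(const float *ref, const float *got, size_t n,
                             int64_t *max_ulp) {
    long first = -1;
    *max_ulp = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t d = ulp_distance(ref[i], got[i]);
        if (d != 0 && first < 0) first = (long)i;
        if (d > *max_ulp) *max_ulp = d;
    }
    return first;
}
```

Run per layer and per token on the residual stream of both paths, a check like this reports -1 while the paths agree and then the first offending element with a maximum distance of 1 once a 1-ULP drift appears.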
Root cause: baseline's matmul_q4_rows uses NEON vector accumulation
(sumv0 = vmlaq_n_f32(...), then a vaddvq_f32 tree reduce at the end), while
my bm_q4_worker uses scalar acc[n] += wd*xd*isum per block. FP addition
is non-associative, so the two orders can round differently at 1-ULP
granularity. For tok0 the results happen to be bit-identical; for tok1
they diverge, and the drift compounds by 1% per layer until the final
logits pick a wrong token ("hell hel" instead of "I'm so excited").
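To illustrate why accumulation order matters, here is a minimal portable sketch that contrasts the two orders without NEON. `dot_scalar` and `dot_lanes4` are hypothetical stand-ins, not the real `bm_q4_worker` or `matmul_q4_rows`; the four-element `lane` array models a `float32x4_t` accumulator, and the final pairwise sum is an assumed model of the horizontal add.

```c
#include <stddef.h>

/* Scalar order: a single running sum, elements added left to right. */
static float dot_scalar(const float *w, const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += w[i] * x[i];
    return acc;
}

/* Lane order: four independent partial sums (one per vector lane),
 * combined pairwise only at the end. n must be a multiple of 4. */
static float dot_lanes4(const float *w, const float *x, size_t n) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4)
        for (int l = 0; l < 4; l++)
            lane[l] += w[i + l] * x[i + l];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

With `w = {1, 1, 1, 1}` and `x = {1e8f, 1.0f, -1e8f, 1.0f}`, strict IEEE 754 single-precision evaluation gives 1.0 from `dot_scalar` and 0.0 from `dot_lanes4`. These inputs are chosen to exaggerate the effect; with typical activations the same mechanism shows up as the occasional 1-ULP difference described above.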
Also verified: TQ_BATCHED_SERIAL=1 (per-token matmul via
tq_matmul_q4_preq inside the batched path) still produces wrong output,
confirming the bug is in the N>=2 accumulator order, even though the
per-token results for token 0 match by coincidence.
Next session: refactor bm_q4_worker to use N separate float32x4_t
vector accumulators (one per token) and reduce each with vaddvq_f32 at
the end, exactly matching baseline's sumv0/sumv1 pattern. This is a
30-50 LOC change and should achieve bit-identical output across all layers.
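The planned refactor might look roughly like the sketch below, written in portable C so it runs anywhere. It is not the actual `bm_q4_worker`: `dot_lanes4_batched`, `dot_lanes4`, and `MAX_TOKENS` are hypothetical, Q4 dequantization is omitted, and each `lane[t]` row stands in for one `float32x4_t` accumulator. The point is that every token performs its additions in exactly the same order as the single-token path.

```c
#include <stddef.h>

#define MAX_TOKENS 8   /* hypothetical cap on tokens per batch */

/* Single-token reference: four lane sums, pairwise reduce at the end. */
static float dot_lanes4(const float *w, const float *x, size_t dim) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < dim; i += 4)
        for (int l = 0; l < 4; l++)
            lane[l] += w[i + l] * x[i + l];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}

/* Batched version: one 4-lane accumulator per token. The weight row is
 * read once per step and applied to every token, but each token's own
 * sequence of additions is identical to dot_lanes4.
 * x holds n_tokens rows of dim floats; dim must be a multiple of 4.
 * Returns 0 on success, -1 if n_tokens exceeds MAX_TOKENS. */
static int dot_lanes4_batched(const float *w, const float *x,
                              size_t n_tokens, size_t dim, float *out) {
    float lane[MAX_TOKENS][4] = {{0.0f}};
    if (n_tokens > MAX_TOKENS) return -1;
    for (size_t i = 0; i < dim; i += 4)
        for (size_t t = 0; t < n_tokens; t++)
            for (int l = 0; l < 4; l++)
                lane[t][l] += w[i + l] * x[t * dim + i + l];
    for (size_t t = 0; t < n_tokens; t++)
        out[t] = (lane[t][0] + lane[t][1]) + (lane[t][2] + lane[t][3]);
    return 0;
}
```

Because the batching only interleaves work across tokens and never reorders additions within a token, `out[t]` should equal `dot_lanes4(w, x + t * dim, dim)` exactly, provided both are compiled with the same floating-point settings.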
Instrumented dumps retained behind TQ_DEBUG_PREFILL=1 for regression.
Default behavior unchanged; batched prefill still opt-in.
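For readers unfamiliar with the pattern, an environment-gated dump can be sketched as follows. Only the variable name TQ_DEBUG_PREFILL comes from this commit; `flag_is_on`, `debug_prefill_enabled`, and `format_bits` are hypothetical helpers, and the real dump format is not shown here.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A flag counts as "on" only when its value is exactly "1". */
static int flag_is_on(const char *value) {
    return value != NULL && strcmp(value, "1") == 0;
}

/* Checks the environment on each call; dumps stay off by default. */
static int debug_prefill_enabled(void) {
    return flag_is_on(getenv("TQ_DEBUG_PREFILL"));
}

/* Format a float as its raw bit pattern so that 1-ULP differences are
 * visible in a text diff (decimal printing can hide them). */
static void format_bits(char *buf, size_t buf_len, float v) {
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    snprintf(buf, buf_len, "0x%08lx", (unsigned long)bits);
}
```

A call site would wrap its dump in `if (debug_prefill_enabled()) { ... }`, so the default path pays only one `getenv` lookup per check and produces no extra output.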
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent ece4185
2 files changed
Lines changed: 49 additions & 18 deletions