
Commit 90c3552

unamedkr and claude committed
debug: softmax cliff identified as batched drift amplifier
High-precision sum/sumabs dumps across layers reveal the drift mechanism: tok0 (pos=0) final Xres — drift at NOISE FLOOR through all layers: L0: 40.478353289 vs base 40.478353226 (diff 6.3e-8, ~1 ULP) L1: diff 8e-9 L2: diff 3e-7 (~4 ULP over 2048 elements) L3: diff 6.6e-7 tok1 (pos=1) final Xres — drift JUMPS at L3: L0: diff 2e-7 (~2 ULP) — noise L1: diff 6.4e-7 — noise L2: diff 9.5e-7 — noise L3: sum diff 0.004, sumabs diff 0.072 — 5 orders bigger What L3 pos=1 does different from L2 pos=1: same code paths, same number of attention positions (2), same weights. So the mathematical ops are identical. But ONE of those ops has near-softmax-cliff scores at L3 — where att[0] and att[1] happen to be within 1 ULP of each other. A tiny numerical drift in score computation flips which V gets more weight, producing disproportionate OB drift. This is a FUNDAMENTAL property of attention: at softmax cliffs, 1 ULP input drift causes order-of-magnitude output drift. It's not a bug per-se — it's why bit-identical reproducibility between different execution paths is so hard for transformers. Pragmatic paths to resolution: (a) FP32 V cache (remove the FP16 round-trip, our largest drift source). Costs 2x KV memory but eliminates ULP-scale noise. (b) Accept drift; batched prefill will work for most prompts but occasionally flip tokens. Measure rate on a test suite. (c) Continue stamping out every ULP difference until bit-identical. Achievable but weeks of careful work. Today's session ends with clear understanding: batched matmul primitive is solid; integration surfaces FP reproducibility challenges that are fundamental rather than engineering. Strategy (a) — FP32 V cache — is the most promising next step. It eliminates the largest drift source in one commit. All changes committed on TQ_BATCH_PREFILL opt-in gate. 11/11 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent f843e66 commit 90c3552

1 file changed: 8 additions & 6 deletions

src/engine/tq_transformer.c
@@ -2830,9 +2830,9 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
 
 layer_postprocess:
     if (l <= 3 && pos <= 1 && getenv("TQ_DEBUG_PREFILL")) {
-        fprintf(stderr, "[fwd] L%d pos=%d final x [0:8] = ", l, pos);
-        for (int i = 0; i < 8; i++) fprintf(stderr, "%.4f ", s->x[i]);
-        fprintf(stderr, "\n");
+        double _s=0, _sa=0;
+        for (int i = 0; i < dim; i++) { _s += s->x[i]; _sa += (s->x[i]<0?-s->x[i]:s->x[i]); }
+        fprintf(stderr, "[fwd] L%d pos=%d final x sum=%.9f sumabs=%.9f\n", l, pos, _s, _sa);
     }
     /* Post-layer processing: PLE, layer_output_scale.
      * GPU graph path jumps here after full-layer GPU forward. */
@@ -3486,9 +3486,11 @@ int tq_forward_batch(tq_model_t* model, tq_state_t* s,
 
     if ((l <= 3) && dbg) {
         for (int tn = 0; tn < N && tn < 2; tn++) {
-            fprintf(stderr, "[batch] L%d final Xres tok%d [0:8] = ", l, tn);
-            for (int i = 0; i < 8; i++) fprintf(stderr, "%.4f ", Xres[(size_t)tn * dim + i]);
-            fprintf(stderr, "\n");
+            float* row = Xres + (size_t)tn * dim;
+            double s=0, sabs=0;
+            for (int i = 0; i < dim; i++) { s += row[i]; sabs += (row[i]<0?-row[i]:row[i]); }
+            fprintf(stderr, "[batch] L%d final Xres tok%d sum=%.9f sumabs=%.9f\n",
+                    l, tn, s, sabs);
         }
     }
 }
