Commit 1a7fe20
honest correction #9: add eval-length caveats to S1/progressive benchmarks
All PPL measurements were taken at 957 tokens (the tokenizer cap). The
S1 claim "2-bit+k512 Pareto-dominates flat 4-bit" was measured with
53.5% of tokens in FP32, which is not representative of long context
(1.6% at 32K). Reframed from "Pareto-dominated" to "likely dominated,
theoretically motivated but not empirically validated at long context."
The progressive finding (4-bit+k128: +3.8%→+0.6%) is validated at 957
tokens, where k128 = 13.4% FP32, which is representative of ~1K context
use.
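
The FP32 percentages quoted in this message can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes that "kN" means N tokens are held in FP32 and that "32K" means 32,768 tokens. The helper name `fp32_fraction` is hypothetical and is not taken from the project.

```python
def fp32_fraction(k: int, context_len: int) -> float:
    """Share of a context held in FP32, assuming k tokens stay in FP32.

    Assumption (not confirmed by the commit): "kN" means N FP32 tokens.
    The count is capped at the context length for short contexts.
    """
    if context_len <= 0:
        raise ValueError("context_len must be positive")
    return min(k, context_len) / context_len


def as_percent(fraction: float) -> float:
    """Convert a fraction to a percentage rounded to one decimal place."""
    return round(100 * fraction, 1)


EVAL_LEN = 957          # evaluation length stated in the commit message
LONG_CONTEXT = 32_768   # assumed meaning of "32K"

k512_at_eval = as_percent(fp32_fraction(512, EVAL_LEN))
k512_at_long = as_percent(fp32_fraction(512, LONG_CONTEXT))
k128_at_eval = as_percent(fp32_fraction(128, EVAL_LEN))

print(f"k512 at {EVAL_LEN} tokens: {k512_at_eval}% FP32")
print(f"k512 at {LONG_CONTEXT} tokens: {k512_at_long}% FP32")
print(f"k128 at {EVAL_LEN} tokens: {k128_at_eval}% FP32")
```

Under these assumptions the FP32 share falls sharply as the context grows, which is why a result measured at 957 tokens does not carry over to long context.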
Both benchmark docs now include explicit validation notes. This is
honest correction #9 in the project's retraction track record.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b8286b0 commit 1a7fe20
2 files changed
Lines changed: 20 additions & 4 deletions
File 1 of 2 (file name and line contents not preserved):
  lines 30-31 replaced (2 deletions, 2 additions)
  lines 34-35 replaced by new lines 34-47 (2 deletions, 14 additions)
File 2 of 2 (file name and line contents not preserved):
  4 lines added as new lines 37-40