Commit 1a7fe20

unamedkr and claude committed
honest correction #9: add eval-length caveats to S1/progressive benchmarks
All PPL measurements were at 957 tokens (tokenizer cap). The S1 claim "2-bit+k512 Pareto-dominates flat 4-bit" was measured with 53.5% FP32 tokens — not representative of long context (1.6% at 32K). Reframed from "Pareto-dominated" to "likely dominated, theoretically motivated but not empirically validated at long context."

The progressive finding (4-bit+k128: +3.8% → +0.6%) is validated at 957 tokens, where k128 = 13.4% FP32, representative of ~1K-context use. Both benchmark docs now include explicit validation notes.

This is honest correction #9 in the project's retraction track record.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
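The percentages quoted in the commit message are straight ratio arithmetic. A minimal sketch reproduces them (the `fp32_fraction` helper is illustrative, not part of the repo):

```python
def fp32_fraction(k_highres: int, context_len: int) -> float:
    """Fraction of cached tokens held at FP32 when the most recent
    min(k_highres, context_len) tokens are kept full-precision."""
    return min(k_highres, context_len) / context_len

# Percentages quoted in the commit message:
print(round(fp32_fraction(512, 957) * 100, 1))        # 53.5 — k512 at the 957-token eval cap
print(round(fp32_fraction(512, 32 * 1024) * 100, 1))  # 1.6  — k512 at 32K context
print(round(fp32_fraction(128, 957) * 100, 1))        # 13.4 — k128 at the eval cap
```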
1 parent b8286b0 commit 1a7fe20

2 files changed

Lines changed: 20 additions & 4 deletions

File tree

bench/results/attention_aware_quantization.md

Lines changed: 16 additions & 4 deletions
@@ -27,12 +27,24 @@ information-theoretically near-optimal.
 
 | Method | PPL penalty | Memory (32K) | Pareto-optimal? |
 |---|---:|---:|---|
-| Flat 4-bit | +3.8% | 2.30 GB | ~~yes~~ **no** — dominated by 2b+k512 |
-| **2-bit + k512** | **+4.3%** | **1.19 GB** | **YES** — same quality, half memory |
+| Flat 4-bit | +3.8% | 2.30 GB | likely dominated by 2b+k512 |
+| **2-bit + k512** | **+4.3%** | **1.19 GB** | **YES** — similar quality, half memory |
 | 4-bit + k128 | +0.6% | 2.33 GB | YES — best quality |
 
-Flat 4-bit is **Pareto-dominated**: 2-bit + k512 achieves the same PPL
-at half the memory. There is no reason to use flat 4-bit anymore.
+## IMPORTANT CAVEAT (Honest Correction #9)
+
+All PPL measurements were performed at **957 tokens** (tokenizer cap).
+At this eval length, k_highres=512 means **53.5% of tokens are FP32** —
+very different from real long-context use (e.g., 32K where it's 1.6%).
+
+The "Pareto-dominates" claim is **theoretically motivated** (attention
+concentrates ~70% on recent ~500 tokens) but **NOT empirically validated
+at long context**. The true 2-bit quality at 32K context with only 1.6%
+FP32 tokens is likely worse than measured here.
+
+**What IS validated**: progressive (4-bit + k128) at 957 tokens, where
+k128 = 13.4% FP32 — similar to the real ratio at 1K context. This
+finding (+3.8% → +0.6%) is reliable.
 
 ## Memory impact at scale
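The caveat leans on the claim that attention concentrates roughly 70% of its mass on the most recent ~500 tokens. One hedged way to sanity-check such a claim is to sum attention weights over a recency window. A self-contained sketch on a synthetic causal attention matrix (the helper name and the toy recency bias are illustrative only, not measurements from the benchmarked model):

```python
import numpy as np

def recent_attention_mass(attn: np.ndarray, window: int = 512) -> float:
    """Mean fraction of each query's attention mass that falls on its
    most recent `window` keys. `attn` is causal; rows sum to 1."""
    n = attn.shape[0]
    fracs = []
    for q in range(n):
        lo = max(q + 1 - window, 0)
        fracs.append(attn[q, lo:q + 1].sum())
    return float(np.mean(fracs))

# Toy causal attention with an exponential recency bias (illustration only):
n = 2048
dist = np.arange(n)[:, None] - np.arange(n)[None, :]      # query index minus key index
scores = np.where(dist >= 0, np.exp(-dist / 200.0), 0.0)  # causal mask + recency decay
attn = scores / scores.sum(axis=1, keepdims=True)
mass = recent_attention_mass(attn, window=512)            # high for this toy bias
```

On a real model the same measurement would use attention probabilities captured from a forward pass instead of the synthetic `attn`.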

bench/results/progressive_kv_compression.md

Lines changed: 4 additions & 0 deletions
@@ -34,6 +34,10 @@ The sweet spot is **k_highres=128**:
 - 64→128 shows meaningful improvement (13.71 → 13.64)
 - Below 64 the benefit drops off
 
+**Validation note**: measured at 957 tokens. k128 = 13.4% FP32, which is
+representative of real ~1K context. Longer-context validation (4K+) pending
+due to tokenizer cap at ~958 tokens. The finding is reliable at this scale.
+
 ## Memory Impact at Scale
 
 At 32K context with Llama 3.2 3B:
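For readers unfamiliar with the scheme being benchmarked, here is a rough sketch of progressive KV compression: the most recent k_highres token vectors stay full precision, and older vectors are round-tripped through low-bit symmetric per-token quantization. This is an illustrative reconstruction under stated assumptions, not the repository's implementation:

```python
import numpy as np

def progressive_quantize_dequantize(kv: np.ndarray, k_highres: int = 128,
                                    bits: int = 4) -> np.ndarray:
    """Illustrative sketch: recent k_highres token vectors stay full
    precision; older vectors are quantized/dequantized with symmetric
    per-token integer quantization at `bits` bits."""
    kv = np.asarray(kv, dtype=np.float32)
    split = max(kv.shape[0] - k_highres, 0)
    if split == 0:
        return kv.copy()          # everything fits in the high-res window
    old = kv[:split]
    qmax = 2.0 ** (bits - 1) - 1  # e.g. 7 for 4-bit signed
    scale = np.maximum(np.abs(old).max(axis=-1, keepdims=True), 1e-8) / qmax
    old_q = np.round(old / scale) * scale  # quantize, then dequantize
    return np.concatenate([old_q, kv[split:]], axis=0)
```

With k_highres=128 on a 957-token sequence, 128/957 ≈ 13.4% of rows pass through untouched — the regime the validation note calls representative of ~1K context.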
