Commit 1a7fe20

unamedkr and claude committed
honest correction #9: add eval-length caveats to S1/progressive benchmarks
All PPL measurements were at 957 tokens (tokenizer cap). The S1 claim "2-bit+k512 Pareto-dominates flat 4-bit" was measured with 53.5% FP32 tokens — not representative of long context (1.6% at 32K). Reframed from "Pareto-dominated" to "likely dominated, theoretically motivated but not empirically validated at long context."

The progressive finding (4-bit+k128: +3.8% → +0.6%) is validated at 957 tokens, where k128 = 13.4% FP32, representative of ~1K-context use. Both benchmark docs now include explicit validation notes.

This is honest correction #9 in the project's retraction track record.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
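The percentages quoted in the commit message are straight ratio arithmetic. A minimal sketch reproduces them (the `fp32_fraction` helper is illustrative, not part of the repo):

```python
def fp32_fraction(k_highres: int, context_len: int) -> float:
    """Fraction of cached tokens held at FP32 when the most recent
    min(k_highres, context_len) tokens are kept full-precision."""
    return min(k_highres, context_len) / context_len

# Percentages quoted in the commit message:
print(round(fp32_fraction(512, 957) * 100, 1))        # 53.5 — k512 at the 957-token eval cap
print(round(fp32_fraction(512, 32 * 1024) * 100, 1))  # 1.6  — k512 at 32K context
print(round(fp32_fraction(128, 957) * 100, 1))        # 13.4 — k128 at the eval cap
```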
1 parent b8286b0 commit 1a7fe20

2 files changed

Lines changed: 20 additions & 4 deletions

File tree

bench/results/attention_aware_quantization.md

Lines changed: 16 additions & 4 deletions
@@ -27,12 +27,24 @@ information-theoretically near-optimal.
 
 | Method | PPL penalty | Memory (32K) | Pareto-optimal? |
 |---|---:|---:|---|
-| Flat 4-bit | +3.8% | 2.30 GB | ~~yes~~ **no** — dominated by 2b+k512 |
-| **2-bit + k512** | **+4.3%** | **1.19 GB** | **YES** — same quality, half memory |
+| Flat 4-bit | +3.8% | 2.30 GB | likely dominated by 2b+k512 |
+| **2-bit + k512** | **+4.3%** | **1.19 GB** | **YES** — similar quality, half memory |
 | 4-bit + k128 | +0.6% | 2.33 GB | YES — best quality |
 
-Flat 4-bit is **Pareto-dominated**: 2-bit + k512 achieves the same PPL
-at half the memory. There is no reason to use flat 4-bit anymore.
+## IMPORTANT CAVEAT (Honest Correction #9)
+
+All PPL measurements were performed at **957 tokens** (tokenizer cap).
+At this eval length, k_highres=512 means **53.5% of tokens are FP32** —
+very different from real long-context use (e.g., 32K where it's 1.6%).
+
+The "Pareto-dominates" claim is **theoretically motivated** (attention
+concentrates ~70% on recent ~500 tokens) but **NOT empirically validated
+at long context**. The true 2-bit quality at 32K context with only 1.6%
+FP32 tokens is likely worse than measured here.
+
+**What IS validated**: progressive (4-bit + k128) at 957 tokens, where
+k128 = 13.4% FP32 — similar to the real ratio at 1K context. This
+finding (+3.8% → +0.6%) is reliable.
 
 ## Memory impact at scale
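The caveat leans on the claim that attention concentrates roughly 70% of its mass on the most recent ~500 tokens. One hedged way to sanity-check such a claim is to sum attention weights over a recency window. A self-contained sketch on a synthetic causal attention matrix (the helper name and the toy recency bias are illustrative only, not measurements from the benchmarked model):

```python
import numpy as np

def recent_attention_mass(attn: np.ndarray, window: int = 512) -> float:
    """Mean fraction of each query's attention mass that falls on its
    most recent `window` keys. `attn` is causal; rows sum to 1."""
    n = attn.shape[0]
    fracs = []
    for q in range(n):
        lo = max(q + 1 - window, 0)
        fracs.append(attn[q, lo:q + 1].sum())
    return float(np.mean(fracs))

# Toy causal attention with an exponential recency bias (illustration only):
n = 2048
dist = np.arange(n)[:, None] - np.arange(n)[None, :]      # query index minus key index
scores = np.where(dist >= 0, np.exp(-dist / 200.0), 0.0)  # causal mask + recency decay
attn = scores / scores.sum(axis=1, keepdims=True)
mass = recent_attention_mass(attn, window=512)            # high for this toy bias
```

On a real model the same measurement would use attention probabilities captured from a forward pass instead of the synthetic `attn`.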

bench/results/progressive_kv_compression.md

Lines changed: 4 additions & 0 deletions
@@ -34,6 +34,10 @@ The sweet spot is **k_highres=128**:
 - 64→128 shows meaningful improvement (13.71 → 13.64)
 - Below 64 the benefit drops off
 
+**Validation note**: measured at 957 tokens. k128 = 13.4% FP32, which is
+representative of real ~1K context. Longer-context validation (4K+) pending
+due to tokenizer cap at ~958 tokens. The finding is reliable at this scale.
+
 ## Memory Impact at Scale
 
 At 32K context with Llama 3.2 3B:
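For readers unfamiliar with the scheme being benchmarked, here is a rough sketch of progressive KV compression: the most recent k_highres token vectors stay full precision, and older vectors are round-tripped through low-bit symmetric per-token quantization. This is an illustrative reconstruction under stated assumptions, not the repository's implementation:

```python
import numpy as np

def progressive_quantize_dequantize(kv: np.ndarray, k_highres: int = 128,
                                    bits: int = 4) -> np.ndarray:
    """Illustrative sketch: recent k_highres token vectors stay full
    precision; older vectors are quantized/dequantized with symmetric
    per-token integer quantization at `bits` bits."""
    kv = np.asarray(kv, dtype=np.float32)
    split = max(kv.shape[0] - k_highres, 0)
    if split == 0:
        return kv.copy()          # everything fits in the high-res window
    old = kv[:split]
    qmax = 2.0 ** (bits - 1) - 1  # e.g. 7 for 4-bit signed
    scale = np.maximum(np.abs(old).max(axis=-1, keepdims=True), 1e-8) / qmax
    old_q = np.round(old / scale) * scale  # quantize, then dequantize
    return np.concatenate([old_q, kv[split:]], axis=0)
```

With k_highres=128 on a 957-token sequence, 128/957 ≈ 13.4% of rows pass through untouched — the regime the validation note calls representative of ~1K context.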
