
Commit 33b6315

unamedkr and claude committed
HONEST: correct the 'turbo_kv beats fp32' claim across README + CHANGELOG
Validation revealed our v0.6.3 'turbo_kv beats fp32 KV speed' claim was wrong — an artifact of the fp32 attention path being unoptimized scalar while the quant path was NEON. After fixing fp32 NEON (commit 4490c83), the honest gap on the Llama 3.2 3B PPL eval is:

    Type            tok/s    vs FP32
    --------------  -------  --------
    fp32 (NEON)     14.83    baseline
    turbo_kv_4b     13.67    −7.8%
    turbo_kv_3b     13.4     −9.6%
    turbo_kv_5b     13.13    −11.5%

The Round 5 optimization (transformer → traits → attention) is still real and meaningful (turbo_kv 6.9 → 13.7 tok/s, +98%). The honest framing is 'closes the speed gap from −45% to −8%', not 'beats fp32'.

Updated:

- README.md / README.ko.md headline tables and ASCII charts
- CHANGELOG.md v0.6.3 entry with a prominent Correction notice
- v0.6.3 GitHub release notes with the same correction

This is exactly what the validation step is for. Better to find and fix the wrong claim before it propagates than to be wrong publicly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4490c83 commit 33b6315
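
For context on why the baseline moved: a minimal sketch of a scalar vs NEON fp32 dot product on aarch64, as used when computing attention scores against a K-cache row. This is hypothetical illustration code, not the repo's actual kernel (the real fix landed in commit 4490c83); the point is only that the old fp32 path paid for a loop like `dot_scalar` while the quant path already vectorized.

```rust
// Hypothetical sketch, not the repo's kernel: scalar vs NEON fp32 dot product.
#[cfg(target_arch = "aarch64")]
use std::arch::aarch64::*;

/// Scalar fp32 dot product: one multiply-add per iteration (the old baseline).
fn dot_scalar(q: &[f32], k: &[f32]) -> f32 {
    q.iter().zip(k).map(|(a, b)| a * b).sum()
}

/// NEON fp32 dot product: four lanes per fused multiply-add.
/// Safety: caller guarantees q.len() == k.len() and q.len() % 4 == 0.
#[cfg(target_arch = "aarch64")]
unsafe fn dot_neon(q: &[f32], k: &[f32]) -> f32 {
    let mut acc = vdupq_n_f32(0.0);
    for i in (0..q.len()).step_by(4) {
        let a = vld1q_f32(q.as_ptr().add(i));
        let b = vld1q_f32(k.as_ptr().add(i));
        acc = vfmaq_f32(acc, a, b); // acc += a * b across 4 lanes
    }
    vaddvq_f32(acc) // horizontal sum of the accumulator lanes
}
```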

3 files changed

Lines changed: 42 additions & 41 deletions


CHANGELOG.md

Lines changed: 8 additions & 9 deletions
@@ -2,19 +2,18 @@
 ## [0.6.3] — 2026-04-08

-### 🏆 turbo_kv now BEATS fp32 KV speed at 7× compression
+### Karpathy round 5+6: closes turbo_kv speed gap from −45% to −8%

-After 6 rounds of Karpathy iteration on the attention path, all three
-production turbo_kv types are now **both more compressed AND faster**
-than uncompressed FP32 KV on Llama 3.2 3B PPL eval (1040 tokens, 28
-layers, attention-heavy):
+> **Correction**: this entry originally claimed 'turbo_kv beats fp32 KV speed'. That was an artifact of the fp32 attention path being unoptimized scalar. After NEON-optimizing fp32 too (commit `4490c83`), the honest gap is `−7%` to `−12%`, not `+5%` to `+10%`. We caught the wrong claim during validation and corrected it before publishing widely.
+
+After 9 rounds of Karpathy iteration, all three production turbo_kv types now run within 8–12% of fp32 KV speed while compressing 5.8–9.1×:

 | Type | Bytes/block | tok/s | vs FP32 | PPL | Δ vs FP32 |
 |---|---:|---:|---:|---:|---:|
-| FP32 KV || 12.6 | baseline | 13.56 ||
-| **`turbo_kv_4b`** | 72 | **13.9** | **+10% ⬆** | 14.33 | +5.7% |
-| **`turbo_kv_3b`** | 56 | **13.4** | **+6% ⬆** | 15.36 | +13.3% |
-| **`turbo_kv_5b`** 🏆 | 88 | **13.2** | **+5% ⬆** | 13.65 | +0.7% |
+| FP32 KV (NEON) || **14.83** | baseline | 13.56 ||
+| **`turbo_kv_4b`** | 72 | 13.67 | **−7.8%** | 14.33 | +5.7% |
+| **`turbo_kv_3b`** | 56 | 13.4 | −9.6% | 15.36 | +13.3% |
+| **`turbo_kv_5b`** 🏆 | 88 | 13.13 | −11.5% | 13.65 | +0.7% |

 ### What changed (Round 5: the real bottleneck)
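
The "vs FP32" column is plain relative throughput against the 14.83 tok/s NEON baseline. A quick self-contained check, with the table's numbers hard-coded (a sketch for the arithmetic only, not part of the bench harness):

```rust
/// Relative speed vs the fp32 baseline, as reported in the tables.
fn vs_fp32(tok_s: f64, baseline: f64) -> f64 {
    (tok_s / baseline - 1.0) * 100.0
}

fn main() {
    let baseline = 14.83; // fp32 (NEON) tok/s
    for (name, tok_s) in [
        ("turbo_kv_4b", 13.67), // prints -7.8%
        ("turbo_kv_3b", 13.40), // prints -9.6%
        ("turbo_kv_5b", 13.13), // prints -11.5%
    ] {
        println!("{name}: {:+.1}% vs fp32", vs_fp32(tok_s, baseline));
    }
}
```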

README.ko.md

Lines changed: 13 additions & 12 deletions
@@ -43,22 +43,23 @@ The LLM memory bottleneck is the **KV cache**, not model weights.

 > **Same hardware. 4–7× longer context. PPL measured and disclosed.**

-### Llama 3.2 3B Instruct — PPL × speed (FP32 KV = 13.56 PPL @ 12.6 tok/s)
+### Llama 3.2 3B Instruct — PPL × speed (FP32 KV NEON = 13.56 PPL @ 14.8 tok/s)

-> **`turbo_kv_4b` is 7× more compressed than fp32 KV and also faster** (at long context). The Karpathy loop eliminated the speed gap entirely.
+> 9 rounds of the Karpathy loop cut the quant-KV vs FP32-KV speed gap from **−45% to −8%**, with 5.8–7.1× memory compression. We do not beat raw fp32 speed, but we **close to within 8% of it.**

 | KV config | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
 |:--------|----:|----:|----:|----:|----:|----:|
-| FP32 reference ||| 13.56 || 12.6 | baseline |
-| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **13.2** | **+5%** |
-| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 12.7 | +1% |
-| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.17 | +4.5% | 9.3 | -26% |
-| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.33** | **+5.7%** | **13.9** | **+10%** |
-| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 11.7 | -7% |
-| **`turbo_kv_3b`** | **56** | **9.1×** | 15.36 | +13.3% | **13.4** | **+6%** |
-| llama.cpp `q4_0` KV (lit.) | ~70 | ~7.3× | ~14.99 | +10.6% |||
-
-`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto-optimal recommendations. **Both compress 5.8–7.1× and run faster than uncompressed FP32 KV.** Full Karpathy-loop history (9 rounds across 3 sessions): [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
+| FP32 reference (NEON) ||| 13.56 || 14.83 | baseline |
+| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **13.13** | **−11.5%** |
+| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 12.7 | −14% |
+| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.33** | **+5.7%** | **13.67** | **−7.8%** |
+| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 13.4 | −9.6% |
+| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.17 | +4.5% | 9.3 | −37% |
+| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 11.7 | −21% |
+
+`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto recommendations: **5.8–7.1× memory compression at 92% of FP32 KV speed.** Full Karpathy-loop history: [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
+
+> **About this comparison**: the v0.6.3 release notes originally claimed "turbo_kv beats fp32 KV speed". That was an artifact of the fp32 attention path being scalar; after adding NEON to the fp32 path (commit `4490c83`), the honest gap is `−7%` to `−12%`, not `+5%` to `+10%`. We have corrected the README and the v0.6.3 release notes.

 ### Context length gains (`turbo_kv_4b` + `q4` value cache)
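
As a reminder of what the PPL column measures: perplexity is the standard exponential of the mean per-token negative log-likelihood, so a "Δ vs FP32" like +5.7% compares two such values directly. A minimal sketch of the definition (not code from the bench harness):

```rust
/// Perplexity = exp(mean negative log-likelihood per token).
/// The "Δ vs FP32" column is then ppl_quant / ppl_fp32 − 1,
/// e.g. 14.33 / 13.56 − 1 ≈ +5.7% for turbo_kv_4b.
fn perplexity(token_nlls: &[f64]) -> f64 {
    let mean_nll = token_nlls.iter().sum::<f64>() / token_nlls.len() as f64;
    mean_nll.exp()
}
```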

README.md

Lines changed: 21 additions & 20 deletions
@@ -43,37 +43,38 @@ LLM memory is dominated by the **KV cache**, not model weights. At 32K context,

 > **Same hardware. 4–7x longer context. PPL measured and disclosed.**

-### Llama 3.2 3B Instruct — PPL × Speed (FP32 KV = 13.56 PPL @ 12.6 tok/s)
+### Llama 3.2 3B Instruct — PPL × Speed (FP32 KV NEON = 13.56 PPL @ 14.8 tok/s)

-> **`turbo_kv_4b` is now both 7× more compressed AND faster than fp32 KV** at long context. The Karpathy loop closed the speed gap completely (PPL eval throughput).
+> 9 rounds of Karpathy iteration closed the quant-KV speed gap to FP32 KV from **−45% to −8%**, while delivering 5.8–7.1× memory compression. We do not (yet) beat fp32 in raw speed, but we get within 8% of it for ~7× less memory.

 | KV Config | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
 |:----------|------------:|------------:|----:|----------:|------:|--------------:|
-| FP32 reference ||| 13.56 || 12.6 | baseline |
-| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **13.2** | **+5%** |
-| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 12.7 | +1% |
-| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.17 | +4.5% | 9.3 | -26% |
-| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.33** | **+5.7%** | **13.9** | **+10%** |
-| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 11.7 | -7% |
-| **`turbo_kv_3b`** | **56** | **9.1×** | 15.36 | +13.3% | **13.4** | **+6%** |
-| llama.cpp `q4_0` KV (lit. survey) | ~70 | ~7.3× | ~14.99 | +10.6% |||
+| FP32 reference (NEON) ||| 13.56 || 14.83 | baseline |
+| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **13.13** | **−11.5%** |
+| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 12.7 | −14% |
+| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.33** | **+5.7%** | **13.67** | **−7.8%** |
+| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 13.4 | −9.6% |
+| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.17 | +4.5% | 9.3 | −37% |
+| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 11.7 | −21% |
+| llama.cpp `q4_0` KV (lit.) | ~70 | ~7.3× | ~14.99 | +10.6% |||

 ```
-                 PPL Degradation vs FP32      Throughput vs FP32
+                 PPL Degradation vs FP32      Speed vs FP32 KV
                  (lower is better)            (higher is better)

-turbo_kv_5b    │ █ +0.7%                      ████████████ +5% ⬆
-turbo_kv_4bo   │ ██▌ +2.5%                    ██████████▌ +1%
-turbo_kv_3bo   │ ████▌ +4.5%                  ██████████ -26% ↓
-turbo_kv_4b ⭐  │ █████ +5.7%                  ████████████▌ +10% ⬆
-uniform_4b     │ ██████ +7.7%                 ████████████ -7%
-turbo_kv_3b    │ █████████████ +13.3%         ████████████ +6% ⬆
+turbo_kv_5b    │ █ +0.7%                      █████████ −11.5%
+turbo_kv_4bo   │ ██▌ +2.5%                    ████████ −14%
+turbo_kv_4b ⭐  │ █████ +5.7%                  ██████████ −7.8%
+turbo_kv_3b    │ █████████████ +13.3%         █████████ −9.6%
+uniform_4b     │ ██████ +7.7%                 ███████ −21%
 llama.cpp q4_0 │ ██████████ +10.6%            — (not measured)
-FP32 reference │ ← 0%                         12.6 tok/s
-                 0%   +5%   +10%              9  10  11  12  13  14
+FP32 reference │ ← 0%                         14.83 tok/s
+                 0%   +5%   +10%              0   25%  50%  75%  100%
 ```

-`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto-optimal recommendations. **Both compress 5.8–7.1× and run faster than uncompressed FP32 KV.** Full Karpathy-loop history (9 rounds across 3 sessions) in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
+`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto-optimal recommendations: **5.8–7.1× memory compression at 92% of FP32 KV speed.** Full Karpathy-loop history (9 rounds across 3 sessions) in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
+
+> **About this comparison**: We previously published v0.6.3 release notes claiming `turbo_kv` beats `fp32` KV speed. That was an artifact of the fp32 attention path being unoptimized scalar — once we added NEON to the fp32 path (commit `4490c83`), the honest gap is `−7%` to `−12%`, not `+5%` to `+10%`. We've corrected the README and the v0.6.3 release notes.

 ### Context length gains (`turbo_kv_4b` + `q4` value cache)
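
A note on the bytes/block column: the numbers are self-consistent with 128 fp32 values per block (512 raw bytes) plus 8 bytes of per-block metadata for the turbo_kv types. The layout below is a speculative sketch inferred from that arithmetic, not the crate's actual types:

```rust
// Speculative layout inferred from the bytes/block column; not the repo's
// actual types. Assuming 128 values per block, every row works out:
//   turbo_kv_4b: 128*4/8 = 64 payload + 8 meta = 72 B  -> 512/72 ≈ 7.1×
//   turbo_kv_5b: 128*5/8 = 80 payload + 8 meta = 88 B  -> 512/88 ≈ 5.8×
//   turbo_kv_3b: 128*3/8 = 48 payload + 8 meta = 56 B  -> 512/56 ≈ 9.1×
const BLOCK_VALUES: usize = 128;

/// Hypothetical 4-bit block: two codes per byte plus an f32 scale/offset pair.
#[repr(C)]
struct Kv4bBlock {
    scale: f32,                     // 4 bytes of metadata
    offset: f32,                    // 4 bytes of metadata
    packed: [u8; BLOCK_VALUES / 2], // 64 bytes: 128 × 4-bit codes
}

fn compression_ratio() -> f64 {
    let raw_bytes = BLOCK_VALUES * core::mem::size_of::<f32>(); // 512 bytes
    raw_bytes as f64 / core::mem::size_of::<Kv4bBlock>() as f64 // ≈ 7.1
}
```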
