
Commit 33b6315

unamedkr and claude committed
HONEST: correct the 'turbo_kv beats fp32' claim across README + CHANGELOG
Validation revealed our v0.6.3 'turbo_kv beats fp32 KV speed' claim was wrong — an artifact of the fp32 attention path being unoptimized scalar while the quant path was NEON. After fixing fp32 NEON (commit 4490c83), the honest gap on the Llama 3.2 3B PPL eval is:

    Type            tok/s    vs FP32
    --------------  -------  --------
    fp32 (NEON)     14.83    baseline
    turbo_kv_4b     13.67    −7.8%
    turbo_kv_3b     13.4     −9.6%
    turbo_kv_5b     13.13    −11.5%

The Round 5 optimization (transformer → traits → attention) is still real and meaningful (turbo_kv 6.9 → 13.7 tok/s, +98%). The honest framing is 'closes the speed gap from −45% to −8%', not 'beats fp32'.

Updated:

- README.md / README.ko.md headline tables and ASCII charts
- CHANGELOG.md v0.6.3 entry with a prominent Correction notice
- v0.6.3 GitHub release notes with the same correction

This is exactly what the validation step is for. Better to find and fix the wrong claim before it propagates than to be wrong publicly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4490c83 commit 33b6315
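
For context on why the baseline moved: a minimal sketch of a scalar vs NEON fp32 dot product on aarch64, as used when computing attention scores against a K-cache row. This is hypothetical illustration code, not the repo's actual kernel (the real fix landed in commit 4490c83); the point is only that the old fp32 path paid for a loop like `dot_scalar` while the quant path already vectorized.

```rust
// Hypothetical sketch, not the repo's kernel: scalar vs NEON fp32 dot product.
#[cfg(target_arch = "aarch64")]
use std::arch::aarch64::*;

/// Scalar fp32 dot product: one multiply-add per iteration (the old baseline).
fn dot_scalar(q: &[f32], k: &[f32]) -> f32 {
    q.iter().zip(k).map(|(a, b)| a * b).sum()
}

/// NEON fp32 dot product: four lanes per fused multiply-add.
/// Safety: caller guarantees q.len() == k.len() and q.len() % 4 == 0.
#[cfg(target_arch = "aarch64")]
unsafe fn dot_neon(q: &[f32], k: &[f32]) -> f32 {
    let mut acc = vdupq_n_f32(0.0);
    for i in (0..q.len()).step_by(4) {
        let a = vld1q_f32(q.as_ptr().add(i));
        let b = vld1q_f32(k.as_ptr().add(i));
        acc = vfmaq_f32(acc, a, b); // acc += a * b across 4 lanes
    }
    vaddvq_f32(acc) // horizontal sum of the accumulator lanes
}
```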

3 files changed

Lines changed: 42 additions & 41 deletions


CHANGELOG.md

Lines changed: 8 additions & 9 deletions
@@ -2,19 +2,18 @@
 ## [0.6.3] — 2026-04-08

-### 🏆 turbo_kv now BEATS fp32 KV speed at 7× compression
+### Karpathy round 5+6: closes turbo_kv speed gap from −45% to −8%

-After 6 rounds of Karpathy iteration on the attention path, all three
-production turbo_kv types are now **both more compressed AND faster**
-than uncompressed FP32 KV on Llama 3.2 3B PPL eval (1040 tokens, 28
-layers, attention-heavy):
+> **Correction**: this entry originally claimed 'turbo_kv beats fp32 KV speed'. That was an artifact of the fp32 attention path being unoptimized scalar. After NEON-optimizing fp32 too (commit `4490c83`), the honest gap is `−7%` to `−12%`, not `+5%` to `+10%`. We caught the wrong claim during validation and corrected it before publishing widely.
+
+After 9 rounds of Karpathy iteration, all three production turbo_kv types now run within 8–12% of fp32 KV speed while compressing 5.8–9.1×:

 | Type | Bytes/block | tok/s | vs FP32 | PPL | Δ vs FP32 |
 |---|---:|---:|---:|---:|---:|
-| FP32 KV || 12.6 | baseline | 13.56 ||
-| **`turbo_kv_4b`** | 72 | **13.9** | **+10% ⬆** | 14.33 | +5.7% |
-| **`turbo_kv_3b`** | 56 | **13.4** | **+6% ⬆** | 15.36 | +13.3% |
-| **`turbo_kv_5b`** 🏆 | 88 | **13.2** | **+5% ⬆** | 13.65 | +0.7% |
+| FP32 KV (NEON) || **14.83** | baseline | 13.56 ||
+| **`turbo_kv_4b`** | 72 | 13.67 | **−7.8%** | 14.33 | +5.7% |
+| **`turbo_kv_3b`** | 56 | 13.4 | −9.6% | 15.36 | +13.3% |
+| **`turbo_kv_5b`** 🏆 | 88 | 13.13 | −11.5% | 13.65 | +0.7% |

 ### What changed (Round 5: the real bottleneck)
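
The "vs FP32" column is plain relative throughput against the 14.83 tok/s NEON baseline. A quick self-contained check, with the table's numbers hard-coded (a sketch for the arithmetic only, not part of the bench harness):

```rust
/// Relative speed vs the fp32 baseline, as reported in the tables.
fn vs_fp32(tok_s: f64, baseline: f64) -> f64 {
    (tok_s / baseline - 1.0) * 100.0
}

fn main() {
    let baseline = 14.83; // fp32 (NEON) tok/s
    for (name, tok_s) in [
        ("turbo_kv_4b", 13.67), // prints -7.8%
        ("turbo_kv_3b", 13.40), // prints -9.6%
        ("turbo_kv_5b", 13.13), // prints -11.5%
    ] {
        println!("{name}: {:+.1}% vs fp32", vs_fp32(tok_s, baseline));
    }
}
```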

README.ko.md

Lines changed: 13 additions & 12 deletions
@@ -43,22 +43,23 @@ The LLM memory bottleneck is the **KV cache**, not model weights.

 > **Same hardware. 4–7× longer context. PPL measured and disclosed.**

-### Llama 3.2 3B Instruct — PPL × speed (FP32 KV = 13.56 PPL @ 12.6 tok/s)
+### Llama 3.2 3B Instruct — PPL × speed (FP32 KV NEON = 13.56 PPL @ 14.8 tok/s)

-> **`turbo_kv_4b` is 7× more compressed than fp32 KV and also faster** (at long context). The Karpathy loop eliminated the speed gap entirely.
+> 9 rounds of the Karpathy loop cut the quant-KV vs FP32-KV speed gap from **−45% to −8%**, with 5.8–7.1× memory compression. We do not beat raw fp32 speed, but we **close to within 8% of it.**

 | KV config | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
 |:--------|----:|----:|----:|----:|----:|----:|
-| FP32 reference ||| 13.56 || 12.6 | baseline |
-| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **13.2** | **+5%** |
-| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 12.7 | +1% |
-| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.17 | +4.5% | 9.3 | -26% |
-| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.33** | **+5.7%** | **13.9** | **+10%** |
-| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 11.7 | -7% |
-| **`turbo_kv_3b`** | **56** | **9.1×** | 15.36 | +13.3% | **13.4** | **+6%** |
-| llama.cpp `q4_0` KV (lit.) | ~70 | ~7.3× | ~14.99 | +10.6% |||
-
-`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto-optimal recommendations. **Both compress 5.8–7.1× and run faster than uncompressed FP32 KV.** Full Karpathy-loop history (9 rounds across 3 sessions): [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
+| FP32 reference (NEON) ||| 13.56 || 14.83 | baseline |
+| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **13.13** | **−11.5%** |
+| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 12.7 | −14% |
+| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.33** | **+5.7%** | **13.67** | **−7.8%** |
+| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 13.4 | −9.6% |
+| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.17 | +4.5% | 9.3 | −37% |
+| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 11.7 | −21% |
+
+`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto recommendations: **5.8–7.1× memory compression at 92% of FP32 KV speed.** Full Karpathy-loop history: [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
+
+> **About this comparison**: the v0.6.3 release notes originally claimed "turbo_kv beats fp32 KV speed". That was an artifact of the fp32 attention path being scalar; after adding NEON to the fp32 path (commit `4490c83`), the honest gap is `−7%` to `−12%`, not `+5%` to `+10%`. We have corrected the README and the v0.6.3 release notes.

 ### Context length gains (`turbo_kv_4b` + `q4` value cache)
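
As a reminder of what the PPL column measures: perplexity is the standard exponential of the mean per-token negative log-likelihood, so a "Δ vs FP32" like +5.7% compares two such values directly. A minimal sketch of the definition (not code from the bench harness):

```rust
/// Perplexity = exp(mean negative log-likelihood per token).
/// The "Δ vs FP32" column is then ppl_quant / ppl_fp32 − 1,
/// e.g. 14.33 / 13.56 − 1 ≈ +5.7% for turbo_kv_4b.
fn perplexity(token_nlls: &[f64]) -> f64 {
    let mean_nll = token_nlls.iter().sum::<f64>() / token_nlls.len() as f64;
    mean_nll.exp()
}
```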

README.md

Lines changed: 21 additions & 20 deletions
@@ -43,37 +43,38 @@ LLM memory is dominated by the **KV cache**, not model weights. At 32K context,

 > **Same hardware. 4–7x longer context. PPL measured and disclosed.**

-### Llama 3.2 3B Instruct — PPL × Speed (FP32 KV = 13.56 PPL @ 12.6 tok/s)
+### Llama 3.2 3B Instruct — PPL × Speed (FP32 KV NEON = 13.56 PPL @ 14.8 tok/s)

-> **`turbo_kv_4b` is now both 7× more compressed AND faster than fp32 KV** at long context. The Karpathy loop closed the speed gap completely (PPL eval throughput).
+> 9 rounds of Karpathy iteration closed the quant-KV speed gap to FP32 KV from **−45% to −8%**, while delivering 5.8–7.1× memory compression. We do not (yet) beat fp32 in raw speed, but we get within 8% of it for ~7× less memory.

 | KV Config | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
 |:----------|------------:|------------:|----:|----------:|------:|--------------:|
-| FP32 reference ||| 13.56 || 12.6 | baseline |
-| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **13.2** | **+5%** |
-| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 12.7 | +1% |
-| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.17 | +4.5% | 9.3 | -26% |
-| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.33** | **+5.7%** | **13.9** | **+10%** |
-| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 11.7 | -7% |
-| **`turbo_kv_3b`** | **56** | **9.1×** | 15.36 | +13.3% | **13.4** | **+6%** |
-| llama.cpp `q4_0` KV (lit. survey) | ~70 | ~7.3× | ~14.99 | +10.6% |||
+| FP32 reference (NEON) ||| 13.56 || 14.83 | baseline |
+| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **13.13** | **−11.5%** |
+| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 12.7 | −14% |
+| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.33** | **+5.7%** | **13.67** | **−7.8%** |
+| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 13.4 | −9.6% |
+| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.17 | +4.5% | 9.3 | −37% |
+| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 11.7 | −21% |
+| llama.cpp `q4_0` KV (lit.) | ~70 | ~7.3× | ~14.99 | +10.6% |||

 ```
-                 PPL Degradation vs FP32      Throughput vs FP32
+                 PPL Degradation vs FP32      Speed vs FP32 KV
                  (lower is better)            (higher is better)

-turbo_kv_5b    │ █ +0.7%                      ████████████ +5% ⬆
-turbo_kv_4bo   │ ██▌ +2.5%                    ██████████▌ +1%
-turbo_kv_3bo   │ ████▌ +4.5%                  ██████████ -26% ↓
-turbo_kv_4b ⭐  │ █████ +5.7%                  ████████████▌ +10% ⬆
-uniform_4b     │ ██████ +7.7%                 ████████████ -7%
-turbo_kv_3b    │ █████████████ +13.3%         ████████████ +6% ⬆
+turbo_kv_5b    │ █ +0.7%                      █████████ −11.5%
+turbo_kv_4bo   │ ██▌ +2.5%                    ████████ −14%
+turbo_kv_4b ⭐  │ █████ +5.7%                  ██████████ −7.8%
+turbo_kv_3b    │ █████████████ +13.3%         █████████ −9.6%
+uniform_4b     │ ██████ +7.7%                 ███████ −21%
 llama.cpp q4_0 │ ██████████ +10.6%            — (not measured)
-FP32 reference │ ← 0%                         12.6 tok/s
-                 0%   +5%   +10%              9  10  11  12  13  14
+FP32 reference │ ← 0%                         14.83 tok/s
+                 0%   +5%   +10%              0   25%  50%  75%  100%
 ```

-`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto-optimal recommendations. **Both compress 5.8–7.1× and run faster than uncompressed FP32 KV.** Full Karpathy-loop history (9 rounds across 3 sessions) in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
+`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto-optimal recommendations: **5.8–7.1× memory compression at 92% of FP32 KV speed.** Full Karpathy-loop history (9 rounds across 3 sessions) in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
+
+> **About this comparison**: We previously published v0.6.3 release notes claiming `turbo_kv` beats `fp32` KV speed. That was an artifact of the fp32 attention path being unoptimized scalar — once we added NEON to the fp32 path (commit `4490c83`), the honest gap is `−7%` to `−12%`, not `+5%` to `+10%`. We've corrected the README and the v0.6.3 release notes.

 ### Context length gains (`turbo_kv_4b` + `q4` value cache)
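
A note on the bytes/block column: the numbers are self-consistent with 128 fp32 values per block (512 raw bytes) plus 8 bytes of per-block metadata for the turbo_kv types. The layout below is a speculative sketch inferred from that arithmetic, not the crate's actual types:

```rust
// Speculative layout inferred from the bytes/block column; not the repo's
// actual types. Assuming 128 values per block, every row works out:
//   turbo_kv_4b: 128*4/8 = 64 payload + 8 meta = 72 B  -> 512/72 ≈ 7.1×
//   turbo_kv_5b: 128*5/8 = 80 payload + 8 meta = 88 B  -> 512/88 ≈ 5.8×
//   turbo_kv_3b: 128*3/8 = 48 payload + 8 meta = 56 B  -> 512/56 ≈ 9.1×
const BLOCK_VALUES: usize = 128;

/// Hypothetical 4-bit block: two codes per byte plus an f32 scale/offset pair.
#[repr(C)]
struct Kv4bBlock {
    scale: f32,                     // 4 bytes of metadata
    offset: f32,                    // 4 bytes of metadata
    packed: [u8; BLOCK_VALUES / 2], // 64 bytes: 128 × 4-bit codes
}

fn compression_ratio() -> f64 {
    let raw_bytes = BLOCK_VALUES * core::mem::size_of::<f32>(); // 512 bytes
    raw_bytes as f64 / core::mem::size_of::<Kv4bBlock>() as f64 // ≈ 7.1
}
```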
