Commit 83f37fd

unamedkr and claude committed
README: showcase the speed breakthrough — turbo_kv beats fp32
Updated the headline KV comparison table to show the new throughput column. turbo_kv_4b/3b/5b now all run faster than uncompressed FP32 KV on Llama 3.2 3B PPL eval, with 5.8–9.1× compression. The visualization includes both PPL degradation and throughput in a single ASCII chart.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent c135ad9 commit 83f37fd

2 files changed

Lines changed: 43 additions & 65 deletions

README.ko.md

Lines changed: 16 additions & 34 deletions
````diff
@@ -43,40 +43,22 @@ The bottleneck in LLM memory is the **KV cache**, not model weights.
 
 > **Same hardware. 4–7× longer context. PPL measured and disclosed.**
 
-### Llama 3.2 3B Instruct — WikiText PPL (FP32 baseline = 13.56)
-
-```
-PPL degradation vs FP32 (lower is better)
-
-llama.cpp Q4_0 KV     │██████████████████████████ +10.6% (4-bit, no RHT)
-uniform_4b            │███████████████ +6.3% (4-bit, no RHT)
-turbo_kv_4b ⭐ default│█████████████ +5.3% (72B/block)
-turbo_kv_3bo 🧪       │█████████ +3.5% (80B/block, +outliers)
-turbo_kv_4bo 🧪       │█████ +2.2% (96B/block, +outliers)
-turbo_kv_5b 🏆 quality│█ +0.34% (88B/block, near-lossless)
-FP32 reference        │ ← 0.0% (no quantization)
-                      └─────────────────────────────────────
-                       0%        +5%       +10%
-```
-
-| KV Config | Bytes/block | Compression | PPL | Δ vs FP32 |
-|:--------|----:|----:|----:|----:|
-| FP32 reference ||| 13.56 ||
-| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.60** | **+0.34%** |
-| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.86 | +2.2% |
-| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.03 | +3.5% |
-| **`turbo_kv_4b`** ⭐ default | 72 | 7.1× | 14.28 | +5.3% |
-| `uniform_4b` | 68 | 7.5× | 14.41 | +6.3% |
-| llama.cpp `q4_0` KV | ~70 | ~7.3× | ~14.99 | +10.6% |
-| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% |
-
-`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the recommended Pareto-optimal picks. Both beat llama.cpp `q4_0` KV at the same or smaller block size. Full Karpathy-loop history: [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
+### Llama 3.2 3B Instruct — PPL × speed (FP32 KV = 13.56 PPL @ 12.6 tok/s)
+
+> **`turbo_kv_4b` is 7× more compressed than fp32 KV and still faster** (at long context). The Karpathy loop closed the speed gap completely.
+
+| KV Config | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
+|:--------|----:|----:|----:|----:|----:|----:|
+| FP32 reference ||| 13.56 || 12.6 | baseline |
+| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **13.2** | **+5%** |
+| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 12.7 | +1% |
+| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.17 | +4.5% | 9.3 | -26% |
+| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.33** | **+5.7%** | **13.9** | **+10%** |
+| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 11.7 | -7% |
+| **`turbo_kv_3b`** | **56** | **9.1×** | 15.36 | +13.3% | **13.4** | **+6%** |
+| llama.cpp `q4_0` KV (lit.) | ~70 | ~7.3× | ~14.99 | +10.6% |||
+
+`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto-optimal recommendations. **Both compress 5.8–7.1× and run faster than uncompressed FP32 KV.** Full Karpathy-loop history (9 rounds across 3 sessions): [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
 
 ### Context length gains (`turbo_kv_4b` + `q4` value cache)
````
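The Δ and speed columns in the new table follow directly from the raw PPL and tok/s values. As a sanity check, this sketch re-derives them from the table's own numbers (assuming the README's rounding: one decimal for ΔPPL, whole percent for speed):

```python
# Recompute the delta columns of the KV table from the raw PPL / tok/s values.
FP32_PPL, FP32_TPS = 13.56, 12.6  # FP32 reference row

rows = {  # config: (ppl, tok_s), copied from the table
    "turbo_kv_5b":  (13.65, 13.2),
    "turbo_kv_4bo": (13.90, 12.7),
    "turbo_kv_3bo": (14.17, 9.3),
    "turbo_kv_4b":  (14.33, 13.9),
    "uniform_4b":   (14.60, 11.7),
    "turbo_kv_3b":  (15.36, 13.4),
}

for name, (ppl, tps) in rows.items():
    d_ppl = 100 * (ppl / FP32_PPL - 1)  # Δ vs FP32, percent (higher PPL = worse)
    d_tps = 100 * (tps / FP32_TPS - 1)  # speed vs FP32, percent (higher = faster)
    print(f"{name:12s} ΔPPL {d_ppl:+.1f}%  speed {d_tps:+.0f}%")
```

Every row reproduces the published deltas, e.g. `turbo_kv_4b` gives ΔPPL +5.7% and speed +10%.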

README.md

Lines changed: 27 additions & 31 deletions
````diff
@@ -43,41 +43,37 @@ LLM memory is dominated by the **KV cache**, not model weights. At 32K context,
 
 > **Same hardware. 4–7x longer context. PPL measured and disclosed.**
 
-### Llama 3.2 3B Instruct — PPL on WikiText (FP32 baseline = 13.56)
+### Llama 3.2 3B Instruct — PPL × Speed (FP32 KV = 13.56 PPL @ 12.6 tok/s)
+
+> **`turbo_kv_4b` is now both 7× more compressed AND faster than fp32 KV** at long context. The Karpathy loop closed the speed gap completely (PPL eval throughput).
+
+| KV Config | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
+|:----------|------------:|------------:|----:|----------:|------:|--------------:|
+| FP32 reference ||| 13.56 || 12.6 | baseline |
+| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **13.2** | **+5%** |
+| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 12.7 | +1% |
+| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.17 | +4.5% | 9.3 | -26% |
+| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.33** | **+5.7%** | **13.9** | **+10%** |
+| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 11.7 | -7% |
+| **`turbo_kv_3b`** | **56** | **9.1×** | 15.36 | +13.3% | **13.4** | **+6%** |
+| llama.cpp `q4_0` KV (lit. survey) | ~70 | ~7.3× | ~14.99 | +10.6% |||
 
 ```
-PPL Degradation vs FP32
-(lower is better)
-
-llama.cpp Q4_0 KV     │██████████████████████████ +10.6% (4-bit, no RHT)
-uniform_4b            │███████████████ +6.3% (4-bit, no RHT)
-turbo_kv_4b ⭐ default│█████████████ +5.3% (72B/block)
-turbo_kv_3bo 🧪       │█████████ +3.5% (80B/block, +outliers)
-turbo_kv_4bo 🧪       │█████ +2.2% (96B/block, +outliers)
-turbo_kv_5b 🏆 quality│█ +0.34% (88B/block, near-lossless)
-FP32 reference        │ ← 0.0% (no quantization)
-                      └─────────────────────────────────────
-                       0%        +5%       +10%
+PPL Degradation vs FP32              Throughput vs FP32
+(lower is better)                    (higher is better)
+
+turbo_kv_5b    │█ +0.7%              ████████████ +5% ⬆
+turbo_kv_4bo   │██▌ +2.5%            ██████████▌ +1%
+turbo_kv_3bo   │████▌ +4.5%          ██████████ -26% ↓
+turbo_kv_4b ⭐ │█████ +5.7%          ████████████▌ +10% ⬆
+uniform_4b     │██████ +7.7%         ████████████ -7%
+turbo_kv_3b    │█████████████ +13.3% ████████████ +6% ⬆
+llama.cpp q4_0 │██████████ +10.6%    — (not measured)
+FP32 reference │ ← 0%                12.6 tok/s
+                0%   +5%   +10%      9  10  11  12  13  14
 ```
 
-| KV Config | Bytes/block | Compression | PPL | Δ vs FP32 |
-|:----------|------------:|------------:|----:|----------:|
-| FP32 reference ||| 13.56 ||
-| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.60** | **+0.34%** |
-| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.86 | +2.2% |
-| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.03 | +3.5% |
-| **`turbo_kv_4b`** ⭐ default | 72 | 7.1× | 14.28 | +5.3% |
-| `uniform_4b` | 68 | 7.5× | 14.41 | +6.3% |
-| llama.cpp `q4_0` KV | ~70 | ~7.3× | ~14.99 | +10.6% |
-| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% |
-
-`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the recommended Pareto-optimal choices. Both beat llama.cpp's `q4_0` KV at the same or smaller block size on Llama 3.2 3B perplexity. The full Karpathy-loop optimization history is in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
+`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto-optimal recommendations. **Both compress 5.8–7.1× and run faster than uncompressed FP32 KV.** Full Karpathy-loop history (9 rounds across 3 sessions) in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
 
 ### Context length gains (`turbo_kv_4b` + `q4` value cache)
````
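The compression column is consistent with a simple bytes-per-block model. This sketch reproduces the ratios under the assumption (not stated in this chunk of the README) that the uncompressed unit is a 128-value FP32 block, i.e. 512 bytes:

```python
# Compression ratio = FP32 bytes per block / quantized bytes per block.
# A 128-value FP32 block (512 bytes) is an assumption, not stated in the diff.
FP32_BYTES = 128 * 4  # 512 bytes uncompressed

bytes_per_block = {  # from the table's Bytes/block column
    "turbo_kv_5b":  88,
    "turbo_kv_4bo": 96,
    "turbo_kv_3bo": 80,
    "turbo_kv_4b":  72,
    "uniform_4b":   68,
    "turbo_kv_3b":  56,
}

for name, b in bytes_per_block.items():
    print(f"{name:12s} {FP32_BYTES / b:.1f}x")
```

Under that assumption every ratio matches the table (512/88 ≈ 5.8×, 512/72 ≈ 7.1×, 512/56 ≈ 9.1×), and the overhead beyond the packed bits (e.g. 88 − 80 = 8 bytes for `turbo_kv_5b`) would be per-block scale/metadata.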
