README: showcase the speed breakthrough — turbo_kv beats fp32
Updated the headline KV comparison table to show the new throughput
column. turbo_kv_4b/3b/5b now all run faster than uncompressed FP32 KV
on Llama 3.2 3B PPL eval, with 5.8–9.1× compression. The visualization
includes both PPL degradation and throughput in a single ASCII chart.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto-optimal recommendations. Both beat llama.cpp's `q4_0` KV at the same or smaller block size. The full Karpathy-loop history is in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
> **`turbo_kv_4b` is now both 7× more compressed AND faster than fp32 KV** at long context. The Karpathy loop closed the speed gap completely (PPL eval throughput).
| KV Config | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the recommended Pareto-optimal choices. Both beat llama.cpp's `q4_0` KV at the same or smaller block size on Llama 3.2 3B perplexity. The full Karpathy-loop optimization history is in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto-optimal recommendations. **Both compress 5.8–7.1× and run faster than uncompressed FP32 KV.** Full Karpathy-loop history (9 rounds across 3 sessions) in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
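As a rough illustration of how a compression figure can be derived from a bytes-per-block value, the sketch below divides the FP32 footprint of a block by its compressed size. The function name `compression_ratio` and the block geometry used in the example (128 elements stored in 64 bytes) are assumptions made for illustration only; they are not taken from the benchmark or from the project's code.

```python
def compression_ratio(bytes_per_block: int,
                      elements_per_block: int,
                      baseline_bytes_per_element: int = 4) -> float:
    """Return how many times smaller a compressed block is than the baseline.

    The baseline defaults to FP32 (4 bytes per element). All arguments are
    hypothetical inputs supplied by the caller, not measured values.
    """
    if bytes_per_block <= 0 or elements_per_block <= 0:
        raise ValueError("block sizes must be positive")
    baseline_bytes = elements_per_block * baseline_bytes_per_element
    return baseline_bytes / bytes_per_block


# Hypothetical block: 128 elements packed into 64 bytes.
# FP32 would need 128 * 4 = 512 bytes, so the ratio is 512 / 64.
ratio = compression_ratio(bytes_per_block=64, elements_per_block=128)
print(f"{ratio:.1f}x")
```

The same helper can be pointed at a different baseline (for example `baseline_bytes_per_element=2` for FP16) to see how the quoted ratio depends on what the compressed cache is compared against.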
### Context length gains (`turbo_kv_4b` + `q4` value cache)