README.md / README.ko.md:
- Replace plain comparison table with visual ASCII bar chart of PPL
degradation across all KV types (FP32 → 5b → 4bo → 3bo → 4b →
uniform_4b → llama.cpp q4_0 → 3b)
- Expanded quality table with bytes/block + compression columns
- Pareto-optimal recommendations called out (4b default, 5b quality)
docs/custom-quantization.md:
- Updated reference types table with measured PPL deltas where known,
bytes/block, and pattern descriptions
- Added "How the production winners were found" section showing the
6-round Karpathy loop history that produced Variant F
- Concrete iteration loop guide for contributors adding new types
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`turbo_kv_4b` is the project's **best 4-bit KV quantization**, beating both the previous production baseline (`uniform_4b`) and llama.cpp's `q4_0` KV at the same bit budget. The Karpathy-loop path that led to it is documented in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the recommended Pareto-optimal choices. Both beat llama.cpp's `q4_0` KV at the same or smaller block size. The full Karpathy-loop history is in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
`turbo_kv_4b` is currently the **best 4-bit KV cache quantization in the project** — it beats both our previous production baseline (`uniform_4b`) and llama.cpp's `q4_0` KV at the same bit budget. The Karpathy-loop history that produced it is in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the recommended Pareto-optimal choices. Both beat llama.cpp's `q4_0` KV at the same or smaller block size on Llama 3.2 3B perplexity. The full Karpathy-loop optimization history is in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
### Context length gains (`turbo_kv_4b` + `q4` value cache)
|`mixed_4b8`|`src/core/tq_uniform.c`| — | — | Medium | 4-bit base + FP16 outlier table |
### How the production winners were found
`turbo_kv_4b` and `turbo_kv_5b` are not just hand-designed types — they're the **outputs of a 6-round Karpathy loop** of empirical iteration on Llama 3.2 3B perplexity:
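The skeleton of one such round can be sketched as a measure-and-keep-the-winner loop. Everything below is illustrative: `eval_ppl`, the candidate names, and the perplexity numbers are invented stand-ins, not the project's benchmark API or measured results.

```c
/* Hypothetical harness for one Karpathy-loop round: evaluate every
 * candidate KV type, keep the lowest-perplexity one, mutate it next round.
 * Names and numbers are illustrative only, NOT measured results. */
#include <string.h>

/* stand-in for running the real benchmark (e.g. Llama 3.2 3B perplexity) */
static double eval_ppl(const char *kv_type) {
    if (strcmp(kv_type, "variant_a") == 0) return 10.42; /* fake value */
    if (strcmp(kv_type, "variant_b") == 0) return 10.31; /* fake value */
    return 10.55;                          /* fake baseline, e.g. uniform_4b */
}

/* one round: measure every candidate, return the winner and its PPL */
static const char *pick_winner(const char *const *names, int n, double *out_ppl) {
    const char *best = names[0];
    double best_ppl = eval_ppl(best);
    for (int i = 1; i < n; i++) {
        double p = eval_ppl(names[i]);
        if (p < best_ppl) { best_ppl = p; best = names[i]; }
    }
    if (out_ppl) *out_ppl = best_ppl;
    return best;
}
```

The loop's value is in the discipline, not the code: every candidate gets the same benchmark, the winner seeds the next round's mutations, and six rounds of this produced Variant F.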