README + custom-quantization tutorial polish

unamedkr · claude · unamedkr · commit 1db8a55f7914 · 2026-04-08T08:19:16.000+09:00
README.md / README.ko.md:
- Replace plain comparison table with visual ASCII bar chart of PPL
  degradation across all KV types (FP32 → 5b → 4bo → 3bo → 4b →
  uniform_4b → llama.cpp q4_0 → 3b)
- Expanded quality table with bytes/block + compression columns
- Pareto-optimal recommendations called out (4b default, 5b quality)

docs/custom-quantization.md:
- Updated reference types table with measured PPL deltas where known,
  bytes/block, and pattern descriptions
- Added "How the production winners were found" section showing the
  6-round Karpathy loop history that produced Variant F
- Concrete iteration loop guide for contributors adding new types

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/README.ko.md b/README.ko.md
@@ -43,17 +43,40 @@ LLM 메모리의 병목은 모델 가중치가 아니라 **KV 캐시**입니다.
 
 > **같은 하드웨어. 4–7배 긴 컨텍스트. PPL 측정 + 공개.**
 
-### Llama 3.2 3B Instruct, FP32 KV 베이스라인 = PPL 13.56
+### Llama 3.2 3B Instruct — WikiText PPL (FP32 베이스라인 = 13.56)
 
-| KV 설정 | bits/elem | PPL | Δ vs FP32 | 비고 |
-|:--------|----------:|----:|----------:|:------|
-| FP32 (베이스라인) | 32 | 13.56 | — | reference |
-| **`turbo_kv_4b`** ⭐ | 4 | **14.28** | **+5.3%** | RHT + 4-bit codebook, uniform_4b 능가 |
-| `uniform_4b` | 4 | 14.41 | +6.3% | per-block min-max |
-| `turbo_kv_3b` | 3 | 15.39 | +13.5% | RHT + 3-bit codebook |
-| llama.cpp `q4_0` KV | 4 | ~14.99 | +10.6% | 비교용 |
+```
+                              FP32 대비 PPL 저하 (낮을수록 좋음)
+
+  llama.cpp Q4_0 KV    │██████████████████████████ +10.6%   (4-bit, RHT 없음)
+                       │
+  uniform_4b           │███████████████ +6.3%               (4-bit, RHT 없음)
+                       │
+  turbo_kv_4b ⭐ 기본   │█████████████ +5.3%                 (72B/블록)
+                       │
+  turbo_kv_3bo 🧪      │█████████ +3.5%                     (80B/블록, +outliers)
+                       │
+  turbo_kv_4bo 🧪      │█████ +2.2%                         (96B/블록, +outliers)
+                       │
+  turbo_kv_5b 🏆 quality│█ +0.34%                            (88B/블록, 거의 무손실)
+                       │
+  FP32 reference       │ ← 0.0%                             (양자화 없음)
+                       └─────────────────────────────────────
+                        0%      +5%      +10%
+```
 
-`turbo_kv_4b`가 프로젝트 내 **최고 4-bit KV 양자화** — 같은 비트 예산에서 이전 production 베이스라인(`uniform_4b`)과 llama.cpp `q4_0` KV를 모두 능가합니다. Karpathy 루프로 도달한 과정은 [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
+| KV 설정 | 블록 바이트 | 압축 | PPL | Δ vs FP32 |
+|:--------|----:|----:|----:|----:|
+| FP32 reference | — | 1× | 13.56 | — |
+| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.60** | **+0.34%** |
+| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.86 | +2.2% |
+| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.03 | +3.5% |
+| **`turbo_kv_4b`** ⭐ 기본 | 72 | 7.1× | 14.28 | +5.3% |
+| `uniform_4b` | 68 | 7.5× | 14.41 | +6.3% |
+| llama.cpp `q4_0` KV | ~70 | ~7.3× | ~14.99 | +10.6% |
+| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% |
+
+`turbo_kv_4b` (기본)와 `turbo_kv_5b` (quality)가 Pareto-optimal 추천. 두 가지 모두 llama.cpp `q4_0` KV를 같거나 작은 블록 사이즈에서 능가. 전체 Karpathy 루프 이력은 [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
 
 ### 컨텍스트 길이 증가 (`turbo_kv_4b` + `q4` value cache)
 
diff --git a/README.md b/README.md
@@ -43,17 +43,41 @@ LLM memory is dominated by the **KV cache**, not model weights. At 32K context,
 
 > **Same hardware. 4–7x longer context. PPL measured and disclosed.**
 
-### Llama 3.2 3B Instruct, FP32 KV baseline = PPL 13.56
+### Llama 3.2 3B Instruct — PPL on WikiText (FP32 baseline = 13.56)
 
-| KV Config | Bits/elem | PPL | Δ vs FP32 | Notes |
-|:----------|----------:|----:|----------:|:------|
-| FP32 (baseline) | 32 | 13.56 | — | reference |
-| **`turbo_kv_4b`** ⭐ | 4 | **14.28** | **+5.3%** | RHT + 4-bit codebook, beats uniform_4b |
-| `uniform_4b` | 4 | 14.41 | +6.3% | per-block min-max |
-| `turbo_kv_3b` | 3 | 15.39 | +13.5% | RHT + 3-bit codebook |
-| llama.cpp `q4_0` KV | 4 | ~14.99 | +10.6% | for comparison |
+```
+                              PPL Degradation vs FP32
+                              (lower is better)
+
+  llama.cpp Q4_0 KV    │██████████████████████████ +10.6%   (4-bit, no RHT)
+                       │
+  uniform_4b           │███████████████ +6.3%               (4-bit, no RHT)
+                       │
+  turbo_kv_4b ⭐ default│█████████████ +5.3%                 (72B/block)
+                       │
+  turbo_kv_3bo 🧪      │█████████ +3.5%                     (80B/block, +outliers)
+                       │
+  turbo_kv_4bo 🧪      │█████ +2.2%                         (96B/block, +outliers)
+                       │
+  turbo_kv_5b 🏆 quality│█ +0.34%                            (88B/block, near-lossless)
+                       │
+  FP32 reference       │ ←   0.0%                           (no quantization)
+                       └─────────────────────────────────────
+                        0%      +5%      +10%
+```
 
-`turbo_kv_4b` is currently the **best 4-bit KV cache quantization in the project** — it beats both our previous production baseline (`uniform_4b`) and llama.cpp's `q4_0` KV at the same bit budget. The Karpathy-loop history that produced it is in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
+| KV Config | Bytes/block | Compression | PPL | Δ vs FP32 |
+|:----------|------------:|------------:|----:|----------:|
+| FP32 reference | — | 1× | 13.56 | — |
+| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.60** | **+0.34%** |
+| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.86 | +2.2% |
+| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.03 | +3.5% |
+| **`turbo_kv_4b`** ⭐ default | 72 | 7.1× | 14.28 | +5.3% |
+| `uniform_4b` | 68 | 7.5× | 14.41 | +6.3% |
+| llama.cpp `q4_0` KV | ~70 | ~7.3× | ~14.99 | +10.6% |
+| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% |
+
+`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the recommended Pareto-optimal choices. Both beat llama.cpp's `q4_0` KV at the same or smaller block size on Llama 3.2 3B perplexity. The full Karpathy-loop optimization history is in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
 
 ### Context length gains (`turbo_kv_4b` + `q4` value cache)
 
diff --git a/docs/custom-quantization.md b/docs/custom-quantization.md
@@ -437,16 +437,46 @@ The scoring harness checks:
 
 Study these files for implementation patterns:
 
-| Type Family | Source File | Complexity | Good Starting Point? |
-|-------------|------------|-----------|---------------------|
-| Uniform 4-bit | `src/core/tq_uniform.c` | Simple | Yes -- closest to the example above |
-| Uniform 2-bit | `src/core/tq_uniform.c` | Medium | Yes -- shows sub-block scales |
-| Uniform 3-bit | `src/core/tq_uniform.c` | Medium | Shows 3-bit packing (non-power-of-2) |
-| Polar | `src/core/tq_polar.c` | Complex | Advanced -- polar coordinate encoding |
-| QJL | `src/core/tq_qjl.c` | Complex | Advanced -- random projection hashing |
-| Turbo | `src/core/tq_turbo.c` | Complex | Advanced -- composite (Polar + QJL residual) |
-| TurboKV | `src/core/tq_turbo.c` | Complex | Advanced -- RHT + codebook + QJL |
-| Mixed | `src/core/tq_uniform.c` | Medium | Shows outlier handling |
+| Type Family | Source File | Bytes/block | Llama 3.2 3B PPL Δ | Complexity | Pattern |
+|-------------|------------|------------:|-------------------:|-----------|---------|
+| `uniform_4b` | `src/core/tq_uniform.c` | 68 | +6.3% | Simple | Per-block min/max linear |
+| `uniform_2b` | `src/core/tq_uniform.c` | 36 | — | Medium | Per-sub-block scales |
+| `uniform_3b` | `src/core/tq_uniform.c` | 52 | — | Medium | Non-power-of-2 packing |
+| `polar_3b` / `polar_4b` | `src/core/tq_polar.c` | 72 | — | Complex | Polar coordinates `(r, θ)` |
+| `qjl_1b` | `src/core/tq_qjl.c` | 36 | — | Complex | Sign-hash random projection |
+| `turbo_3b` / `turbo_4b` | `src/core/tq_turbo.c` | 96 | — | Complex | Composite (Polar + QJL residual, legacy) |
+| **`turbo_kv_4b` ⭐** | `src/core/tq_turbo_kv.c` | 72 | **+5.3%** | Medium | **RHT + 4-bit Lloyd-Max codebook (Variant F)** |
+| **`turbo_kv_5b` 🏆** | `src/core/tq_turbo_kv.c` | 88 | **+0.34%** | Medium | RHT + 5-bit Lloyd-Max codebook |
+| `turbo_kv_3b` | `src/core/tq_turbo_kv.c` | 56 | +13.5% | Medium | RHT + 3-bit Lloyd-Max codebook |
+| `turbo_kv_4bo` 🧪 | `src/core/tq_turbo_kv.c` | 96 | +2.2% | Medium | 4b base + 8 per-block FP16 outliers |
+| `turbo_kv_3bo` 🧪 | `src/core/tq_turbo_kv.c` | 80 | +3.5% | Medium | 3b base + 8 per-block FP16 outliers |
+| `turbo_kv_1b` | `src/core/tq_turbo_kv.c` | 24 | — | Medium | 1-bit sign hash (Hamming attention) |
+| `mixed_4b8` | `src/core/tq_uniform.c` | — | — | Medium | 4-bit base + FP16 outlier table |
+
+### How the production winners were found
+
+`turbo_kv_4b` and `turbo_kv_5b` are not just hand-designed types — they're the **outputs of a 6-round Karpathy loop** of empirical iteration on Llama 3.2 3B perplexity:
+
+| Round | Variant | turbo_kv_4b PPL | Decision |
+|---:|---|---:|---|
+| 0 | Literal port (RHT + Lloyd-Max + 1-bit QJL residual) | 16.03 | baseline |
+| 1 | empirical std rescale | 15.87 | keep |
+| 2 | max-abs no-clip rescale | 15.39 | keep |
+| 3 | 99th percentile clipping | 17.24 | revert |
+| 4 | K·std sweep (K ∈ {1.5..4}) | 15.53 (best K=2) | keep |
+| 5 | uniform 8-level linear | 16.28 | revert |
+| **6** | **drop QJL, double codebook size (Variant F)** | **14.28** ✅ | **shipped** |
+
+The full ablation history with measurement methodology is in [bench/results/turboquant_reproduction.md](../bench/results/turboquant_reproduction.md).
+
+If you're adding a new type, you'll likely follow the same loop:
+1. Implement a literal version of your idea
+2. Run `./build/quant model.gguf --ppl bench/data/ppl_1k.txt -k yourtype` to measure
+3. Compare against `turbo_kv_4b` (default)
+4. Iterate one variable at a time, accept improvements, revert regressions
+5. Add a regression test that pins your final quality threshold
+
+The codebase is structured to make this loop fast (build < 30s, PPL test < 2 min on a 3B model).
 
 ### File Checklist