Commit e161d2e

README: add cross-size validation table (3 Llama-family models)

1 parent 58ac4c8

1 file changed: README.md (10 additions, 0 deletions)
@@ -74,6 +74,16 @@ LLM memory is dominated by the **KV cache**, not model weights. At 32K context,
`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto-optimal recommendations: **5.8–7.1× memory compression at 92% of FP32 KV speed.** Full Karpathy-loop history (9 rounds across 3 sessions) in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
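For intuition on where the 5.8–7.1× range can come from, here is a back-of-envelope sketch. The block size and scale format are our illustrative assumptions (one fp16 scale per 32-element quantization block, no zero point), not a statement of the actual on-disk layout, but they happen to reproduce both endpoints:

```python
def kv_compression(bits_per_elem: int, block: int = 32, scale_bits: int = 16) -> float:
    """Compression vs. fp32 KV, amortizing one scale over each quantization block.

    block=32 and an fp16 scale are assumptions for illustration.
    """
    effective_bits = bits_per_elem + scale_bits / block  # amortized overhead per element
    return 32 / effective_bits

print(f"turbo_kv_5b: {kv_compression(5):.1f}x vs fp32")  # ~5.8x
print(f"turbo_kv_4b: {kv_compression(4):.1f}x vs fp32")  # ~7.1x
```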
### Cross-size validation (3 Llama-family models)
| Model | `turbo_kv_5b` PPL Δ | `turbo_kv_4b` PPL Δ |
|---|---:|---:|
| SmolLM2 135M Instruct | +1.7% | +5.8% |
| Llama 3.2 1B Instruct | **+0.7%** | +7.3% |
| Llama 3.2 3B Instruct | **+0.7%** | +5.7% |
`turbo_kv_5b` is consistently near-lossless across model sizes (~1% PPL Δ). `turbo_kv_4b` stays in the 5–8% range. **Recommendation**: use `turbo_kv_3b` only on models ≥ 3B parameters (the 8-level codebook is too coarse for small models — +61% PPL on Llama 3.2 1B).
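To see why level count bites so hard, the toy sketch below quantizes a Gaussian tensor (a stand-in for KV activations; this is not the repo's actual codebook) to 8-, 16-, and 32-level uniform codebooks. Reconstruction error roughly quadruples for each bit removed, which is the trade the 3b/4b/5b variants are navigating:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1 << 16).astype(np.float32)  # stand-in for KV activations

for bits in (3, 4, 5):
    levels = 1 << bits  # 8, 16, 32 codebook entries
    # Uniform codebook over the observed range; real schemes fit scales per block.
    codebook = np.linspace(x.min(), x.max(), levels)
    nearest = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
    rel_mse = np.mean((x - codebook[nearest]) ** 2) / np.mean(x**2)
    print(f"{bits}b ({levels:2d} levels): relative MSE {rel_mse:.4f}")
```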
> **About this comparison**: We previously published v0.6.3 release notes claiming `turbo_kv` beats `fp32` KV speed. That was an artifact of the fp32 attention path being unoptimized scalar — once we added NEON to the fp32 path (commit `4490c83`), the honest gap is `−7%` to `−12%`, not `+5%` to `+10%`. We've corrected the README and the v0.6.3 release notes.
### Context length gains (`turbo_kv_4b` + `q4` value cache)
