bench: llama.cpp full KV type PPL comparison (q4_0 to q8_0)

unamedkr · claude · unamedkr · commit 18eeed15af69 · 2026-04-03T08:34:27.000+09:00
SmolLM2 1.7B, 2K tokens:
  f16: 2.83, q8_0: 2.82, q5_1: 2.86, q5_0: 2.85, q4_1: 2.92, q4_0: 3.13

Q5 types show <1% PPL loss. Q4_1 shows +3.2% and Q4_0 shows +10.6%.
This provides the baseline for comparison with TurboQuant approaches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/bench/results/ppl_comparison.md b/bench/results/ppl_comparison.md
@@ -32,3 +32,14 @@ The KEY metric is the DELTA from each engine's own baseline.
 
 TurboQuant achieves 4x more compression on keys with zero PPL increase,
 while llama.cpp's Q4 KV shows measurable quality degradation.
+
+## llama.cpp Full KV Type Comparison (SmolLM2 1.7B, 2K tokens)
+
+| KV Type | PPL | Delta vs F16 | Nominal bits/element |
+|---------|-----|-------------|--------------|
+| f16 (baseline) | 2.83 | — | 16 |
+| q8_0 | 2.82 | -0.4% | 8 |
+| q5_1 | 2.86 | +0.9% | 5 |
+| q5_0 | 2.85 | +0.6% | 5 |
+| q4_1 | 2.92 | +3.2% | 4 |
+| q4_0 | 3.13 | +10.6% | 4 |