|
# TurboQuant Paper Reproduction — Status Report

> Run date: 2026-04-08
> Paper: [Zandieh et al., *TurboQuant*, ICLR 2026](https://arxiv.org/abs/2504.19874)
> Hardware: Apple M1 Pro, 8 threads
> Dataset: `bench/data/ppl_1k.txt` (1040 tokens, perplexity benchmark)
> Verdict: **Building blocks correct, end-to-end PPL does not yet reproduce paper claims.**

## TL;DR

quant.cpp's `turbo_kv_3b` / `turbo_kv_4b` types implement the same algorithmic structure as Google's TurboQuant (RHT → Lloyd-Max codebook → 1-bit QJL residual). However, on Llama 3.2 3B with WikiText-style perplexity, **`turbo_kv_*` is currently strictly worse than the simpler `uniform_4b`** at the same bit budget. We are not yet a faithful reproduction of the paper's reported quality.

This document records the actual measured numbers and tracks the gap.

## Measured Numbers

### Llama 3.2 3B Instruct (Q8_0 weights)

| KV type | Bits/elem | PPL | Δ vs FP32 | Notes |
|---|---:|---:|---:|---|
| **fp32** | 32 | 13.56 | baseline | reference |
| `uniform_4b` + FP16 V | 4 | **14.41** | **+6.3%** | simple per-block min-max ✅ recommended |
| `turbo_kv_4b` + FP16 V | 4 | 16.03 | +18.2% | RHT + 3-bit codebook + 1-bit QJL |
| `turbo_kv_3b` + FP16 V | 3 | 25.84 | +90.6% | RHT + 2-bit codebook + 1-bit QJL ❌ |

### SmolLM2 135M Instruct

| KV type | Bits/elem | PPL | Δ vs FP32 | Notes |
|---|---:|---:|---:|---|
| **fp32** | 32 | 18.62 | baseline | reference |
| `uniform_4b` + FP16 V | 4 | 20.33 | +9.2% | |
| `turbo_kv_4b` + FP16 V | 4 | 24.94 | +33.9% | |
| `turbo_kv_3b` + FP16 V | 3 | 68.23 | +266% | catastrophic |

## What the paper claims

| Model | Method | Paper number |
|---|---|---|
| Llama-3.1-8B | Full cache | LongBench 50.06 |
| Llama-3.1-8B | TurboQuant 3.5-bit | LongBench 50.06 (*identical to baseline*) |
| Llama-3.1-8B | TurboQuant 2.5-bit | LongBench 49.44 |
| — | NIH (needle-in-a-haystack) @ 3-bit | ~0.997 (vs 1.000 baseline) |

Translated to PPL terms, the paper's results imply approximately **zero PPL degradation at 3.5-bit** and **<2% degradation at 2.5-bit**. We are at **+18.2% at 4-bit** and **+90.6% at 3-bit** — well over an order of magnitude worse.

## Building blocks audit

| Component | Status | Notes |
|---|:--:|---|
| Per-vector L2 normalization (`‖x‖` stored as FP16) | ✅ correct | Lines 180–185 |
| Random Hadamard Transform (`tq_rht_transform`) | ✅ correct | Walsh-Hadamard + Rademacher |
| Lloyd-Max-Gaussian centroids | ✅ correct | Match Max 1960 N(0,1) tables to 4 decimals |
| `inv_std = √d` rescaling | ⚠️ suspect | Assumes coords are exactly N(0, 1/d). For finite d, a uniform unit-vector coordinate x satisfies x² ~ `Beta(1/2, (d−1)/2)` (density ∝ (1−x²)^((d−3)/2) on [−1, 1]), NOT exactly Gaussian. |
| Residual norm `‖r‖₂` stored as FP16 | ✅ correct | Lines 226–230 |
| 1-bit QJL sign hash on residual | ✅ correct | `compute_qjl_signs` |
| Pre-rotated query optimization | ✅ correct | `q_rot = RHT(query)` once |
| Inner product estimator combining stages | ⚠️ unverified | `dot1 + r_norm * qjl_correction` — formula may not exactly match paper |
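
For orientation, here is a minimal self-contained sketch of the encode path these rows describe, in the `turbo_kv_4b` layout (3-bit Lloyd-Max stage plus 1-bit QJL residual). `tq_rht_transform` and `compute_qjl_signs` are names taken from the audit above, but the signatures, the `TurboKVCode` struct, the QJL projection `S`, and `turbo_kv_encode` itself are illustrative assumptions, not quant.cpp's actual code; per-block processing at `TQ_BK` is ignored here.

```cpp
// Sketch of the turbo_kv_4b encode path (3-bit Lloyd-Max stage + 1-bit QJL residual).
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fast Walsh-Hadamard transform preceded by a fixed Rademacher sign flip (RHT).
// d must be a power of two; `signs` holds ±1 values drawn once and reused.
static void tq_rht_transform(std::vector<float>& x, const std::vector<float>& signs) {
    const std::size_t d = x.size();
    for (std::size_t i = 0; i < d; ++i) x[i] *= signs[i];      // D: random signs
    for (std::size_t len = 1; len < d; len <<= 1)               // H: butterfly passes
        for (std::size_t i = 0; i < d; i += 2 * len)
            for (std::size_t j = i; j < i + len; ++j) {
                const float a = x[j], b = x[j + len];
                x[j] = a + b;
                x[j + len] = a - b;
            }
    const float scale = 1.0f / std::sqrt((float)d);             // make H orthonormal
    for (float& v : x) v *= scale;
}

// 1-bit QJL: keep only the signs of a Gaussian random projection S (sketch_dim × d).
static std::vector<uint8_t> compute_qjl_signs(const std::vector<float>& r,
                                              const std::vector<std::vector<float>>& S) {
    std::vector<uint8_t> bits(S.size());
    for (std::size_t k = 0; k < S.size(); ++k) {
        float dot = 0.0f;
        for (std::size_t i = 0; i < r.size(); ++i) dot += S[k][i] * r[i];
        bits[k] = dot >= 0.0f ? 1 : 0;
    }
    return bits;
}

struct TurboKVCode {
    float vec_norm;                // ‖x‖₂ (stored as FP16 in quant.cpp)
    std::vector<uint8_t> codes;    // stage 1: per-coordinate Lloyd-Max index
    float res_norm;                // ‖r‖₂ of the residual (stored as FP16)
    std::vector<uint8_t> qjl_bits; // stage 2: 1-bit QJL signs of the residual
};

TurboKVCode turbo_kv_encode(std::vector<float> x, const std::vector<float>& signs,
                            const std::vector<std::vector<float>>& S) {
    // 8-level Lloyd-Max centroids for N(0,1) (Max 1960) = the 3-bit stage-1 codebook.
    static const float C[8] = {-2.1520f, -1.3439f, -0.7560f, -0.2451f,
                                0.2451f,  0.7560f,  1.3439f,  2.1520f};
    const std::size_t d = x.size();
    TurboKVCode out;

    // 1. Per-vector L2 normalization; keep the norm for dequantization.
    float n = 0.0f;
    for (float v : x) n += v * v;
    out.vec_norm = n = std::sqrt(n);
    for (float& v : x) v /= (n > 0.0f ? n : 1.0f);

    // 2. RHT, then rescale by √d so coordinates are roughly unit-variance.
    tq_rht_transform(x, signs);
    for (float& v : x) v *= std::sqrt((float)d);

    // 3. Stage 1: snap each coordinate to the nearest Lloyd-Max centroid.
    out.codes.resize(d);
    std::vector<float> r(d);
    for (std::size_t i = 0; i < d; ++i) {
        std::size_t best = 0;
        for (std::size_t c = 1; c < 8; ++c)
            if (std::fabs(x[i] - C[c]) < std::fabs(x[i] - C[best])) best = c;
        out.codes[i] = (uint8_t)best;
        r[i] = x[i] - C[best];       // residual goes to stage 2
    }

    // 4. Stage 2: residual norm + 1-bit QJL sign sketch.
    float rn = 0.0f;
    for (float v : r) rn += v * v;
    out.res_norm = std::sqrt(rn);
    out.qjl_bits = compute_qjl_signs(r, S);
    return out;
}
```

The centroid table is the standard 8-level Max (1960) quantizer for N(0,1), which is what the audit says the shipped codebook matches.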

## Hypotheses for the gap

1. **Lloyd-Max scaling**: After random rotation of a unit-norm vector, a coordinate `x` satisfies `x² ~ Beta(1/2, (d−1)/2)`, i.e. has density ∝ `(1−x²)^((d−3)/2)` on `[−1, 1]`, not exactly `N(0, 1/d)`. The discrepancy matters at small `d` (head_dim 64–128). Need to either (a) recompute centroids for this exact marginal, or (b) verify that the Gaussian approximation suffices for `d ≥ 128`.

2. **QJL correction formula**: The paper's combined estimator is `⟨q, x̃_mse⟩ + ‖r‖₂ · ⟨q, Q_qjl⁻¹(Q_qjl(r))⟩`. Our code uses `dot1 + r_norm * qjl_dot * qjl_scale` where `qjl_scale = √(π/2) / sketch_dim`. The constant factor and the fact that the residual is computed *after* normalization may both be off. (A sketch of how the stages are currently combined follows this list.)

3. **Per-channel outlier handling**: The paper allocates extra bits to ~25% of channels identified as outliers. We do uniform per-channel allocation. This alone could account for a meaningful portion of the gap.

4. **Block size**: The paper operates on the full vector. We block at `TQ_BK = 128`. For `head_dim ≤ 128` this is moot, but the per-block normalization may interact differently with rotation than per-vector normalization.

5. **Sketch dimension**: We use `sketch_dim = head_dim`. The paper may use a different ratio (typically `sketch_dim ≥ 2·d` for QJL).
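
To make hypothesis 2 concrete, the sketch below mirrors the combination rule quoted above (`dot1 + r_norm * qjl_dot * qjl_scale` with `qjl_scale = √(π/2) / sketch_dim`), paired with the encode sketch in the audit section. `turbo_kv_dot`, the centroid table, and the final un-scaling step are illustrative assumptions, not the verified paper estimator; the point is that the suspect constant sits in a single line and can be ablated in isolation.

```cpp
// Sketch of the combined two-stage dot-product estimator described in hypothesis 2.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

float turbo_kv_dot(const std::vector<float>& q_rot,          // RHT(query), computed once
                   const std::vector<uint8_t>& codes,        // stage-1 Lloyd-Max indices
                   float vec_norm,                           // ‖x‖₂ from encode time
                   float res_norm,                           // ‖r‖₂ from encode time
                   const std::vector<uint8_t>& qjl_bits,     // 1-bit QJL signs of residual
                   const std::vector<std::vector<float>>& S) // same QJL projection as encode
{
    static const float C[8] = {-2.1520f, -1.3439f, -0.7560f, -0.2451f,
                                0.2451f,  0.7560f,  1.3439f,  2.1520f};
    const std::size_t d = q_rot.size();
    const std::size_t m = qjl_bits.size();                    // sketch_dim

    // Stage 1: dot product against the dequantized (centroid) coordinates.
    float dot1 = 0.0f;
    for (std::size_t i = 0; i < d; ++i) dot1 += q_rot[i] * C[codes[i]];

    // Stage 2: QJL correction. Project q_rot with the same S and correlate
    // against the stored residual signs.
    float qjl_dot = 0.0f;
    for (std::size_t k = 0; k < m; ++k) {
        float sq = 0.0f;
        for (std::size_t i = 0; i < d; ++i) sq += S[k][i] * q_rot[i];
        qjl_dot += (qjl_bits[k] ? 1.0f : -1.0f) * sq;
    }
    const float qjl_scale = std::sqrt(3.14159265f / 2.0f) / (float)m;  // constant questioned above

    // Combine, then undo the √d rescaling and the per-vector normalization
    // applied at encode time (see the encode sketch in the audit section).
    const float est = dot1 + res_norm * qjl_dot * qjl_scale;
    return est * vec_norm / std::sqrt((float)d);
}
```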

## What works today (recommended config)

For users who want maximum compression with minimum quality loss **today**, the recommended config is:

```bash
./build/quant model.gguf --chat -p "..." -k uniform_4b -v fp16 # 1.6x compression, +6.3% PPL on 3B
./build/quant model.gguf --chat -p "..." -k uniform_4b -v q4 # 6.9x compression, +6.3% PPL on 3B (V quality preserved)
```

`turbo_kv_*` is **not currently recommended** for production use until the gap is closed.

## Action items

1. ☐ Recompute Lloyd-Max centroids for the exact sphere-coordinate marginal (`x² ~ Beta(1/2, (d−1)/2)`) at `d ∈ {64, 128, 256}` (see the sketch after this list)
2. ☐ Implement per-channel outlier extraction (32 outlier channels at higher bit width per the paper)
3. ☐ Verify QJL correction constant against the original QJL paper (arXiv:2406.03482)
4. ☐ Test with `sketch_dim = 2 · head_dim`
5. ☐ Ablation: turn off QJL stage entirely; measure MSE-only PPL to isolate stage 1 vs stage 2
6. ☐ Add a unit test that fails if `turbo_kv_4b` PPL on Llama 3.2 3B exceeds 14.5 (currently 16.03)
7. ☐ Track in a GitHub issue for community visibility
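
As a starting point for item 1, here is a standalone sketch (not part of quant.cpp; the function name, grid resolution, and initialization are arbitrary choices) that runs plain Lloyd-Max against the exact √d-rescaled sphere-coordinate marginal, density ∝ (1 − u²/d)^((d−3)/2) on [−√d, √d], and prints the resulting 8-level codebooks for comparison against the Max (1960) Gaussian values currently in use.

```cpp
// Sketch for action item 1: Lloyd-Max centroids for the exact marginal of a
// √d-rescaled uniform unit-vector coordinate (hypothetical offline tool).
#include <cmath>
#include <cstdio>
#include <vector>

std::vector<double> lloyd_max_sphere_marginal(int d, int levels,
                                              int grid = 100000, int iters = 200) {
    const double lim = std::sqrt((double)d);
    std::vector<double> u(grid), g(grid);
    for (int i = 0; i < grid; ++i) {                 // tabulate the density on a grid
        u[i] = -lim + 2.0 * lim * (i + 0.5) / grid;
        g[i] = std::pow(1.0 - u[i] * u[i] / d, 0.5 * (d - 3));
    }
    std::vector<double> c(levels);                   // evenly spaced initial centroids
    for (int k = 0; k < levels; ++k) c[k] = -lim + 2.0 * lim * (k + 0.5) / levels;

    for (int it = 0; it < iters; ++it) {
        std::vector<double> b(levels + 1);           // boundaries = centroid midpoints
        b[0] = -lim; b[levels] = lim;
        for (int k = 1; k < levels; ++k) b[k] = 0.5 * (c[k - 1] + c[k]);

        std::vector<double> mass(levels, 0.0), mean(levels, 0.0);
        int cell = 0;
        for (int i = 0; i < grid; ++i) {             // centroid = conditional mean per cell
            while (cell + 1 < levels && u[i] > b[cell + 1]) ++cell;
            mass[cell] += g[i];
            mean[cell] += g[i] * u[i];
        }
        for (int k = 0; k < levels; ++k)
            if (mass[k] > 0.0) c[k] = mean[k] / mass[k];
    }
    return c;
}

int main() {
    const int dims[] = {64, 128, 256};
    for (int d : dims) {
        std::vector<double> c = lloyd_max_sphere_marginal(d, 8);  // 3-bit stage-1 codebook
        std::printf("d=%3d:", d);
        for (double v : c) std::printf(" %+.4f", v);
        std::printf("\n");   // compare against the Gaussian table ±0.2451 ... ±2.1520
    }
    return 0;
}
```

If the `d = 128` and `d = 256` outputs land within the quoted 4-decimal tolerance of the Gaussian table, that supports option (b) in hypothesis 1; if not, these become candidate replacement codebooks.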

## Reproducing this report

```bash
cmake --build build -j 8
for k in fp32 uniform_4b turbo_kv_4b turbo_kv_3b; do
  echo "=== $k ==="
  ./build/quant models/Llama-3.2-3B-Instruct-Q8_0.gguf \
    --ppl bench/data/ppl_1k.txt -j 8 -k $k -v fp16 2>&1 | tail -3
done
```

## Honest positioning

quant.cpp's existing **production-quality** KV compression is `uniform_4b`, which beats llama.cpp's q4_0 KV (+6.3% PPL vs +10.6% PPL on comparable benchmarks). It is **not** a Google TurboQuant reproduction. The `turbo_kv_*` types are an in-progress paper port that does not yet match published numbers.

We should not claim to be a "verified TurboQuant implementation" until at least one bit budget reproduces the paper-implied PPL within ±5%.