# TurboQuant Gap Analysis: Paper vs quant.cpp

> Comparison of Google TurboQuant (Zandieh, Daliri, Hadian, Mirrokni — ICLR 2026, [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)) against the current `tq_polar` / `tq_qjl` / `tq_turbo` implementations in quant.cpp.

## TL;DR

quant.cpp's existing `TQ_TURBO_*` types implement an *earlier* generation of the algorithm — specifically the PolarQuant paper ([arXiv:2502.02617](https://arxiv.org/abs/2502.02617)) plus a QJL residual. **They do not implement the algorithm Google published as "TurboQuant" ([arXiv:2504.19874](https://arxiv.org/abs/2504.19874), ICLR 2026).**

The 2-stage architecture (coarse quantizer + 1-bit QJL residual) is correct, but Stage 1 is the wrong quantizer. We need a new type — `TQ_TURBOQUANT_*` — that implements Google's actual algorithm.

## Algorithm comparison

| Component | Google TurboQuant (ICLR 2026) | quant.cpp `TQ_TURBO_*` (current) |
|---|---|---|
| **Stage 1 transform** | Random orthogonal rotation Π via QR(N(0,1)) | Polar coordinate conversion `(x,y) → (r,θ)` |
| **Stage 1 quantizer** | Lloyd-Max scalar quantizer (Beta-distribution-aware centroids) | Min-max linear quantization |
| **Stage 1 bit budget** | (b−1) bits per coordinate (e.g. 2.5 bits at 3.5-bit total) | 2 bits θ + 2 bits r per pair = 2 bpc |
| **Stage 2 residual** | 1-bit QJL on residual + ‖r‖₂ scalar | QJL on residual (no explicit norm) |
| **Block size** | None — operates on full d-dim vector | TQ_BK = 128 |
| **Outlier handling** | Per-channel: 32 outlier channels at higher bit width | None |
| **Inner product estimator** | ⟨y, x̃_mse⟩ + ‖r‖₂·⟨y, Q_qjl⁻¹(Q_qjl(r))⟩ | ⟨y, x̃_polar + x̃_qjl⟩ (no norm) |
| **Centroid storage** | Precomputed (Lloyd-Max output cached) | Per-block min/max in FP16 |

## What's missing

### 1. Random rotation matrix Π

Google generates Π by QR-decomposing a Gaussian random matrix. For our embedded targets we need a fast deterministic alternative:

- **Hadamard transform** (Walsh-Hadamard) — already used by Arclabs001's research (cited in llama.cpp #20969 as actually outperforming random rotation in their tests)
- **Householder reflectors** (small constant memory)
- **Givens rotation network**
- **Pseudo-random Rademacher diagonal × WHT × Rademacher** — what our `tq_rht.c` already implements!

Our `tq_rht.c` (Random Hadamard Transform) is exactly what Google's "random rotation" needs. We just don't compose it with the right quantizer.
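
The construction boils down to a Rademacher sign flip followed by a fast Walsh-Hadamard transform. A minimal sketch of that composition, assuming `n` is a power of two; the function name and the precomputed `signs` array are illustrative stand-ins, not the actual `tq_rht.c` API (which may also apply a second sign diagonal):

```c
#include <math.h>
#include <stdint.h>

// Randomized Hadamard transform (sign diagonal + WHT), applied in place.
static void rht_apply_sketch(float *x, int n, const int8_t *signs /* ±1, seeded */) {
    // 1. Rademacher diagonal: pseudo-random sign flip per coordinate
    for (int i = 0; i < n; i++) x[i] *= (float)signs[i];

    // 2. In-place fast Walsh-Hadamard transform, O(n log n)
    for (int len = 1; len < n; len <<= 1) {
        for (int i = 0; i < n; i += len << 1) {
            for (int j = i; j < i + len; j++) {
                float a = x[j], b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
        }
    }

    // 3. Scale by 1/√n so the composite transform stays orthonormal
    float scale = 1.0f / sqrtf((float)n);
    for (int i = 0; i < n; i++) x[i] *= scale;
}
```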

### 2. Lloyd-Max scalar quantizer

After random rotation, the rotated coordinates follow a (concentrated) Beta distribution. The optimal quantizer for this distribution is **not** uniform min-max — it's Lloyd-Max with precomputed centroids:

- 1-bit: `{±√(2/πd)}`
- 2-bit: `{±0.453/√d, ±1.51/√d}`
- 3-bit, 4-bit, 5-bit: derived numerically from Beta(d/2, d/2)

We already have `src/core/tq_codebook.c` for codebook quantization — we just need to populate it with the Lloyd-Max centroids the paper specifies (or computes).
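
A hedged sketch of the per-coordinate encode/decode helpers that the Phase 1 code below assumes. `quantize_lloyd_max`, `dequant_lloyd_max`, and the `tq_lloyd_max_table` lookup are illustrative names, not existing `tq_codebook.c` symbols, and sub-byte packing is omitted for clarity:

```c
#include <math.h>
#include <stdint.h>

// Hypothetical accessor: returns the 2^bits precomputed Lloyd-Max centroids.
extern const float *tq_lloyd_max_table(int bits);

static void quantize_lloyd_max(const float *x, uint8_t *codes, int n, int bits) {
    const float *c = tq_lloyd_max_table(bits);
    const int levels = 1 << bits;
    for (int i = 0; i < n; i++) {
        int best = 0;
        float best_d = fabsf(x[i] - c[0]);
        for (int k = 1; k < levels; k++) {      // nearest-centroid search
            float d = fabsf(x[i] - c[k]);
            if (d < best_d) { best_d = d; best = k; }
        }
        codes[i] = (uint8_t)best;               // one code per byte (unpacked)
    }
}

static void dequant_lloyd_max(const uint8_t *codes, float *x, int n, int bits) {
    const float *c = tq_lloyd_max_table(bits);
    for (int i = 0; i < n; i++) x[i] = c[codes[i]];
}
```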

### 3. Stored ‖r‖₂

The residual norm is a single FP16 scalar per vector. We don't store this currently. Trivial to add to the `block_tq_turbo` struct.
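
As a sketch, the change is a single new field (the field name is illustrative, and the real `block_tq_turbo` layout in quant.cpp has more members than shown):

```c
#include <stdint.h>

typedef struct {
    // ... existing Stage 1 and QJL payload ...
    uint16_t residual_norm_fp16;   // ‖r‖₂ of the Stage 1 residual, stored as FP16
} block_tq_turbo;
```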

### 4. Inner product estimator

The current `tq_turbo_attention_ref` does a straight dot product on the dequantized sum. Google's estimator combines the two stages explicitly with the residual norm. This affects accuracy at low bit budgets.
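
In the comparison table's notation, the estimator we need (my reading of the paper, so treat the exact form as an assumption to verify) is

⟨y, x⟩ ≈ ⟨y, x̃_mse⟩ + ‖r‖₂ · ⟨y, Q_qjl⁻¹(Q_qjl(r))⟩

i.e. the dot product against the Stage 1 reconstruction, plus the QJL estimate of the (normalized) residual's dot product, rescaled by the stored ‖r‖₂.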

## What we already have (good news)

| Building block | Status |
|---|---|
| `tq_rht.c` Random Hadamard Transform | ✅ Implemented |
| `tq_qjl.c` Quantized Johnson-Lindenstrauss | ✅ Implemented (1-bit sketch) |
| `tq_codebook.c` Codebook quantizer infrastructure | ✅ Implemented |
| `tq_turbo.c` 2-stage composition framework | ✅ Implemented (wrong Stage 1) |
| Plugin architecture (`tq_traits.c` 3-function registration) | ✅ Implemented |
| Block-wise KV cache with per-step quantization | ✅ Implemented |
| Multi-architecture inference (Llama 3, Qwen, Gemma, etc.) | ✅ Implemented |

We have ~90% of the infrastructure. The missing 10% is:
1. Compose RHT + Lloyd-Max into a new `TQ_TURBOQUANT_*` type
2. Precompute Lloyd-Max centroids
3. Wire the proper inner product estimator

## Implementation plan

### Phase 1: Add `TQ_TURBOQUANT_3B` and `TQ_TURBOQUANT_4B` types (1–2 days)

```c
// New block layout
typedef struct {
    uint16_t residual_norm_fp16;          // ‖r‖₂ stored as FP16
    uint8_t  rotated_quant[BLOCK_BYTES];  // Lloyd-Max codes after rotation
    uint8_t  qjl_residual[QJL_BYTES];     // 1-bit QJL on residual
} block_tq_turboquant_3b;

void tq_turboquant_quantize_ref(const float* src, void* dst, int n) {
    block_tq_turboquant_3b* block = (block_tq_turboquant_3b*)dst;

    // 1. Apply random Hadamard transform: x̃ = Π·x
    float rotated[MAX_DIM];
    tq_rht_apply(src, rotated, n, /*seed=*/FIXED_SEED);

    // 2. Lloyd-Max quantize each coordinate
    quantize_lloyd_max(rotated, block->rotated_quant, n, /*bits=*/2);

    // 3. Compute residual r = x̃ − dequant(rotated_quant)
    float recon[MAX_DIM], residual[MAX_DIM];
    dequant_lloyd_max(block->rotated_quant, recon, n, 2);
    for (int i = 0; i < n; i++) residual[i] = rotated[i] - recon[i];

    // 4. Store ‖r‖₂ in FP16
    float r_norm = vec_l2_norm(residual, n);
    block->residual_norm_fp16 = fp32_to_fp16(r_norm);

    // 5. Normalize residual (guarding r_norm == 0) and apply 1-bit QJL
    if (r_norm > 0.0f) {
        for (int i = 0; i < n; i++) residual[i] /= r_norm;
    }
    tq_qjl_quantize_ref(residual, block->qjl_residual, n);
}
```

### Phase 2: Precompute Lloyd-Max centroids (4 hours)

Run Lloyd-Max iteration offline for d=64, 128, 256, 512, 1024 at b=1,2,3,4,5 bits. Store the results as static const float arrays in `tq_codebook.c`, and spot-check them against the closed-form approximation cited in the paper.
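
A minimal offline training sketch, assuming it is fed coordinates sampled from the post-rotation distribution (e.g. coordinates of randomly rotated unit vectors, or N(0, 1/d) samples as a large-d approximation); `lloyd_max_train` is an illustrative name, not an existing quant.cpp function:

```c
#include <float.h>
#include <math.h>
#include <stdlib.h>

// Classic Lloyd-Max (1-D k-means) on a sample of post-rotation coordinates.
static void lloyd_max_train(const float *samples, int n_samples,
                            float *centroids, int n_levels, int iters) {
    // Init: spread centroids evenly over the sample range.
    float lo = FLT_MAX, hi = -FLT_MAX;
    for (int i = 0; i < n_samples; i++) {
        if (samples[i] < lo) lo = samples[i];
        if (samples[i] > hi) hi = samples[i];
    }
    for (int k = 0; k < n_levels; k++)
        centroids[k] = lo + (hi - lo) * (k + 0.5f) / (float)n_levels;

    float *sum = calloc(n_levels, sizeof(float));
    int   *cnt = calloc(n_levels, sizeof(int));

    for (int it = 0; it < iters; it++) {
        for (int k = 0; k < n_levels; k++) { sum[k] = 0.0f; cnt[k] = 0; }
        // Assignment step: each sample goes to its nearest centroid.
        for (int i = 0; i < n_samples; i++) {
            int best = 0;
            float best_d = fabsf(samples[i] - centroids[0]);
            for (int k = 1; k < n_levels; k++) {
                float d = fabsf(samples[i] - centroids[k]);
                if (d < best_d) { best_d = d; best = k; }
            }
            sum[best] += samples[i];
            cnt[best] += 1;
        }
        // Update step: each centroid moves to the mean of its assigned samples.
        for (int k = 0; k < n_levels; k++)
            if (cnt[k] > 0) centroids[k] = sum[k] / (float)cnt[k];
    }
    free(sum);
    free(cnt);
}
```

The resulting `centroids` arrays are what gets dumped into the static tables in `tq_codebook.c`.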

### Phase 3: Inner product estimator (4 hours)

```c
void tq_turboquant_attention_ref(const float* query, const void* kv,
                                 float* scores, int seq_len, int head_dim) {
    const block_tq_turboquant_3b* blocks = (const block_tq_turboquant_3b*)kv;

    // Apply same rotation to query (or use a pre-rotated query)
    float q_rot[MAX_DIM];
    tq_rht_apply(query, q_rot, head_dim, FIXED_SEED);

    for (int s = 0; s < seq_len; s++) {
        const block_tq_turboquant_3b* b = &blocks[s];

        // Stage 1: ⟨q_rot, x̃_mse⟩
        float dot1 = 0.0f;
        float k_recon[MAX_DIM];
        dequant_lloyd_max(b->rotated_quant, k_recon, head_dim, 2);
        for (int d = 0; d < head_dim; d++) dot1 += q_rot[d] * k_recon[d];

        // Stage 2: ‖r‖₂·⟨q_rot, Q_qjl⁻¹(qjl)⟩
        float r_norm = fp16_to_fp32(b->residual_norm_fp16);
        float qjl_dot = tq_qjl_dot_with_query(q_rot, b->qjl_residual, head_dim);

        scores[s] = dot1 + r_norm * qjl_dot;
    }
}
```

### Phase 4: Validation against the paper (2 days)

Reproduce the paper's headline numbers within ±1 point:

| Model | Method | Paper LongBench-E | Our target |
|---|---|---|---|
| Llama-3.1-8B | TurboQuant 2.5-bit | 49.44 | 48.4–50.4 |
| Llama-3.1-8B | TurboQuant 3.5-bit | 50.06 | 49.0–51.0 |
| Llama-3.1-8B | Full cache | 50.06 | (baseline) |
| Ministral-7B | TurboQuant 2.5-bit | 49.62 | 48.6–50.6 |
| Ministral-7B | Full cache | 49.89 | (baseline) |

Also reproduce the Needle-in-a-Haystack score: 0.997 vs the 1.000 full-cache baseline.

### Phase 5: llama.cpp PR (3–5 days)

Once Phase 4 passes, port the kernel to ggml as a new `GGML_TYPE_TURBOQUANT_K` and submit to [Discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969). Differentiators vs the existing forks: (a) reproduces the paper's numbers, (b) clean ggml type registration, (c) backed by an end-to-end working C reference.

## Naming hygiene going forward

| Type | Status | Description |
|---|---|---|
| `TQ_POLAR_3B` / `TQ_POLAR_4B` | ✅ kept | PolarQuant-style: polar coordinate quantization (predates the Google paper) |
| `TQ_TURBO_3B` / `TQ_TURBO_4B` | ✅ kept | Turbo-style: PolarQuant + QJL residual (our original composition) |
| `TQ_TURBOQUANT_3B` / `TQ_TURBOQUANT_4B` | new | Google TurboQuant (ICLR 2026): RHT + Lloyd-Max + 1-bit QJL residual + ‖r‖ scalar |

This way users can choose between the two compositions and compare them directly.

## Expected outcome

After Phases 1–4, quant.cpp becomes the **first single-header C implementation** of the published TurboQuant algorithm with reproduced paper numbers. After Phase 5, we have a credible llama.cpp PR with the strongest narrative in Discussion #20969.