
Commit 26c1755

unamedkr and claude committed
Add gap analysis: paper TurboQuant vs current quant.cpp impl
Documents the precise differences between Google's published TurboQuant algorithm (Zandieh et al., ICLR 2026, arXiv:2504.19874) and our existing TQ_TURBO_* types, which actually implement an earlier PolarQuant-based composition.

Key finding: our 2-stage architecture (large quantizer + 1-bit QJL residual) is correct, but Stage 1 uses polar coordinate min-max quantization instead of the paper's random rotation + Lloyd-Max quantizer. We need to add a new TQ_TURBOQUANT_* type that composes the existing tq_rht.c (Walsh-Hadamard + Rademacher) with Lloyd-Max centroids in tq_codebook.c — both already in the codebase.

Contains a 5-phase implementation plan ending in a llama.cpp PR to Discussion #20969.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 file changed: docs/turboquant-gap-analysis.md (168 additions, 0 deletions)
# TurboQuant Gap Analysis: Paper vs quant.cpp
> Comparison of Google TurboQuant (Zandieh, Daliri, Hadian, Mirrokni — ICLR 2026, [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)) against the current `tq_polar` / `tq_qjl` / `tq_turbo` implementations in quant.cpp.
## TL;DR
quant.cpp's existing `TQ_TURBO_*` types implement an *earlier* generation of the algorithm — specifically the scheme from the PolarQuant paper ([arXiv:2502.02617](https://arxiv.org/abs/2502.02617)) plus a QJL residual. **They do not implement the algorithm published as "TurboQuant" by Google in April 2026.**
The 2-stage architecture (large quantizer + 1-bit QJL residual) is correct, but Stage 1 is the wrong quantizer. We need a new type — `TQ_TURBOQUANT_*` — that implements Google's actual algorithm.
## Algorithm comparison
| Component | Google TurboQuant (ICLR 2026) | quant.cpp `TQ_TURBO_*` (current) |
|---|---|---|
| **Stage 1 transform** | Random orthogonal rotation Π via QR(N(0,1)) | Polar coordinate conversion `(x,y) → (r,θ)` |
| **Stage 1 quantizer** | Lloyd-Max scalar quantizer (Beta-distribution-aware centroids) | Min-max linear quantization |
| **Stage 1 bit budget** | (b−1) bits per coordinate (e.g. 2.5 bits at a 3.5-bit total) | 2 bits θ + 2 bits r per pair = 2 bpc |
| **Stage 2 residual** | 1-bit QJL on residual + ‖r‖₂ scalar | QJL on residual (no explicit norm) |
| **Block size** | None — operates on the full d-dim vector | TQ_BK = 128 |
| **Outlier handling** | Per-channel: 32 outlier channels at higher bit width | None |
| **Inner product estimator** | ⟨y, x̃_mse⟩ + ‖r‖₂·⟨y, Q_qjl⁻¹(Q_qjl(r))⟩ | ⟨y, x̃_polar + x̃_qjl⟩ (no norm) |
| **Centroid storage** | Precomputed (Lloyd-Max output cached) | Per-block min/max in FP16 |
## What's missing
### 1. Random rotation matrix Π
Google generates Π by QR-decomposing a Gaussian random matrix. For our embedded targets we need a fast deterministic alternative:
- **Hadamard transform** (Walsh-Hadamard) — already used by Arclabs001's research (cited in llama.cpp #20969 as actually outperforming random rotation in their tests)
- **Householder reflectors** (small constant memory)
- **Givens rotation network**
- **Pseudo-random Rademacher diagonal × WHT × Rademacher** — what our `tq_rht.c` already implements!
Our `tq_rht.c` (Random Hadamard Transform) is exactly what Google's "random rotation" needs. We just don't compose it with the right quantizer.
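For orientation, the whole construction is small. Below is a minimal sketch of the sign-flip-then-Hadamard rotation, assuming a power-of-two `n` and showing a single Rademacher layer for brevity; the function names and sign hash are illustrative, not the actual `tq_rht.c` interface:

```c
#include <math.h>
#include <stdint.h>

// Illustrative sketch; not the actual tq_rht.c interface.
// Deterministic ±1 "Rademacher" sign from a hash of (seed, index).
static inline float rademacher_sign(uint64_t seed, int i) {
    uint64_t h = (seed ^ (uint64_t)i) * 0x9E3779B97F4A7C15ULL;
    return (h >> 63) ? -1.0f : 1.0f;
}

// dst = (1/√n) · H · D · src — an orthonormal pseudo-random rotation.
void rht_apply_sketch(const float* src, float* dst, int n, uint64_t seed) {
    // 1. Rademacher diagonal D: random sign flip per coordinate
    for (int i = 0; i < n; i++) dst[i] = src[i] * rademacher_sign(seed, i);

    // 2. In-place fast Walsh-Hadamard transform H, O(n log n)
    for (int len = 1; len < n; len <<= 1) {
        for (int i = 0; i < n; i += len << 1) {
            for (int j = i; j < i + len; j++) {
                float a = dst[j], b = dst[j + len];
                dst[j]       = a + b;
                dst[j + len] = a - b;
            }
        }
    }

    // 3. Scale by 1/√n so the composite transform preserves norms
    const float scale = 1.0f / sqrtf((float)n);
    for (int i = 0; i < n; i++) dst[i] *= scale;
}
```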
### 2. Lloyd-Max scalar quantizer
After random rotation, the rotated coordinates follow a (concentrated) Beta distribution. The optimal quantizer for this distribution is **not** uniform min-max — it's Lloyd-Max with precomputed centroids:
- 1-bit: `{±√(2/(πd))}`
- 2-bit: `{±0.453/√d, ±1.51/√d}`
- 3-bit, 4-bit, 5-bit: derived numerically from Beta(d/2, d/2)
We already have `src/core/tq_codebook.c` for codebook quantization — we just need to populate it with the Lloyd-Max centroids the paper specifies (or computes).
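As a sketch of what wiring those centroids in could look like, here is a nearest-centroid 2-bit encoder using the values above. The table name and scaling convention are illustrative, not the existing `tq_codebook.c` layout:

```c
#include <math.h>

// Sketch (illustrative names); not the existing tq_codebook.c layout.
// 2-bit Lloyd-Max centroids at unit scale; multiplied by 1/√d at runtime
// to match the {±0.453/√d, ±1.51/√d} values above.
static const float TQ_LLOYD_MAX_2BIT[4] = { -1.51f, -0.453f, 0.453f, 1.51f };

// Return the 2-bit code of the centroid nearest to x.
static inline int lloyd_max_encode_2bit(float x, float inv_sqrt_d) {
    int best = 0;
    float best_dist = fabsf(x - TQ_LLOYD_MAX_2BIT[0] * inv_sqrt_d);
    for (int c = 1; c < 4; c++) {
        float dist = fabsf(x - TQ_LLOYD_MAX_2BIT[c] * inv_sqrt_d);
        if (dist < best_dist) { best_dist = dist; best = c; }
    }
    return best;
}
```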
### 3. Stored ‖r‖₂
The residual norm is a single FP16 scalar per vector. We don't store this currently. Trivial to add to the `block_tq_turbo` struct.
### 4. Inner product estimator
The current `tq_turbo_attention_ref` does a straight dot product on the dequantized sum. Google's estimator combines the two stages explicitly with the residual norm, which matters for accuracy at low bit budgets.
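Spelled out in the notation of the comparison table, with x̃_mse the Stage 1 reconstruction of a key x and r = Π·x − x̃_mse its residual, the estimator is

⟨y, x⟩ ≈ ⟨y, x̃_mse⟩ + ‖r‖₂ · ⟨y, Q_qjl⁻¹(Q_qjl(r̂))⟩, where r̂ = r/‖r‖₂

i.e. the stored norm rescales the 1-bit QJL estimate of the normalized residual. Omitting the ‖r‖₂ factor, as the current code effectively does, skews scores whenever residual energy varies across cache entries.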
## What we already have (good news)
| Building block | Status |
|---|---|
| `tq_rht.c` Random Hadamard Transform | ✅ Implemented |
| `tq_qjl.c` Quantized Johnson-Lindenstrauss | ✅ Implemented (1-bit sketch) |
| `tq_codebook.c` Codebook quantizer infrastructure | ✅ Implemented |
| `tq_turbo.c` 2-stage composition framework | ✅ Implemented (wrong Stage 1) |
| Plugin architecture (`tq_traits.c` 3-function registration) | ✅ Implemented |
| Block-wise KV cache with per-step quantization | ✅ Implemented |
| Multi-architecture inference (Llama 3, Qwen, Gemma, etc.) | ✅ Implemented |
We have ~90% of the infrastructure. The missing 10% is:
1. Compose RHT + Lloyd-Max into a new `TQ_TURBOQUANT_*` type
2. Precompute Lloyd-Max centroids
3. Wire the proper inner product estimator
## Implementation plan
### Phase 1: Add `TQ_TURBOQUANT_3B` and `TQ_TURBOQUANT_4B` types (1–2 days)
```c
// New block layout
typedef struct {
    uint16_t residual_norm_fp16;         // ‖r‖₂ stored as FP16
    uint8_t  rotated_quant[BLOCK_BYTES]; // Lloyd-Max codes after rotation
    uint8_t  qjl_residual[QJL_BYTES];    // 1-bit QJL on residual
} block_tq_turboquant_3b;

void tq_turboquant_quantize_ref(const float* src, void* dst, int n) {
    block_tq_turboquant_3b* block = (block_tq_turboquant_3b*)dst;

    // 1. Apply random Hadamard transform: x̃ = Π·x
    float rotated[MAX_DIM];
    tq_rht_apply(src, rotated, n, /*seed=*/FIXED_SEED);

    // 2. Lloyd-Max quantize each coordinate
    quantize_lloyd_max(rotated, block->rotated_quant, n, /*bits=*/2);

    // 3. Compute residual r = x̃ − dequant(rotated_quant)
    float recon[MAX_DIM], residual[MAX_DIM];
    dequant_lloyd_max(block->rotated_quant, recon, n, /*bits=*/2);
    for (int i = 0; i < n; i++) residual[i] = rotated[i] - recon[i];

    // 4. Store ‖r‖₂ in FP16
    float r_norm = vec_l2_norm(residual, n);
    block->residual_norm_fp16 = fp32_to_fp16(r_norm);

    // 5. Normalize residual and apply 1-bit QJL (guard the zero-residual case)
    if (r_norm > 0.0f)
        for (int i = 0; i < n; i++) residual[i] /= r_norm;
    tq_qjl_quantize_ref(residual, block->qjl_residual, n);
}
```
### Phase 2: Precompute Lloyd-Max centroids (4 hours)
Run Lloyd-Max iteration offline for d = 64, 128, 256, 512, 1024 at b = 1, 2, 3, 4, 5 bits. Store the results as static const float arrays in `tq_codebook.c`, and use the paper's closed-form approximation as a spot-check.
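A minimal offline training loop, for orientation. It relies on the large-d approximation that rotated coordinates are roughly N(0, 1/d) samples; the function name is hypothetical, and the caller supplies the samples and an initial centroid guess:

```c
#include <math.h>

// Hypothetical offline helper; not in quant.cpp.
// One-dimensional Lloyd-Max (k-means): alternate nearest-centroid assignment
// with centroid re-estimation until the codebook settles. Requires k ≤ 64.
void lloyd_max_train(float* centroids, int k,
                     const float* samples, int n_samples, int iters) {
    for (int it = 0; it < iters; it++) {
        double sum[64] = {0};
        int    cnt[64] = {0};
        for (int i = 0; i < n_samples; i++) {
            // Assign each sample to its nearest centroid
            int best = 0;
            float best_dist = fabsf(samples[i] - centroids[0]);
            for (int c = 1; c < k; c++) {
                float dist = fabsf(samples[i] - centroids[c]);
                if (dist < best_dist) { best_dist = dist; best = c; }
            }
            sum[best] += samples[i];
            cnt[best] += 1;
        }
        // Lloyd step: move each centroid to the mean of its cell
        for (int c = 0; c < k; c++)
            if (cnt[c] > 0) centroids[c] = (float)(sum[c] / cnt[c]);
    }
}
```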
### Phase 3: Inner product estimator (4 hours)
```c
void tq_turboquant_attention_ref(const float* query, const void* kv,
                                 float* scores, int seq_len, int head_dim) {
    const block_tq_turboquant_3b* blocks = (const block_tq_turboquant_3b*)kv;

    // Apply the same rotation to the query (or use a pre-rotated query)
    float q_rot[MAX_DIM];
    tq_rht_apply(query, q_rot, head_dim, FIXED_SEED);

    for (int s = 0; s < seq_len; s++) {
        const block_tq_turboquant_3b* b = &blocks[s];

        // Stage 1: ⟨q_rot, x̃_mse⟩
        float dot1 = 0.0f;
        float k_recon[MAX_DIM];
        dequant_lloyd_max(b->rotated_quant, k_recon, head_dim, /*bits=*/2);
        for (int d = 0; d < head_dim; d++) dot1 += q_rot[d] * k_recon[d];

        // Stage 2: ‖r‖₂·⟨q_rot, Q_qjl⁻¹(qjl)⟩
        float r_norm = fp16_to_fp32(b->residual_norm_fp16);
        float qjl_dot = tq_qjl_dot_with_query(q_rot, b->qjl_residual, head_dim);

        scores[s] = dot1 + r_norm * qjl_dot;
    }
}
```
### Phase 4: Validation against the paper (2 days)
Reproduce the paper's headline numbers within ±1%:
| Model | Method | Paper LongBench-E | Our target |
|---|---|---|---|
| Llama-3.1-8B | TurboQuant 2.5-bit | 49.44 | 48.4–50.4 |
| Llama-3.1-8B | TurboQuant 3.5-bit | 50.06 | 49.0–51.0 |
| Llama-3.1-8B | Full cache | 50.06 | (baseline) |
| Ministral-7B | TurboQuant 2.5-bit | 49.62 | 48.6–50.6 |
| Ministral-7B | Full cache | 49.89 | (baseline) |
Plus Needle-in-a-Haystack: 0.997 vs the 1.000 full-cache baseline.
### Phase 5: llama.cpp PR (3–5 days)
Once Phase 4 passes, port the kernel to ggml as a new `GGML_TYPE_TURBOQUANT_K` and submit to [Discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969). Differentiators vs the existing forks: (a) reproduces the paper's numbers, (b) clean ggml type registration, (c) backed by an end-to-end working C reference.
## Naming hygiene going forward
| Type | Status | Description |
|---|---|---|
| `TQ_POLAR_3B` / `TQ_POLAR_4B` | ✅ kept | PolarQuant-style: polar coordinate quantization (predates the Google paper) |
| `TQ_TURBO_3B` / `TQ_TURBO_4B` | ✅ kept | Turbo-style: PolarQuant + QJL residual (our original composition) |
| `TQ_TURBOQUANT_3B` / `TQ_TURBOQUANT_4B` | new | Google TurboQuant (ICLR 2026): RHT + Lloyd-Max + 1-bit QJL residual + ‖r‖ scalar |
This way users can choose between the two compositions and compare directly.
## Expected outcome
After Phases 1–4, quant.cpp becomes the **first single-header C implementation** of the published TurboQuant algorithm with reproduced paper numbers. After Phase 5, we have a credible llama.cpp PR with the strongest narrative in Discussion #20969.
