Commit ac3c46a
turbo_kv: Variant F BREAKTHROUGH — drop dead QJL stage, double the codebook
The Karpathy-loop ablation in commit 4da6915 showed that the QJL residual
correction stage contributed exactly zero to the final attention scores:
the output was *byte-identical* with the stage disabled. The implementation
was correct in form, but the constant √(π/2)/m on Rademacher rows yields a
correction that scales with the residual norm, which is tiny after a good
codebook fit.
Variant F: drop QJL entirely from turbo_kv_3b/4b and reinvest the freed
16 bytes of qjl_signs into a *larger* codebook. Same block size, one
extra bit of resolution per element.
Layout change (turbo_kv_4b, still 72 bytes):
  before: 8 hdr + 48 mse_3bit + 16 qjl_signs
  after:  8 hdr + 64 mse_4bit
Layout change (turbo_kv_3b, still 56 bytes):
  before: 8 hdr + 32 mse_2bit + 16 qjl_signs
  after:  8 hdr + 48 mse_3bit
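The byte budgets above can be checked with a few lines of arithmetic. This is only a sketch: the 128-element block size is an assumption inferred from the 16-byte qjl_signs field (one sign bit per element), and `packed_bytes` and `layouts` are illustrative names, not identifiers from the repository.

```python
# Assumption: 128 elements per block, inferred from 16 bytes of 1-bit signs.
BLOCK_ELEMS = 128
HDR_BYTES = 8  # header size as stated in the layout description


def packed_bytes(bits_per_elem, n=BLOCK_ELEMS):
    """Bytes needed to pack n elements at the given bit width."""
    return n * bits_per_elem // 8


layouts = {
    "turbo_kv_4b": {
        "before": HDR_BYTES + packed_bytes(3) + packed_bytes(1),  # 3-bit MSE + signs
        "after": HDR_BYTES + packed_bytes(4),                     # 4-bit MSE only
    },
    "turbo_kv_3b": {
        "before": HDR_BYTES + packed_bytes(2) + packed_bytes(1),  # 2-bit MSE + signs
        "after": HDR_BYTES + packed_bytes(3),                     # 3-bit MSE only
    },
}

for name, sizes in layouts.items():
    bits_per_elem = sizes["after"] * 8 / BLOCK_ELEMS
    print(name, sizes, f"{bits_per_elem:.2f} bits/element including header")
```

Under that assumption the block size is unchanged in both configs, and the sign bit per element is simply reassigned to the codebook index.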
Combined with max-abs scaling (Variant B winner from round 2), the
resulting estimator is single-stage: RHT + 2^b-level Lloyd-Max codebook
+ ‖x‖. It is cleaner than the paper's two-stage estimator and empirically
much better on our perplexity benchmark.
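A minimal pure-Python sketch of a single-stage pipeline of this shape follows (random sign flip + Hadamard transform, max-abs scaling, nearest-level codebook lookup). It is not the repository's implementation. The function names (`fwht`, `lloyd_max`, `quantize`, `dequantize`), the 128-element block, and the codebook trained on synthetic Gaussian samples are all assumptions made for illustration, and the separate ‖x‖ handling is omitted because the commit message does not spell out how it combines with the max-abs scale.

```python
import math
import random


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    norm = 1.0 / math.sqrt(len(v))
    return [x * norm for x in v]


def nearest(codebook, x):
    """Index of the codebook level closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))


def lloyd_max(samples, levels, iters=20):
    """1-D Lloyd iterations on samples, initialised at evenly spaced quantiles."""
    xs = sorted(samples)
    cb = [xs[(2 * i + 1) * len(xs) // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        sums, counts = [0.0] * levels, [0] * levels
        for x in xs:
            k = nearest(cb, x)
            sums[k] += x
            counts[k] += 1
        cb = [sums[i] / counts[i] if counts[i] else cb[i] for i in range(levels)]
    return cb


def quantize(x, signs, cb):
    """Sign flip + Hadamard rotation, max-abs scale, then codebook indices."""
    y = fwht([s * v for s, v in zip(signs, x)])
    scale = max(abs(v) for v in y) or 1.0
    return scale, [nearest(cb, v / scale) for v in y]


def dequantize(scale, idx, signs, cb):
    """Invert the rotation: the orthonormal Hadamard matrix is its own inverse."""
    y = [cb[i] * scale for i in idx]
    return [s * v for s, v in zip(signs, fwht(y))]


rng = random.Random(0)
N = 128  # assumed block size
signs = [rng.choice((-1.0, 1.0)) for _ in range(N)]

# Assumption: train the codebook on max-abs-normalised synthetic Gaussian blocks.
train = []
for _ in range(16):
    blk = [rng.gauss(0.0, 1.0) for _ in range(N)]
    m = max(abs(v) for v in blk)
    train += [v / m for v in blk]

x = [rng.gauss(0.0, 1.0) for _ in range(N)]
rel_err = {}
for bits in (3, 4):
    cb = lloyd_max(train, 2 ** bits)
    scale, idx = quantize(x, signs, cb)
    x_hat = dequantize(scale, idx, signs, cb)
    num = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    rel_err[bits] = math.sqrt(num / sum(a * a for a in x))
print(rel_err)
```

The demo compares relative reconstruction error at 3 and 4 bits on one synthetic block; it says nothing about perplexity, which depends on the real model and data.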
Llama 3.2 3B PPL on bench/data/ppl_1k.txt (FP32 baseline = 13.56):
Config         | Before | After (Variant F) | Δ      | vs uniform_4b
---------------|--------|-------------------|--------|-----------------
uniform_4b     | 14.41  | 14.41             | 0      | reference
turbo_kv_4b    | 16.03  | 14.28             | -1.75  | BEATS by 0.13 ⭐
turbo_kv_3b    | 25.84  | 15.39             | -10.45 | within 1.0
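The Δ and "vs uniform_4b" columns can be recomputed from the reported perplexities. The sketch below uses only the numbers in the table; the `rows` dictionary and the percentage-over-FP32 column are illustrative additions, not output of the project's benchmark tooling.

```python
FP32_PPL = 13.56  # FP32 baseline from the benchmark header

# (before, after) perplexities copied from the table above.
rows = {
    "uniform_4b": (14.41, 14.41),
    "turbo_kv_4b": (16.03, 14.28),
    "turbo_kv_3b": (25.84, 15.39),
}

reference = rows["uniform_4b"][1]
summary = {}
for name, (before, after) in rows.items():
    summary[name] = {
        "delta": round(after - before, 2),            # change from Variant F
        "vs_uniform_4b": round(after - reference, 2),  # negative means better
        "pct_over_fp32": round(100.0 * (after - FP32_PPL) / FP32_PPL, 2),
    }
    print(name, summary[name])
```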
turbo_kv_4b is now the best 4-bit KV quantization in the project, beating
the previous production champion uniform_4b at the same bit budget. This
also closes a major part of the gap to Google TurboQuant — we can now
honestly say "TurboQuant-class compression" instead of "TurboQuant
structure with broken numbers".
Tests: 35/35 passing. The QJLSignsNonZero test was removed (no longer
applies — see test comment).
Karpathy loop summary:
  Round 1 (empirical std):       4b 15.87      3b 25.07
  Round 2 (max-abs / no clip):   4b 15.39      3b 84.97 ❌ revert
  Round 3 (99th percentile):     4b 17.24 ❌ revert
  Round 4 (K*std sweep):         best K=2.0 → 15.53 (worse than B)
  Round 5 (uniform linear):      4b 16.28 ❌ revert
  Round 6 (Variant F: 4-bit cb): 4b 14.28 ✅   3b 15.39 ✅ HIT
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent fe6df17 commit ac3c46a
3 files changed
Lines changed: 90 additions & 265 deletions