New TQ_TYPE_TURBO_KV_4BO type. Each block stores the 8 channels with
the largest |rotated[i]| as exact FP16 values (with their indices) on
top of the existing Variant F 4-bit Lloyd-Max codebook. At dequant
time these 8 positions are overwritten with the stored exact values,
eliminating the worst quantization errors per block.
This is a simpler, local form of the per-channel outlier handling
described in the Google TurboQuant paper.
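For illustration only, a minimal sketch of the dequant-time overwrite,
assuming a 128-value block; apply_outliers() and fp16_to_fp32() are
hypothetical names, the latter standing in for whatever half-to-float
helper the codebase uses:

#include <stdint.h>

extern float fp16_to_fp32(uint16_t h); /* stand-in for the project's half->float helper */

/* After the Variant F 4-bit base has been dequantized into dst[0..127],
   patch the 8 stored outlier positions with their exact FP16 values. */
static void apply_outliers(float *dst, const uint8_t out_indices[8],
                           const uint16_t out_values[8]) {
    for (int k = 0; k < 8; ++k) {
        dst[out_indices[k]] = fp16_to_fp32(out_values[k]);
    }
}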
Llama 3.2 3B PPL on bench/data/ppl_1k.txt (FP32 = 13.56):
  turbo_kv_4b   14.28  (+5.3%)   72 B/block
  turbo_kv_4bo  13.86  (+2.2%)   96 B/block  ← gap cut by 58%
  turbo_kv_5b   13.60  (+0.34%)  88 B/block

SmolLM2 135M PPL (FP32 = 18.62):
  turbo_kv_4b   19.70  (+5.8%)
  turbo_kv_4bo  19.29  (+3.6%)  ← gap cut by 38%
  turbo_kv_5b   18.94  (+1.7%)
The technique works (it validates Issue #15's per-channel outlier
hypothesis), but at 96 bytes this variant is currently larger than 5b
(88 B) without matching its quality. The next iteration will combine
outliers with a 3-bit base codebook (turbo_kv_3bo, ~80 bytes) to test
whether outliers plus a smaller base can beat 5b at a smaller block
size.
Block layout (96 bytes):
norm(2) + residual_norm(2) + inv_std(2) + _pad(2)
mse_indices[64] // 4-bit packed (Variant F base)
out_indices[8] // 1 byte per outlier
out_values[8] // FP16 per outlier
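For illustration, one way this 96-byte layout could map to a C struct;
the struct and typedef names are assumptions, and the 128-value block
size is inferred from the 64 bytes of 4-bit packed indices:

#include <stdint.h>

typedef uint16_t fp16_t;           /* raw IEEE-754 half bits */

typedef struct {
    fp16_t  norm;                  /*  2 B */
    fp16_t  residual_norm;         /*  2 B */
    fp16_t  inv_std;               /*  2 B */
    fp16_t  _pad;                  /*  2 B */
    uint8_t mse_indices[64];       /* 64 B: 128 x 4-bit Variant F codebook indices */
    uint8_t out_indices[8];        /*  8 B: positions of the 8 outlier channels */
    fp16_t  out_values[8];         /* 16 B: exact FP16 outlier values */
} block_turbo_kv_4bo;              /* 8 + 64 + 8 + 16 = 96 bytes */

_Static_assert(sizeof(block_turbo_kv_4bo) == 96, "96-byte block");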
The quantizer finds the top-K outliers by |rotated| and stores them
verbatim. Codebook scaling uses body-only max-abs (excluding the
outliers), so the codebook doesn't waste resolution on tails the
outliers already capture exactly.
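A minimal sketch of the outlier selection and body-only max-abs
scaling, assuming a 128-value block; pick_outliers() and
body_max_abs() are hypothetical helper names, not the repo's API:

#include <math.h>
#include <stdint.h>

enum { QK = 128, N_OUT = 8 };

/* Select the N_OUT positions with the largest |rotated[i]|
   (naive selection; a partial sort would do the same job). */
static void pick_outliers(const float *rotated, uint8_t out_idx[N_OUT]) {
    uint8_t taken[QK] = {0};
    for (int k = 0; k < N_OUT; ++k) {
        int best = 0; float best_abs = -1.0f;
        for (int i = 0; i < QK; ++i) {
            float a = fabsf(rotated[i]);
            if (!taken[i] && a > best_abs) { best_abs = a; best = i; }
        }
        taken[best] = 1;
        out_idx[k] = (uint8_t) best;
    }
}

/* Body-only max-abs: derive the codebook scale from non-outlier
   channels only, so 4-bit resolution is not spent on tails the
   outliers already store exactly. */
static float body_max_abs(const float *rotated, const uint8_t out_idx[N_OUT]) {
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        int is_outlier = 0;
        for (int k = 0; k < N_OUT; ++k) {
            if (out_idx[k] == i) { is_outlier = 1; break; }
        }
        if (!is_outlier) {
            float a = fabsf(rotated[i]);
            if (a > amax) amax = a;
        }
    }
    return amax;
}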
35/35 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>