arXiv draft: populate Section 4.3 with cross-size validation
Three-model Pareto table (135M / 1B / 3B), explicit cross-size pattern
observations, and an honest deferral note for the Llama 3.1 8B
reproduction (memory constraints on the 16 GB test machine).
1. **`turbo_kv_5b` (5-bit) is consistently near-lossless** across model sizes — PPL Δ stays at 0.7–1.7%. The Lloyd-Max-Gaussian 32-level codebook captures enough resolution that the rotation-then-quantize round-trip preserves attention scores almost exactly, regardless of the underlying model's KV distribution (see the sketch after this list).
2. **`turbo_kv_4b` quality is 5–8% PPL Δ across sizes**, slightly worse on smaller models. The 16-level codebook is the right production trade-off: under 6% PPL degradation at 7× compression.
3. **`turbo_kv_3b` is unsuitable for models below 3B parameters**. PPL jumps from +13.3% on 3B to +61% on 1B. The 8-level codebook is too coarse for the more concentrated KV distributions of small models. We recommend `turbo_kv_3b` only for models ≥ 3B parameters.
4. **Speed gap to fp32 widens on smaller models** (−7% on 3B → −14% on 1B → −20% on 135M). When the attention matmuls are small, fixed per-token costs are a larger fraction of total work, so the (small) per-block dequantization overhead dominates the gap.
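
To make the round trip in item 1 concrete, here is a minimal, self-contained sketch of a Lloyd-Max codebook fitted to a unit Gaussian, combined with a random-rotation quantize/dequantize pass. It is illustrative only, not the project's implementation: the function names (`fit_lloyd_max`, `random_rotation`, `quantize_roundtrip`), the single per-block scale, and the QR-based rotation are assumptions made for the example.

```python
# Illustrative sketch only — not the turbo_kv implementation.
import numpy as np

def fit_lloyd_max(n_levels: int, n_samples: int = 200_000,
                  iters: int = 30, seed: int = 0) -> np.ndarray:
    """Fit an n_levels scalar codebook to N(0, 1) samples with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    # Initialise centroids at evenly spaced empirical Gaussian quantiles.
    centroids = np.quantile(x, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2      # nearest-neighbour cell boundaries
        idx = np.searchsorted(edges, x)                    # assign each sample to a cell
        centroids = np.array([x[idx == k].mean() if np.any(idx == k) else centroids[k]
                              for k in range(n_levels)])   # move centroids to cell means
    return centroids

def random_rotation(d: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal matrix (QR of a Gaussian), used to 'Gaussianise' KV vectors."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))   # sign fix so the rotation is uniformly distributed

def quantize_roundtrip(kv: np.ndarray, codebook: np.ndarray, rot: np.ndarray) -> np.ndarray:
    """Rotate, quantize each coordinate to its nearest codeword, dequantize, un-rotate."""
    z = kv @ rot
    scale = z.std() + 1e-8                                 # one scale for the whole block here
    idx = np.abs(z[..., None] / scale - codebook).argmin(-1)
    return (codebook[idx] * scale) @ rot.T

if __name__ == "__main__":
    d = 64
    codebook = fit_lloyd_max(32)                           # 32 levels = 5-bit, as in turbo_kv_5b
    rot = random_rotation(d)
    kv = np.random.default_rng(1).standard_normal((256, d))
    err = np.linalg.norm(quantize_roundtrip(kv, codebook, rot) - kv) / np.linalg.norm(kv)
    print(f"relative round-trip error: {err:.4f}")
```

Swapping the 32-level codebook for 16 or 8 levels in this sketch is one way to see qualitatively why the 4-bit and 3-bit variants degrade faster: the coarser the codebook, the more the reconstruction error grows when the rotated distribution is narrow.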
The Google TurboQuant paper reports results on Llama 3.1 8B with LongBench-E, which we did not run due to memory and time constraints on our 16 GB test machine: Q8_0 (8 GB) hit swap, and Q4_K_M (4.6 GB) was prohibitively slow (>50 min per fp32 measurement). This validation is deferred to a future session with more RAM; Section 7 (Reproducibility) provides the script for any reader who wants to run it.