# Progressive KV Cache Compression in a Single-Header C Engine: Independent Validation, Negative Results, and Practical Deployment

**Authors:** QuantumAI Research

**Abstract.** We report an independent empirical validation of recency-based KV cache compression — keeping the last 128 tokens at FP32 while compressing the rest to 4-bit — in a minimal, single-header C inference engine. On Llama 3.2 3B at 3,970 tokens, this achieves PPL −0.1% vs. FP32 at 3× compression and +13% speed. While the recency-window approach has been explored concurrently by KVTC [7] and PM-KVQ [8], we contribute: (1) a demonstration that RHT normalization eliminates the need for per-layer calibration (at most ~0.9% benefit from optimal per-layer allocation), (2) an honest negative result showing that 2-bit compression with a 512-token window, which appeared Pareto-dominant at 957 tokens (53.5% FP32), degrades to +36.7% PPL at longer evaluation lengths — an artifact we publicly retracted, (3) a context-length invariance measurement showing the same 3.2 pp improvement at both 957 tokens (13.4% FP32) and 3,970 tokens (3.2% FP32), and (4) a complete open-source implementation in 16K lines of C with zero dependencies, installable via `pip install quantcpp`. We also report an O(n log n) BPE tokenizer fix that was necessary to enable honest long-context evaluation.

---

## 1. Introduction

KV cache compression is an active area of research, with methods ranging from uniform quantization (llama.cpp Q4_0/Q8_0) and per-channel calibration (KIVI [1], KVQuant [2]) to attention-saliency-based adaptive precision (ZipCache [3]) and recent transform-coding approaches (KVTC [7]). A common finding across this literature is that **recent tokens require higher precision** — KVTC [7] keeps the 128 most recent tokens uncompressed, PM-KVQ [8] progressively lowers bit-width for older entries, and ZipCache [3] assigns more bits to attention-salient tokens.

We arrived at the same finding independently through a Karpathy-loop optimization process on quant.cpp, a single-header C inference engine. Rather than claiming novelty for the recency-window approach itself, we contribute the following:

### Contributions

1. **RHT eliminates per-layer calibration.** We show that after Random Hadamard Transform normalization, post-RHT kurtosis across the 28 layers of Llama 3.2 3B spans only 2.64–3.81 (mean 3.04, std 0.25). The theoretical maximum benefit from optimal per-layer bit allocation is ~0.9% PPL. This means the optimization landscape for KV compression is fundamentally **temporal** (which tokens), not **spatial** (which layers) — a finding that simplifies method design.

2. **Honest negative result.** We initially claimed that 2-bit + 512-token FP32 window "Pareto-dominates" flat 4-bit. This was measured at 957 tokens, where 53.5% of the tokens were FP32 — a misleading condition. At 3,970 tokens (12.9% FP32), 2-bit PPL degraded to +36.7%. We retracted the claim and report it here as a cautionary example of short-context evaluation artifacts.

3. **Context-length invariance.** We measure the quality improvement from the 128-token window at two scales:
   - 957 tokens (k128 = 13.4% FP32): +3.8% → +0.6% (improvement: 3.2 pp)
   - 3,970 tokens (k128 = 3.2% FP32): +3.1% → −0.1% (improvement: 3.2 pp)

   The same 3.2 percentage point improvement with a 4× smaller FP32 ratio suggests the result extends to arbitrary context lengths.

4. **Practical deployment.** The entire method — including RHT, Lloyd-Max codebooks, progressive window, infinite scrollback (automatic context shift), and KV cache persistence (save/load to disk) — is implemented in a single C header file (16K LOC, 654 KB) with zero dependencies, distributed via PyPI (`pip install quantcpp`) and as a 193 KB WASM binary.

---

## 2. Related Work

### 2.1 Uniform KV Quantization

llama.cpp offers Q4_0 and Q8_0 KV cache types with per-block scaling, achieving ~2× compression at +10.6% PPL (Q4_0). KIVI [1] applies asymmetric 2-bit quantization with per-channel key and per-token value precision. KVQuant [2] adds pre-RoPE quantization, non-uniform per-layer datatypes, and per-vector dense-and-sparse quantization, achieving under 0.1 PPL degradation at 3-bit.
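
For contrast with the transform-based path used later in the paper, the sketch below shows a simplified per-block absmax 4-bit quantizer in the spirit of these formats. The block size, rounding, and packing here are illustrative assumptions, not the actual llama.cpp storage layout.

```c
/* Simplified per-block 4-bit quantizer in the spirit of Q4_0-style formats.
 * Block size, rounding, and packing are illustrative, not llama.cpp's layout. */
#include <math.h>
#include <stdint.h>

#define QBLOCK 32                         /* values per block */

typedef struct {
    float   scale;                        /* one scale per block      */
    uint8_t q[QBLOCK / 2];                /* two 4-bit codes per byte */
} q4_block_t;

static void q4_quantize_block(const float *x, q4_block_t *b) {
    float amax = 0.0f;                    /* per-block absolute maximum */
    for (int i = 0; i < QBLOCK; i++)
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    b->scale = amax / 7.0f;               /* codes span [-7, 7] */
    float inv = (b->scale > 0.0f) ? 1.0f / b->scale : 0.0f;
    for (int i = 0; i < QBLOCK; i += 2) {
        int q0 = (int)lroundf(x[i]     * inv) + 8;   /* shift to [1, 15] */
        int q1 = (int)lroundf(x[i + 1] * inv) + 8;
        b->q[i / 2] = (uint8_t)((q0 & 0x0F) | ((q1 & 0x0F) << 4));
    }
}
```

Dequantization is the inverse, `(code - 8) * scale`, so quality hinges entirely on the per-block scale; the RHT-based path described in Section 3 instead reshapes the key distribution so that a single fixed 16-level codebook suffices.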

### 2.2 Non-Uniform Per-Token Precision

**ZipCache** [3] identifies attention-salient tokens and assigns them higher bit-width, achieving 4.98× compression at a 0.38% accuracy drop. It is the first per-token adaptive approach, using saliency (attention-score magnitude) rather than recency as the allocation criterion.

**KVTC** [7] (NVIDIA, ICLR 2026) keeps the 128 most recent tokens and 4 "attention sink" tokens uncompressed while applying PCA + entropy coding to the rest. This is structurally the closest prior work to our method — the 128-token recency window is identical.

**PM-KVQ** [8] (Tsinghua, 2025) designs a progressive quantization strategy that gradually lowers the bit-width of older KV cache entries with block-wise memory allocation.

**"More Tokens, Lower Precision"** [9] (EMNLP 2025) demonstrates that storing 4× more tokens at 4-bit outperforms 1× tokens at 16-bit, directly supporting the temporal compression thesis.

### 2.3 Transform-Based Normalization

**HIGGS** [4] introduces RHT + MSE-optimal grid quantization for weight compression. **TurboQuant** [5] applies the same pattern to KV caches with a 1-bit QJL residual. Our implementation builds on TurboQuant's RHT + Lloyd-Max structure, dropping the QJL residual after an ablation showed it contributed roughly zero to attention scores.

---

## 3. Method

### 3.1 Progressive KV Compression

We partition KV cache tokens into two tiers using a single parameter $W$:
- **Hot tier** (last $W$ tokens): Keys at FP32
- **Cold tier** (all other tokens): Keys at 4-bit (RHT + 16-level Lloyd-Max codebook)
- **Both tiers**: Values at FP16

For Llama 3.2 3B at $W = 128$, the hot tier adds only 14.7 MB (0.6% of the 32K-token cold-tier cache).
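
A minimal sketch of the key-side logic follows; the names (`kv_is_hot`, `quantize_cold_key`, `CODEBOOK`), the codebook values, and the scaling knob are illustrative assumptions rather than the actual quant.h API. Keys inside the hot window are stored as-is in FP32, while cold keys receive a random sign flip plus a fast Walsh-Hadamard transform (the RHT) and are then mapped to the nearest of 16 Lloyd-Max levels (values stay FP16 and are not shown).

```c
#include <math.h>
#include <stdint.h>
#include <stddef.h>

#define HOT_WINDOW 128                 /* W: last W tokens keep FP32 keys */

/* In-place fast Walsh-Hadamard transform; dim must be a power of two. */
static void fwht(float *x, int dim) {
    for (int h = 1; h < dim; h <<= 1)
        for (int i = 0; i < dim; i += h << 1)
            for (int j = i; j < i + h; j++) {
                float a = x[j], b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
    float s = 1.0f / sqrtf((float)dim);
    for (int i = 0; i < dim; i++) x[i] *= s;   /* orthonormal scaling */
}

/* Is token position `pos` still inside the hot window at decode step `n`? */
static inline int kv_is_hot(size_t pos, size_t n) {
    return n - pos < HOT_WINDOW;
}

/* Placeholder 16-level table (approximate Gaussian Lloyd-Max points); the
 * real codebook is fitted to the post-RHT key distribution. */
static const float CODEBOOK[16] = {
    -2.73f, -2.07f, -1.62f, -1.26f, -0.94f, -0.66f, -0.39f, -0.13f,
     0.13f,  0.39f,  0.66f,  0.94f,  1.26f,  1.62f,  2.07f,  2.73f
};

/* Quantize one cold-tier key vector: random sign flip + FWHT (the RHT),
 * then nearest-codebook index per dimension (4 bits each; packing two
 * indices per byte is omitted for clarity). */
static void quantize_cold_key(float *k, const int8_t *signs,
                              uint8_t *out, int dim, float inv_scale) {
    for (int i = 0; i < dim; i++) k[i] *= (float)signs[i];
    fwht(k, dim);
    for (int i = 0; i < dim; i++) {
        float v    = k[i] * inv_scale;         /* normalize to codebook scale */
        int   best = 0;
        float bd   = fabsf(v - CODEBOOK[0]);
        for (int c = 1; c < 16; c++) {
            float d = fabsf(v - CODEBOOK[c]);
            if (d < bd) { bd = d; best = c; }
        }
        out[i] = (uint8_t)best;
    }
}
```

Assuming the usual Llama 3.2 3B geometry of 28 layers with 8 KV heads of head dimension 128 (1,024 key dimensions per layer), the hot tier costs 128 × 28 × 1,024 × 4 bytes ≈ 14.7 MB, consistent with the figure above.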

### 3.2 Attention-Aligned Rationale

The total weighted quantization error is $E = \sum_t \alpha_t \cdot \text{MSE}(K_t, \hat{K}_t)$, where $\alpha_t$ is the attention weight at position $t$. Causal attention concentrates $\alpha$ on recent tokens, so allocating full precision to the high-$\alpha$ region minimizes $E$. This is the same rationale that underlies KVTC [7] and ZipCache's saliency-based allocation [3].
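
Splitting the sum at the window boundary makes the effect of the hot tier explicit; this is a back-of-the-envelope decomposition under the same error model, with $n$ the current position and $W$ the window size:

$$
E \;=\; \sum_{t \,\ge\, n-W} \alpha_t \cdot 0 \;+\; \sum_{t \,<\, n-W} \alpha_t \,\text{MSE}(K_t, \hat{K}_t)
\;\approx\; \overline{\text{MSE}}_{4\text{-bit}} \sum_{t \,<\, n-W} \alpha_t .
$$

The residual error is proportional to the attention mass falling outside the window, and causal attention keeps that mass small even when the window covers only 3.2% of tokens.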

Our contribution is not the rationale itself but the empirical measurement of its **context-length invariance** and the **RHT-based spatial analysis** that eliminates the per-layer dimension.

### 3.3 RHT Eliminates Per-Layer Variation

Post-RHT kurtosis across the 28 layers is nearly uniform: mean 3.04, std 0.25 (range 2.64–3.81), compared to a pre-RHT range of 4.13–20.62. The variance of $\log_2(\sigma)$ across layers is 0.0177, which bounds the theoretical maximum MSE improvement from optimal per-layer allocation at ~1.8%, or ~0.9% PPL. This is below measurement noise.
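
One way to recover the ~1.8% figure, assuming the standard high-resolution model $\text{MSE}_l \propto \sigma_l^2 \, 2^{-2b_l}$ and an approximately log-normal spread of per-layer scales, is the arithmetic-to-geometric-mean ratio of the per-layer variances:

$$
\frac{\text{MSE}_{\text{flat}}}{\text{MSE}_{\text{opt}}}
= \frac{\tfrac{1}{L}\sum_{l} \sigma_l^{2}}{\bigl(\prod_{l} \sigma_l^{2}\bigr)^{1/L}}
\approx \exp\!\Bigl(\tfrac{1}{2}\,\operatorname{Var}\bigl(\ln \sigma^{2}\bigr)\Bigr)
= \exp\!\bigl(2(\ln 2)^{2} \cdot 0.0177\bigr) \approx 1.017,
$$

i.e. at most roughly a 1.7–1.8% MSE reduction from ideal per-layer bit allocation; the ~0.9% PPL figure quoted above is the corresponding perplexity-level bound.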

**Implication:** Methods that invest complexity in per-layer calibration (KIVI, KVQuant) gain little benefit when RHT normalization is applied. The optimization landscape is purely temporal.

---

## 4. Experiments

- **Model:** Llama 3.2 3B Instruct (Q8_0 weights)
- **Hardware:** Apple M1 Pro, 16 GB RAM, 8 threads, CPU-only
- **Evaluation:** Teacher-forced perplexity on English text (957 tokens and 3,970 tokens)
- **Tokenizer:** Custom BPE with O(n log n) heap-based merge

### 4.1 Progressive Compression Quality

**At 3,970 tokens** (k128 = 3.2% FP32 — honest condition):

| Configuration | PPL | vs FP32 |
|---|---:|---:|
| FP32 (baseline) | 19.41 | — |
| **4-bit + k128** | **19.39** | **−0.1%** |
| 4-bit flat | 20.02 | +3.1% |
| 2-bit + k512 | 26.53 | +36.7% |

### 4.2 Context-Length Invariance

| Eval Length | k128 FP32 Ratio | Improvement vs Flat |
|---:|---:|---:|
| 957 tokens | 13.4% | 3.2 pp |
| 3,970 tokens | 3.2% | 3.2 pp |

### 4.3 Window Size Saturation

| $W$ | PPL (957 tok) | vs FP32 |
|---:|---:|---:|
| 0 | 14.08 | +3.8% |
| 64 | 13.71 | +1.1% |
| **128** | **13.64** | **+0.6%** |
| 256 | 13.64 | +0.6% |

### 4.4 Memory and Speed (32K context)

| Config | KV Memory | Speed |
|---|---:|---:|
| FP32 | 7.17 GB | 6.9 tok/s |
| 4-bit + k128 | 2.33 GB | 7.8 tok/s (+13%) |

### 4.5 Negative Result: 2-bit Compression

At 957 tokens, 2-bit + k512 showed PPL +4.3% (k512 = 53.5% FP32). We initially claimed this "Pareto-dominated" flat 4-bit. At 3,970 tokens (k512 = 12.9% FP32), PPL degraded to +36.7%.

**Root cause:** At 957 tokens, the 512-token FP32 window covered more than half the evaluation, masking the 2-bit degradation. This is a general hazard of evaluating KV compression at short context lengths with large FP32 windows.

### 4.6 Layer-Adaptive Analysis (Negative Result)

Post-RHT kurtosis variation is too small for per-layer adaptation to provide meaningful benefit (at most ~0.9% PPL). This is a positive finding for method simplicity.

---

## 5. Engineering Contributions

### 5.1 Single-Header Implementation

The complete method — RHT, Lloyd-Max codebooks, progressive window, infinite scrollback, KV persistence, NEON/AVX2 SIMD kernels — is implemented in `quant.h` (16K LOC, 654 KB) with zero dependencies beyond libc.

### 5.2 O(n log n) BPE Tokenizer
A naive BPE merge loop is O(n²): for GPT-style byte-level BPE, a 17K-character text produces ~17K initial tokens, making quadratic merging impractical (~289M operations). We implemented a max-heap with lazy deletion, reducing merge complexity to O(n log n). This fix was necessary to enable the 3,970-token evaluation that caught the 2-bit artifact.
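
A self-contained sketch of this merge loop is shown below, using a toy two-rule merge table; the helper names and the tiny vocabulary are illustrative, not the actual quant.h tokenizer. Each iteration pops the highest-priority surviving candidate, checks it against the current token list (lazy deletion), splices the pair, and pushes at most two new candidates, so total work stays O(n log n).

```c
#include <stdio.h>

#define N 8                                   /* toy input length */

typedef struct { int prio, left, l_tok, r_tok; } cand;

static cand heap[4 * N];                      /* at most ~3n candidates ever pushed */
static int  hsz = 0;

static void heap_push(cand c) {               /* max-heap on prio, sift-up */
    int i = hsz++;
    heap[i] = c;
    while (i && heap[(i - 1) / 2].prio < heap[i].prio) {
        cand t = heap[i]; heap[i] = heap[(i - 1) / 2]; heap[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
}

static cand heap_pop(void) {                  /* sift-down */
    cand top = heap[0];
    heap[0] = heap[--hsz];
    for (int i = 0;;) {
        int l = 2 * i + 1, r = 2 * i + 2, m = i;
        if (l < hsz && heap[l].prio > heap[m].prio) m = l;
        if (r < hsz && heap[r].prio > heap[m].prio) m = r;
        if (m == i) break;
        cand t = heap[i]; heap[i] = heap[m]; heap[m] = t;
        i = m;
    }
    return top;
}

/* Toy merge table: (1,2)->5 with priority 2, (5,3)->6 with priority 1.
 * A real BPE table keys on byte-pair ranks (lower rank = higher priority). */
static int rank_of(int a, int b, int *merged) {
    if (a == 1 && b == 2) { *merged = 5; return 2; }
    if (a == 5 && b == 3) { *merged = 6; return 1; }
    return -1;
}

static int tok[N] = {1, 2, 3, 1, 2, 3, 4, 4};
static int prev[N], next[N], alive[N];

static void push_pair(int l) {                /* enqueue the pair starting at l */
    int r = next[l], merged;
    if (r < 0) return;
    int p = rank_of(tok[l], tok[r], &merged);
    if (p >= 0) { cand c = {p, l, tok[l], tok[r]}; heap_push(c); }
}

int main(void) {
    for (int i = 0; i < N; i++) {             /* doubly linked token list */
        prev[i]  = i - 1;
        next[i]  = (i + 1 < N) ? i + 1 : -1;
        alive[i] = 1;
    }
    for (int i = 0; i < N; i++) push_pair(i); /* seed all adjacent pairs */

    while (hsz > 0) {
        cand c = heap_pop();
        int l = c.left, r = next[l];
        /* lazy deletion: skip entries invalidated by earlier merges */
        if (!alive[l] || r < 0 || !alive[r] ||
            tok[l] != c.l_tok || tok[r] != c.r_tok) continue;
        int merged;
        rank_of(tok[l], tok[r], &merged);     /* guaranteed valid here     */
        tok[l]   = merged;                    /* apply the merge           */
        alive[r] = 0;                         /* splice out the right node */
        next[l]  = next[r];
        if (next[r] >= 0) prev[next[r]] = l;
        if (prev[l] >= 0) push_pair(prev[l]); /* at most two new candidates */
        push_pair(l);
    }

    for (int i = 0; i < N; i++)
        if (alive[i]) printf("%d ", tok[i]);  /* prints: 6 6 4 4 */
    printf("\n");
    return 0;
}
```

In a real byte-level BPE the priority would be the negated merge rank (lower rank merges first) and the merge table a hash map over token pairs; the heap-plus-lazy-deletion structure is unchanged.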

### 5.3 Distribution

- **PyPI:** `pip install quantcpp` (pre-built wheels for Linux x86_64/aarch64, macOS arm64)
- **WASM:** 193 KB browser demo with IndexedDB model caching
- **Model registry:** Auto-download from HuggingFace (`Model.from_pretrained("Llama-3.2-1B")`)

---

## 6. Discussion

### 6.1 Relationship to KVTC

KVTC [7] uses the same 128-token sliding window but adds PCA dimensionality reduction and entropy coding for the compressed region. Our approach is simpler (binary FP32/4-bit) and achieves comparable quality. We view this as convergent evidence that recency-based precision allocation is a robust principle.

### 6.2 Self-Correction as Methodology

Our project maintains a public correction log (10 self-found corrections, 0 external reports). The 2-bit Pareto claim (#10) was caught by our own evaluation infrastructure improvements (BPE fix → longer eval → honest measurement). We believe systematic self-validation — measuring, doubting, re-measuring at harder conditions — is as important as the algorithmic contribution.

### 6.3 Limitations

1. Single model (Llama 3.2 3B). Multi-model validation needed.
2. CPU-only speed measurements. GPU behavior may differ.
3. Maximum evaluated context: 3,970 tokens. 32K+ validation pending.
4. V cache not progressively compressed (FP16 throughout).

---

## 7. Conclusion

We provide independent empirical evidence that recency-based KV cache precision allocation — keeping 128 recent tokens at FP32 — achieves FP32 quality at 3× compression, confirming findings from KVTC [7] and PM-KVQ [8] in a simpler setting. Our additional contributions — the RHT spatial analysis, the retracted 2-bit result, the context-length invariance measurement, and the single-header open-source implementation — complement the existing literature with practical validation and honest methodology.

---

## References

[1] Z. Liu et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024. arXiv:2402.02750.

[2] C. Hooper et al. "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization." NeurIPS 2024. arXiv:2401.18079.

[3] Y. He et al. "ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification." NeurIPS 2024. arXiv:2405.14256.

[4] V. Malinovskii et al. "HIGGS: Pushing the Limits of Large Language Model Quantization via the Linearity Theorem." NAACL 2025. arXiv:2411.17525.

[5] A. Zandieh et al. "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate." ICLR 2026. arXiv:2504.19874.

[6] G. Xiao et al. "Efficient Streaming Language Models with Attention Sinks." ICLR 2024. arXiv:2309.17453.

[7] KVTC. "KV Cache Transform Coding." ICLR 2026. arXiv:2511.01815.

[8] Liu et al. "PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs." 2025. arXiv:2505.18610.

[9] "More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression." EMNLP 2025. arXiv:2412.12706.

---

**Reproducibility.** All code: [github.com/quantumaikr/quant.cpp](https://github.com/quantumaikr/quant.cpp). Install: `pip install quantcpp`. Benchmark artifacts in `bench/results/`.

**Correction log.** 10 self-corrections documented in [CHANGELOG.md](https://github.com/quantumaikr/quant.cpp/blob/main/CHANGELOG.md). Correction #10 (2-bit Pareto retraction) is discussed in Section 4.5.