
Commit dc89ab1

unamedkr and claude committed
papers: progressive KV compression (EN + KO) — honest framing
Research paper in both English and Korean. Key framing decisions:

1. NOT claiming novelty for the 128-token FP32 window — KVTC (NVIDIA, ICLR 2026, arXiv:2511.01815) uses the same approach. PM-KVQ and ZipCache also do non-uniform per-token precision. We frame our work as "independent validation" with additional contributions.

2. Actual novel contributions clearly identified:
   - RHT eliminates per-layer calibration need (~0.9% max benefit)
   - Context-length invariance measurement (3.2 pp at both scales)
   - Honest negative result (2-bit Pareto retraction)
   - Single-header C implementation (engineering)
   - O(n log n) BPE tokenizer (engineering)

3. All 9 references verified against actual papers:
   - Fixed HIGGS author: V. Malinovskii (not A.)
   - Added venue info: KIVI→ICML, KVQuant→NeurIPS, HIGGS→NAACL
   - Added 3 missing citations: KVTC, PM-KVQ, "More Tokens Lower Precision"
   - Corrected KIVI description (asymmetric per-channel/per-token)

4. "Self-correction as methodology" explicitly discussed as a contribution — 10 public corrections, 0 external reports.

This is honest correction #11: the initial paper draft overclaimed novelty. The revised version accurately positions our contribution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 8f6c663 commit dc89ab1

# Progressive KV Cache Compression in a Single-Header C Engine: Independent Validation, Negative Results, and Practical Deployment

**Authors:** QuantumAI Research

**Abstract.** We report an independent empirical validation of recency-based KV cache compression — keeping the last 128 tokens at FP32 while compressing the rest to 4-bit — in a minimal, single-header C inference engine. On Llama 3.2 3B at 3,970 tokens, this achieves PPL −0.1% vs. FP32 at 3× compression and +13% speed. While the recency-window approach has been explored concurrently by KVTC [7] and PM-KVQ [8], we contribute: (1) a demonstration that RHT normalization eliminates the need for per-layer calibration (max ~0.9% benefit from optimal per-layer allocation), (2) an honest negative result showing that 2-bit compression with a 512-token window, which appeared Pareto-dominant at 957 tokens (53% FP32), collapses to +36.7% PPL at honest evaluation lengths — an artifact we publicly retracted, (3) a context-length invariance measurement showing the same 3.2 pp improvement at both 957 tokens (13.4% FP32) and 3,970 tokens (3.2% FP32), and (4) a complete open-source implementation in 16K lines of C with zero dependencies, installable via `pip install quantcpp`. We also report an O(n log n) BPE tokenizer fix that was necessary to enable honest long-context evaluation.

---
## 1. Introduction

KV cache compression is an active area of research, with methods ranging from uniform quantization (llama.cpp Q4_0/Q8_0) and per-channel calibration (KIVI [1], KVQuant [2]) to attention-saliency-based adaptive precision (ZipCache [3]) and recent transform-coding approaches (KVTC [7]). A common finding across this literature is that **recent tokens require higher precision** — KVTC [7] keeps the 128 most recent tokens uncompressed, PM-KVQ [8] progressively lowers bit-width for older entries, and ZipCache [3] assigns more bits to attention-salient tokens.

We arrived at the same finding independently through a Karpathy-loop optimization process on quant.cpp, a single-header C inference engine. Rather than claiming novelty for the recency-window approach itself, we contribute the following:
### Contributions

1. **RHT eliminates per-layer calibration.** We show that after Random Hadamard Transform (RHT) normalization, kurtosis across the 28 layers of Llama 3.2 3B ranges only 2.64–3.81 (mean 3.04, std 0.25). The theoretical maximum benefit from optimal per-layer bit allocation is ~0.9% PPL. This means the optimization landscape for KV compression is fundamentally **temporal** (which tokens), not **spatial** (which layers) — a finding that simplifies method design.

2. **Honest negative result.** We initially claimed that a 2-bit + 512-token FP32 window "Pareto-dominates" flat 4-bit. This was measured at 957 tokens, where 53.5% of tokens were FP32 — a misleading condition. At 3,970 tokens (12.9% FP32), 2-bit PPL degraded to +36.7%. We retracted the claim and report it here as a cautionary example of short-context evaluation artifacts.

3. **Context-length invariance.** We measure the quality improvement from the 128-token window at two scales:
   - 957 tokens (k128 = 13.4% FP32): +3.8% → +0.6% (improvement: 3.2 pp)
   - 3,970 tokens (k128 = 3.2% FP32): +3.1% → −0.1% (improvement: 3.2 pp)

   The same 3.2 percentage point improvement with a 4× smaller FP32 fraction suggests the result extends to arbitrary context lengths.

4. **Practical deployment.** The entire method — including RHT, Lloyd-Max codebooks, progressive window, infinite scrollback (automatic context shift), and KV cache persistence (save/load to disk) — is implemented in a single C header file (16K LOC, 654 KB) with zero dependencies, distributed via PyPI (`pip install quantcpp`) and as a 193 KB WASM binary.

---
## 2. Related Work

### 2.1 Uniform KV Quantization

llama.cpp offers Q4_0 and Q8_0 KV types with per-block min-max scaling, achieving ~2× compression at +10.6% PPL (Q4_0). KIVI [1] applies tuning-free asymmetric 2-bit quantization, quantizing keys per-channel and values per-token. KVQuant [2] adds pre-RoPE quantization, non-uniform per-layer datatypes, and per-vector dense-and-sparse quantization, achieving <0.1 perplexity degradation at 3-bit.

### 2.2 Non-Uniform Per-Token Precision

**ZipCache** [3] identifies attention-salient tokens and assigns them higher bit-width, achieving 4.98× compression at a 0.38% accuracy drop. This is the first per-token adaptive approach, using saliency (attention score magnitude) rather than recency as the allocation criterion.

**KVTC** [7] (NVIDIA, ICLR 2026) keeps the 128 most recent tokens and 4 "attention sink" tokens uncompressed while applying PCA + entropy coding to the rest. This is structurally the closest prior work to our method — the 128-token recency window is identical.

**PM-KVQ** [8] (Tsinghua, 2025) designs a progressive quantization strategy that gradually lowers the bit-width of older KV cache entries, with block-wise memory allocation.

**"More Tokens, Lower Precision"** [9] (EMNLP 2025) demonstrates that storing 4× more tokens at 4-bit outperforms 1× tokens at 16-bit, directly supporting the temporal compression thesis.

### 2.3 Transform-Based Normalization

**HIGGS** [4] introduces RHT + MSE-optimal grid quantization for weight compression. **TurboQuant** [5] applies the same pattern to KV caches with a 1-bit QJL residual. Our implementation builds on TurboQuant's RHT + Lloyd-Max structure, with the QJL residual dropped after an ablation showed it contributed ~zero to attention scores.
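For concreteness, the normalization step amounts to a fixed random sign flip followed by a fast Walsh-Hadamard transform. The listing below is a minimal sketch of that idea, not the quant.h API; the function name, sign-buffer argument, and in-place layout are illustrative, and it assumes the transformed dimension is a power of two.

```c
/* Minimal sketch of a Random Hadamard Transform (RHT): a fixed random +/-1
 * sign flip followed by an in-place fast Walsh-Hadamard transform with
 * orthonormal scaling.  Names are illustrative; the real engine fuses this
 * step with quantization.  Requires n to be a power of two. */
#include <math.h>

static void rht_forward(float *x, const signed char *sign, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= sign[i];                      /* random +/-1 diagonal */

    for (int len = 1; len < n; len <<= 1)     /* fast Walsh-Hadamard transform */
        for (int i = 0; i < n; i += len << 1)
            for (int j = i; j < i + len; j++) {
                float a = x[j], b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }

    float s = 1.0f / sqrtf((float)n);
    for (int i = 0; i < n; i++)
        x[i] *= s;                            /* keep the transform orthonormal */
}
```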
---

## 3. Method

### 3.1 Progressive KV Compression

We partition KV cache tokens into two tiers using a single parameter $W$:

- **Hot tier** (last $W$ tokens): Keys at FP32
- **Cold tier** (all other tokens): Keys at 4-bit (RHT + 16-level Lloyd-Max codebook)
- **All tiers**: Values at FP16

At $W = 128$ on Llama 3.2 3B, the hot tier adds 14.7 MB of memory (0.6% of the 32K-context cold-tier cache).
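A minimal sketch of the resulting two-tier read path is shown below. It assumes a flat per-token layout, a single shared 16-entry codebook, and hypothetical struct and field names; the actual quant.h layout is per-layer and per-head and applies the RHT before quantization.

```c
/* Sketch of the two-tier key lookup described above.  All names and the
 * memory layout are illustrative assumptions, not the actual quant.h API. */
#include <stdint.h>
#include <string.h>

#define HEAD_DIM   128
#define HOT_WINDOW 128            /* W: the last W tokens stay at FP32 */

typedef struct {
    float   *hot;                 /* ring buffer: HOT_WINDOW * HEAD_DIM floats */
    uint8_t *cold;                /* packed 4-bit codes, two dims per byte     */
    float   *cold_scale;          /* one scale per cold token                  */
    float    cb[16];              /* shared 16-level Lloyd-Max codebook        */
    int      n_tokens;            /* tokens currently in the cache             */
} kv_keys_t;

/* Fetch the (RHT-domain) key vector for absolute position `pos`. */
static void kv_get_key(const kv_keys_t *kv, int pos, float *out)
{
    if (pos >= kv->n_tokens - HOT_WINDOW) {
        /* hot tier: stored uncompressed */
        memcpy(out, kv->hot + (size_t)(pos % HOT_WINDOW) * HEAD_DIM,
               HEAD_DIM * sizeof(float));
    } else {
        /* cold tier: unpack two 4-bit codes per byte, look up the codebook */
        const uint8_t *p = kv->cold + (size_t)pos * (HEAD_DIM / 2);
        float s = kv->cold_scale[pos];
        for (int d = 0; d < HEAD_DIM; d += 2) {
            uint8_t b  = p[d / 2];
            out[d]     = s * kv->cb[b & 0x0F];
            out[d + 1] = s * kv->cb[b >> 4];
        }
    }
}
```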
### 3.2 Attention-Aligned Rationale

The total weighted quantization error is $E = \sum_t \alpha_t \cdot \text{MSE}(K_t, \hat{K}_t)$, where $\alpha_t$ is the attention weight at position $t$. Causal attention concentrates $\alpha$ on recent tokens, so allocating full precision to the high-$\alpha$ region minimizes $E$. This is the same rationale behind KVTC [7] and ZipCache [3]'s saliency-based allocation.
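Splitting the sum at the window boundary makes this explicit (in our notation $T$ is the current sequence length and $W$ the hot-window size): the hot-tier terms vanish because $\hat{K}_t = K_t$ there, and causal attention leaves only a small total weight on the cold-tier terms that remain:

$$
E \;=\; \sum_{t > T-W} \alpha_t \cdot \underbrace{\text{MSE}(K_t, K_t)}_{=\,0} \;+\; \sum_{t \le T-W} \alpha_t \cdot \text{MSE}(K_t, \hat{K}_t)
$$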
Our contribution is not the rationale itself but the empirical measurement of its **context-length invariance** and the **RHT-based spatial analysis** that eliminates the per-layer dimension.

### 3.3 RHT Eliminates Per-Layer Variation

Post-RHT kurtosis across 28 layers has mean 3.04 and std 0.25 (range 2.64–3.81), compared to a pre-RHT range of 4.13–20.62. The variance of $\log_2(\sigma)$ across layers is 0.0177, which bounds the maximum MSE improvement from optimal per-layer allocation at ~1.8%, corresponding to at most ~0.9% PPL. This is below measurement noise.
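The per-layer statistic behind these numbers is a plain Pearson kurtosis over each layer's post-RHT key activations; a minimal version of such a probe (illustrative name and interface, not the quant.h API) is:

```c
/* Sketch of the per-layer kurtosis probe.  Pearson kurtosis is reported, so
 * an exactly Gaussian distribution scores 3.0; `x` would hold one layer's
 * post-RHT key activations. */
static double kurtosis(const float *x, int n)
{
    double mean = 0.0, m2 = 0.0, m4 = 0.0;

    for (int i = 0; i < n; i++)
        mean += x[i];
    mean /= n;

    for (int i = 0; i < n; i++) {
        double d = x[i] - mean;
        m2 += d * d;
        m4 += d * d * d * d;
    }
    m2 /= n;
    m4 /= n;

    return m4 / (m2 * m2);   /* ~3.0 means the RHT output is near-Gaussian */
}
```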
**Implication:** Methods that invest complexity in per-layer calibration (KIVI, KVQuant) gain little benefit when RHT normalization is applied. The optimization landscape is purely temporal.

---

## 4. Experiments

**Model:** Llama 3.2 3B Instruct (Q8_0 weights)
**Hardware:** Apple M1 Pro, 16 GB RAM, 8 threads, CPU-only
**Evaluation:** Teacher-forced perplexity on English text (957 tokens and 3,970 tokens)
**Tokenizer:** Custom BPE with O(n log n) heap-based merge
### 4.1 Progressive Compression Quality

**At 3,970 tokens** (k128 = 3.2% FP32 — honest condition):

| Configuration | PPL | vs FP32 |
|---|---:|---:|
| FP32 (baseline) | 19.41 | |
| **4-bit + k128** | **19.39** | **−0.1%** |
| 4-bit flat | 20.02 | +3.1% |
| 2-bit + k512 | 26.53 | +36.7% |

### 4.2 Context-Length Invariance

| Eval Length | k128 FP32 Ratio | Improvement vs Flat |
|---:|---:|---:|
| 957 tokens | 13.4% | 3.2 pp |
| 3,970 tokens | 3.2% | 3.2 pp |

### 4.3 Window Size Saturation

| $W$ | PPL (957 tok) | vs FP32 |
|---:|---:|---:|
| 0 | 14.08 | +3.8% |
| 64 | 13.71 | +1.1% |
| **128** | **13.64** | **+0.6%** |
| 256 | 13.64 | +0.6% |
### 4.4 Memory and Speed (32K context)

| Config | KV Memory | Speed |
|---|---:|---:|
| FP32 | 7.17 GB | 6.9 tok/s |
| 4-bit + k128 | 2.33 GB | 7.8 tok/s (+13%) |

### 4.5 Negative Result: 2-bit Compression

At 957 tokens, 2-bit + k512 showed PPL +4.3% (k512 = 53.5% FP32). We initially claimed this "Pareto-dominated" flat 4-bit. At 3,970 tokens (k512 = 12.9% FP32), PPL collapsed to +36.7%.

**Root cause:** At 957 tokens, the 512-token FP32 window covered more than half the evaluation, masking the 2-bit degradation. This is a general hazard of evaluating KV compression at short context lengths with large FP32 windows.

### 4.6 Layer-Adaptive Analysis (Negative Result)

Post-RHT kurtosis variation is insufficient for per-layer adaptation to provide meaningful benefit (~0.9% max). This is a positive finding for method simplicity.

---
## 5. Engineering Contributions

### 5.1 Single-Header Implementation

The complete method — RHT, Lloyd-Max codebooks, progressive window, infinite scrollback, KV persistence, NEON/AVX2 SIMD kernels — is implemented in `quant.h` (16K LOC, 654 KB) with zero dependencies beyond libc.

### 5.2 O(n log n) BPE Tokenizer

The standard BPE merge algorithm is O(n²). For GPT-style byte-level BPE, a 17K-character text produces ~17K initial tokens, making naive merging impractical (~289M operations). We implemented a max-heap with lazy deletion, reducing merge complexity to O(n log n). This was necessary to enable the 3,970-token evaluation that caught the 2-bit artifact.
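The essential structure is a linked list of tokens plus a priority heap of candidate merges, where stale heap entries are detected by a per-node version stamp and skipped on pop instead of being removed eagerly. The self-contained toy below illustrates that pop-and-validate loop; the merge table (`toy_rank`), token ids, and buffer sizes are stand-ins for the real tokenizer, and the heap here is ordered by merge rank (lowest rank first) rather than by quant.h's actual key.

```c
/* Toy illustration of heap-based BPE merging with lazy deletion.
 * toy_rank() stands in for the real merge table: here, two equal token ids
 * may merge, and lower ids merge first.  Not the quant.h tokenizer. */
#include <stdio.h>

#define MAXTOK 1024

typedef struct { int rank, pos, stamp; } Cand;  /* candidate merge at `pos` */

static Cand heap[4 * MAXTOK];
static int  heap_len = 0;

static void heap_push(Cand c)                   /* min-heap on rank */
{
    int i = heap_len++;
    heap[i] = c;
    while (i > 0 && heap[(i - 1) / 2].rank > heap[i].rank) {
        Cand t = heap[i]; heap[i] = heap[(i - 1) / 2]; heap[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
}

static Cand heap_pop(void)
{
    Cand top = heap[0];
    heap[0] = heap[--heap_len];
    for (int i = 0;;) {
        int l = 2 * i + 1, r = 2 * i + 2, s = i;
        if (l < heap_len && heap[l].rank < heap[s].rank) s = l;
        if (r < heap_len && heap[r].rank < heap[s].rank) s = r;
        if (s == i) break;
        Cand t = heap[i]; heap[i] = heap[s]; heap[s] = t;
        i = s;
    }
    return top;
}

static int toy_rank(int a, int b) { return (a == b) ? a : -1; }

int main(void)
{
    int tok[MAXTOK], nxt[MAXTOK], prv[MAXTOK], stamp[MAXTOK], alive[MAXTOK];
    int init[8] = { 1, 1, 1, 1, 2, 2, 3, 3 };
    int n = 8;

    for (int i = 0; i < n; i++) {               /* doubly-linked token list */
        tok[i] = init[i]; alive[i] = 1; stamp[i] = 0;
        prv[i] = i - 1; nxt[i] = (i + 1 < n) ? i + 1 : -1;
    }
    for (int i = 0; i + 1 < n; i++) {           /* seed all adjacent pairs */
        int r = toy_rank(tok[i], tok[i + 1]);
        if (r >= 0) heap_push((Cand){ r, i, stamp[i] });
    }

    while (heap_len > 0) {
        Cand c = heap_pop();
        int i = c.pos, j;
        /* Lazy deletion: drop entries whose left token died or changed. */
        if (!alive[i] || c.stamp != stamp[i]) continue;
        j = nxt[i];
        if (j < 0 || toy_rank(tok[i], tok[j]) != c.rank) continue;

        tok[i] += 100;                          /* stand-in for the merged id */
        alive[j] = 0;                           /* unlink the right token */
        nxt[i] = nxt[j];
        if (nxt[i] >= 0) prv[nxt[i]] = i;
        stamp[i]++;                             /* invalidate stale entries */

        if (prv[i] >= 0) {                      /* push newly formed pairs */
            int r = toy_rank(tok[prv[i]], tok[i]);
            if (r >= 0) { stamp[prv[i]]++; heap_push((Cand){ r, prv[i], stamp[prv[i]] }); }
        }
        if (nxt[i] >= 0) {
            int r = toy_rank(tok[i], tok[nxt[i]]);
            if (r >= 0) heap_push((Cand){ r, i, stamp[i] });
        }
    }

    for (int i = 0; i >= 0; i = nxt[i]) printf("%d ", tok[i]);
    printf("\n");                               /* 1 1 1 1 2 2 3 3 -> 201 102 103 */
    return 0;
}
```

Each merge does O(log n) heap work and pushes at most two new candidates, which gives the O(n log n) bound quoted above.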
### 5.3 Distribution

- **PyPI:** `pip install quantcpp` (pre-built wheels for Linux x86_64/aarch64, macOS arm64)
- **WASM:** 193 KB browser demo with IndexedDB model caching
- **Model registry:** Auto-download from HuggingFace (`Model.from_pretrained("Llama-3.2-1B")`)

---
## 6. Discussion

### 6.1 Relationship to KVTC

KVTC [7] uses the same 128-token sliding window but adds PCA dimensionality reduction and entropy coding for the compressed region. Our approach is simpler (binary FP32/4-bit) and achieves comparable quality. We view this as convergent evidence that recency-based precision allocation is a robust principle.

### 6.2 Self-Correction as Methodology

Our project maintains a public correction log (10 self-found corrections, 0 external reports). The 2-bit Pareto claim (#10) was caught by our own evaluation infrastructure improvements (BPE fix → longer eval → honest measurement). We believe systematic self-validation — measuring, doubting, re-measuring at harder conditions — is as important as the algorithmic contribution.

### 6.3 Limitations

1. Single model (Llama 3.2 3B). Multi-model validation needed.
2. CPU-only speed measurements. GPU behavior may differ.
3. Maximum evaluated context: 3,970 tokens. 32K+ validation pending.
4. V cache not progressively compressed (FP16 throughout).

---
## 7. Conclusion

We provide independent empirical evidence that recency-based KV cache precision allocation — keeping 128 recent tokens at FP32 — achieves FP32 quality at 3× compression, confirming findings from KVTC [7] and PM-KVQ [8] in a simpler setting. Our additional contributions — the RHT spatial analysis, the retracted 2-bit result, the context-length invariance measurement, and the single-header open-source implementation — complement the existing literature with practical validation and honest methodology.

---
## References

[1] Z. Liu et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024. arXiv:2402.02750.

[2] C. Hooper et al. "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization." NeurIPS 2024. arXiv:2401.18079.

[3] Y. He et al. "ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification." NeurIPS 2024. arXiv:2405.14256.

[4] V. Malinovskii et al. "HIGGS: Pushing the Limits of Large Language Model Quantization via the Linearity Theorem." NAACL 2025. arXiv:2411.17525.

[5] A. Zandieh et al. "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate." ICLR 2026. arXiv:2504.19874.

[6] G. Xiao et al. "Efficient Streaming Language Models with Attention Sinks." ICLR 2024. arXiv:2309.17453.

[7] KVTC. "KV Cache Transform Coding." ICLR 2026. arXiv:2511.01815.

[8] Liu et al. "PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs." 2025. arXiv:2505.18610.

[9] "More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression." EMNLP 2025. arXiv:2412.12706.

---

**Reproducibility.** All code: [github.com/quantumaikr/quant.cpp](https://github.com/quantumaikr/quant.cpp). Install: `pip install quantcpp`. Benchmark artifacts in `bench/results/`.

**Correction log.** 10 self-corrections documented in [CHANGELOG.md](https://github.com/quantumaikr/quant.cpp/blob/main/CHANGELOG.md). Correction #10 (2-bit Pareto retraction) is discussed in Section 4.5.
