# quant.cpp: Practical KV Cache Compression in 67K Lines of C

## Abstract

We present quant.cpp, a minimal LLM inference engine that achieves 6.9x KV cache compression with zero perplexity degradation. The engine is implemented in 67K lines of C11 with no external dependencies, and ships as a 15K-line single-header library (quant.h) embeddable in any C project. We implement seven quantization algorithms for KV cache compression, including PolarQuant, QJL, and a novel delta compression scheme that enables 3-bit key quantization at only +1.3% PPL. On a 16GB Mac, quant.cpp extends context length from 50K to 350K tokens for Llama 3.2 3B, and from 4K to 30K tokens for Gemma 4 26B-A4B (128-expert MoE). We describe the architecture, quantization plugin system, and lessons learned from GPU acceleration experiments on Apple Silicon.

## 1. Introduction

Large language model inference is increasingly memory-bound. At 32K context length, an 8B model's FP16 KV cache consumes 4GB, comparable to the 4-bit-quantized model weights themselves. While weight quantization (Q4, Q8) is well-studied, KV cache compression receives less attention despite dominating memory usage at long contexts.

Existing KV cache quantization in production engines (llama.cpp Q4_0) introduces +10.6% perplexity degradation — noticeable quality loss. We show that type-aware independent K/V quantization achieves +0.0% degradation at the same bit budget.

quant.cpp is designed around three principles:
1. **Readable**: The full transformer forward pass fits in one file (tq_transformer.c, 2500 lines).
2. **Embeddable**: The single-header quant.h (15K lines) compiles with `cc app.c -lm -lpthread`.
3. **Extensible**: Adding a new quantization type requires implementing three functions and registering them in a trait table.

## 2. Architecture

### 2.1 Quantization Plugin System

Each KV quantization type is defined by a trait struct:

```c
typedef struct {
    const char* name;
    int block_size;       // elements per block (typically 128)
    size_t type_size;     // bytes per block
    void (*quantize)(const float* src, void* dst, int n);
    void (*dequantize)(const void* src, float* dst, int n);
    void (*attention)(const float* q, const void* kv, float* scores, int seq, int dim);
} tq_type_traits_t;
```

Seven types are implemented:

| Type | Bits | Algorithm | Block Size | PPL vs FP32 |
|------|------|-----------|------------|-------------|
| TQ_UNIFORM_4B | 4 | Min-max | 128 | +0.0% |
| TQ_UNIFORM_2B | 2 | Min-max | 128 | varies |
| TQ_POLAR_3B | 3 | PolarQuant | 128 | +0.8% |
| TQ_POLAR_4B | 4 | PolarQuant | 128 | +0.0% |
| TQ_QJL_1B | 1 | QJL sign hash | 256 | +3.2% |
| TQ_TURBO_3B | 3 | Polar 2b + QJL 1b | 128 | +1.0% |
| TQ_TURBO_4B | 4 | Polar 3b + QJL 1b | 128 | +0.0% |

### 2.2 Delta Compression

Standard KV caching stores each key vector independently. We observe that adjacent key vectors (positions t and t-1) differ by ~30% of their absolute range. Delta mode stores `key[t] - reconstruct(key[t-1])`, reducing the dynamic range and enabling 3-bit quantization.

Every 64 tokens, an FP32 I-frame is stored (like video compression) to bound drift accumulation. This yields 3-bit compression at +1.3% PPL, compared to +62% without delta encoding.

### 2.3 QK-Norm Aware Compression

Models with QK-norm (Gemma 4) normalize key vectors to the unit sphere, creating extremely sparse distributions (256 dimensions, ~56 active). We find that 4-bit quantization achieves only 0.62 cosine similarity on QK-normed keys — destroying directional information.

Our solution: auto-detect QK-normed models and store keys in FP32 while quantizing only values to Q4. This preserves perfect key precision with 3.5x value memory reduction.

## 3. Supported Architectures

quant.cpp supports the following model architectures:
- **Llama 3** (GQA, standard RoPE)
- **Qwen 3.5** (DeltaNet hybrid attention)
- **Gemma 3/4** (sliding + full attention, 4 norms/layer)
- **Gemma 4 MoE** (128 experts, dual-FFN, learned RoPE, GeGLU)
- **Qwen MoE** (256 experts, shared expert)

The Gemma 4 26B-A4B-it implementation required solving 10 architecture-specific issues, including dual-FFN parallel execution, layer_output_scale semantics, and attention_scale=1.0 for QK-normed models.

## 4. GPU Acceleration Experiments

We conducted extensive Metal GPU experiments on Apple M1 Pro:

| Approach | SmolLM2 135M | vs CPU |
|----------|-------------|--------|
| CPU NEON Q4×Q8 fused dot | 96 tok/s | 1.0x |
| Per-matmul Metal dispatch | 38 tok/s | 0.4x |
| 2-commit GPU graph | 18 tok/s | 0.2x |
| 1-commit GPU graph | 22 tok/s | 0.2x |
| + Weight repacking | 27 tok/s | 0.3x |
| + uint16 mask kernel | 27 tok/s | 0.3x |

**Finding**: For batch-1 token generation on Apple Silicon unified memory, CPU NEON saturates memory bandwidth more efficiently than GPU due to command buffer dispatch overhead (~0.3ms per commit). GPU acceleration requires a tensor graph IR (like ggml) that compiles the entire forward pass into a single GPU dispatch — effectively building a GPU inference framework from scratch.

## 5. Performance

### 5.1 Speed

| Model | Params | tok/s (M1 Pro) |
|-------|--------|---------------|
| SmolLM2 135M | 135M | 96 |
| Llama 3.2 3B | 3B | 17 |
| Gemma 4 26B-A4B | 26B (4B active) | 3.9 |

### 5.2 KV Compression Quality

WikiText-2 PPL on SmolLM2 1.7B:

| Config | PPL | vs FP32 | Compression |
|--------|-----|---------|-------------|
| FP32 baseline | 14.63 | — | 1.0x |
| 4b K + FP16 V | 14.63 | +0.00% | 1.6x |
| 4b K + Q4 V | 14.57 | -0.4% | 6.9x |
| Delta 3b K + Q4 V | 14.82 | +1.3% | 8.5x |
| llama.cpp Q4_0 KV | 16.18 | +10.6% | 3.8x |

### 5.3 Context Extension

On 16GB Mac M1 Pro:

| Model | FP16 KV | quant.cpp KV | Gain |
|-------|---------|-------------|------|
| Llama 3.2 3B | 50K tokens | 350K tokens | 6.9x |
| Gemma 4 26B MoE | 4K tokens | 30K tokens | 6.9x |

## 6. Related Work

- **TurboQuant** (Zandieh et al., ICLR 2026): KV cache compression theory
- **QJL** (AAAI 2025): Quantized Johnson-Lindenstrauss transform
- **PolarQuant** (AISTATS 2026): Polar coordinate quantization
- **llama.cpp**: Production inference engine with Q4 KV quantization
- **llm.c** (Karpathy): Minimal C training/inference, educational focus

## 7. Conclusion

quant.cpp demonstrates that practical KV cache compression is achievable in a minimal, embeddable codebase. The key insight is that independent K/V quantization with type-aware methods eliminates the quality degradation seen in uniform approaches. The project serves as both a production-ready library for embedding LLM inference in applications and a research platform for experimenting with new quantization algorithms.

Code: https://github.com/quantumaikr/quant.cpp