 <img src="docs/assets/hero.png" alt="quant.cpp" width="600">
 </p>
 
-<h3 align="center">LLM inference with 7x longer context — pure C, zero dependencies</h3>
+<h3 align="center">The single-header C reference engine for KV cache quantization research</h3>
 
 <p align="center">
-  Lossless KV cache compression. Also ships as <a href="#-single-header-mode"><b>quant.h</b></a> — a single-header library.<br>
-  72K LOC. Embeddable. Read it in an afternoon.
+  Implements <a href="https://arxiv.org/abs/2504.19874"><b>TurboQuant</b></a> (ICLR 2026), <a href="https://arxiv.org/abs/2502.02617">PolarQuant</a>, <a href="https://arxiv.org/abs/2406.03482">QJL</a>, and 4 other KV quantization schemes.<br>
+  72K LOC pure C, zero dependencies. Ships as <a href="#-single-header-mode"><b>quant.h</b></a> — drop one file into any project.<br>
+  Runs everywhere a C compiler does: <b>iOS · Android · WASM · MSVC · microcontrollers</b>.
 </p>
 
 <p align="center">
@@ -39,14 +40,30 @@ LLM memory is dominated by the **KV cache**, not model weights. At 32K context, |
 
 ## The Result
 
-> **Same hardware. 7x longer context. Zero quality loss.**
+> **Same hardware. 4–7x longer context. Perplexity verified at +0.0%.**
 
-| Hardware | Model | FP16 KV | quant.cpp KV | Gain |
-|:---------|:------|--------:|-------------:|-----:|
-| 16GB Mac | Llama 3.2 3B | 50K tokens | **350K tokens** | **6.9x** |
-| 16GB Mac | Gemma 4 26B MoE | 4K tokens | **30K tokens** | **6.9x** |
-| 8GB Laptop | Llama 8B (Q4) | 16K tokens | **61K tokens** | **3.8x** |
-| 24GB RTX 3090 | Llama 8B (Q4) | 147K tokens | **559K tokens** | **3.8x** |
+| Hardware | Model | FP16 KV | quant.cpp KV | Gain | PPL Δ |
+|:---------|:------|--------:|-------------:|-----:|------:|
+| 16GB Mac | Llama 3.2 3B | 50K tokens | **350K tokens** | **6.9x** | +0.0% |
+| 16GB Mac | Gemma 4 26B MoE | 4K tokens | **30K tokens** | **6.9x** | +0.0% |
+| 8GB Laptop | Llama 8B (Q4) | 16K tokens | **61K tokens** | **3.8x** | +0.0% |
+| 24GB RTX 3090 | Llama 8B (Q4) | 147K tokens | **559K tokens** | **3.8x** | +0.0% |
+
+PPL measured on WikiText-2 with SmolLM2 1.7B as the base model, using the `uniform_4b K + Q4 V` config. See the [reproducible benchmark](bench/head_to_head/).
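+Here PPL Δ is read as the standard relative change, PPL_quant / PPL_fp16 − 1 (an assumption about the table's convention); +0.0% means the quantized cache matches the FP16 baseline at the reported precision.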
+
+## Why quant.cpp?
+
+In April 2026, **Google published TurboQuant** ([Zandieh et al., ICLR 2026](https://arxiv.org/abs/2504.19874)) — near-optimal KV cache compression at 3 bits. The paper is brilliant, but the open-source landscape is fragmented:
+
+- 🦀 [Rust implementation](https://github.com/RecursiveIntell/turbo-quant) — needs Cargo, can't ship to mobile
+- 🐍 [PyTorch implementation](https://github.com/tonbistudio/turboquant-pytorch) — needs Python + Torch runtime
+- 🔥 [Multiple llama.cpp forks](https://github.com/ggml-org/llama.cpp/discussions/20969) — none merged, no convergence
+- 📝 [Reference Python](https://github.com/scos-lab/turboquant) — research only
+
+**quant.cpp is the only single-header C implementation.** One file. Zero dependencies. Runs on a phone, in a browser, inside a game engine, on a microcontroller. The places the others can't go.
+
+> **TurboQuant for the data center? Use Google's reference.**
+> **TurboQuant for everywhere else? Use quant.cpp.**
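+
+"Single-header" means the usual stb-style pattern: exactly one translation unit defines an implementation macro before including quant.h, and every other file just includes the header. A minimal sketch, assuming the conventional macro name (`QUANT_IMPLEMENTATION` is an illustrative guess, not confirmed from quant.h):
+
+```c
+/* Hypothetical usage sketch of the stb-style single-header pattern.
+   QUANT_IMPLEMENTATION and the workflow in the comments are
+   illustrative assumptions, not quant.h's confirmed API. */
+#define QUANT_IMPLEMENTATION   /* define in exactly one .c file */
+#include "quant.h"
+
+int main(void) {
+    /* every other translation unit just does: #include "quant.h" */
+    /* ... load a GGUF model, run inference with a quantized KV cache ... */
+    return 0;
+}
+```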
 
 ## Get Started in 60 Seconds
 
@@ -107,19 +124,32 @@ On a 16GB Mac with Llama 3.2 3B: llama.cpp maxes out at ~50K tokens (FP16 KV). q |
 
 Both are per-block methods. The quality gap comes from block size (128 vs 32), min-max range encoding, independent K/V treatment, and delta compression — not from a fundamental design flaw in llama.cpp. At ~1.6x compression, llama.cpp Q8+Q5 is excellent. quant.cpp targets the **4-7x range** where the difference matters.
 
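+For intuition, per-block min-max range encoding at a 128-value block size looks roughly like this. A generic C sketch of the technique, with illustrative names, not quant.cpp's actual kernel:
+
+```c
+#include <stdint.h>
+
+#define BLOCK 128
+
+typedef struct {
+    float   min, scale;        /* per-block min-max range encoding */
+    uint8_t codes[BLOCK / 2];  /* two 4-bit codes packed per byte  */
+} block_q4;
+
+static void quantize_block_q4(const float *x, block_q4 *out) {
+    float lo = x[0], hi = x[0];
+    for (int i = 1; i < BLOCK; i++) {        /* find the block's range */
+        if (x[i] < lo) lo = x[i];
+        if (x[i] > hi) hi = x[i];
+    }
+    out->min   = lo;
+    out->scale = (hi - lo) / 15.0f;          /* 4 bits -> 16 levels */
+    const float inv = out->scale > 0.0f ? 1.0f / out->scale : 0.0f;
+    for (int i = 0; i < BLOCK; i += 2) {     /* round and pack pairs */
+        uint8_t a = (uint8_t)((x[i]     - lo) * inv + 0.5f);
+        uint8_t b = (uint8_t)((x[i + 1] - lo) * inv + 0.5f);
+        out->codes[i / 2] = (uint8_t)(a | (b << 4));
+    }
+}
+/* dequantize: x[i] ~= min + code * scale */
+```
+
+Under this layout a 128-value block costs 8 header bytes plus 64 code bytes (72 bytes vs 256 bytes in FP16, ~3.6x); a 32-value block pays the same 8-byte header four times as often, which is the block-size effect described above.
+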
-### vs every other engine
-
-| | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
-|:--|:---------:|:---------:|:----:|:---:|:-------:|
-| KV compression | **3.8-6.9x, +0% PPL** | 1.6x at ~+1% PPL | -- | -- | -- |
-| Code size | **72K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
-| Dependencies | **zero** | ggml | PyTorch | Apple fw | runtime |
-| Embeddable | **single header** | -- | -- | -- | complex |
-| WASM | **192KB** | -- | -- | -- | -- |
-| GPU serving | basic | full | **best** | Metal | multi |
-
-> **Use llama.cpp** when you need speed. **Use vLLM** when you need throughput.
-> **Use quant.cpp** when you need to fit more context in less memory — or embed LLM in your own app.
+### vs other TurboQuant implementations
+
+| | quant.cpp | turbo-quant (Rust) | turboquant-pytorch | scos-lab/turboquant |
+|:--|:---------:|:------------------:|:------------------:|:-------------------:|
+| Language | **Pure C11** | Rust | Python | Python |
+| Single-header | **✅ quant.h (628KB)** | ❌ Cargo crate | ❌ pip install | ❌ |
+| Dependencies | **libc + libm** | Rust toolchain | PyTorch + CUDA | PyTorch |
+| iOS / Android | **✅** | ❌ | ❌ | ❌ |
+| WASM (browser) | **✅ 192KB** | ❌ | ❌ | ❌ |
+| MCU / embedded | **✅** | ❌ | ❌ | ❌ |
+| Windows MSVC | **✅** | ✅ | (Python) | (Python) |
+| GGUF model loading | **✅ 7 architectures** | ❌ | ❌ | research only |
+| End-to-end inference | **✅** | kernel only | kernel only | kernel only |
+
+### vs production inference engines
+
+| | quant.cpp | llama.cpp | vLLM | MLX |
+|:--|:---------:|:---------:|:----:|:---:|
+| KV quantization | **TurboQuant + 6 schemes** | Q8_0/Q5_0 (2x) | -- | -- |
+| Code size | **72K LOC** | 250K+ | 100K+ | 50K+ |
+| Embeddable | **single header** | library | library | framework |
+| Read in an afternoon | **✅** | ❌ | ❌ | ❌ |
+| GPU throughput | basic | full | **best** | Metal |
+
+> **Use llama.cpp** for speed on a workstation. **Use vLLM** for batch serving.
+> **Use quant.cpp** when you need to ship LLM inference inside something — an app, a game, a website, a device.
 
 ---
 
@@ -428,11 +458,16 @@ Tested extensively (2-bit delta, NF2, online SVD, multi-hash). None reached acce |
 
 ---
 
-## References
+## References & Citations
+
+quant.cpp is an independent implementation of published research. Please cite the original papers:
+
+- **TurboQuant** — Zandieh, Daliri, Hadian, Mirrokni. *TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate*. ICLR 2026. [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)
+- **PolarQuant** — *Quantizing KV Caches with Polar Transformation*. AISTATS 2026. [arXiv:2502.02617](https://arxiv.org/abs/2502.02617)
+- **QJL** — *QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead*. AAAI 2025. [arXiv:2406.03482](https://arxiv.org/abs/2406.03482)
+- [Google Research blog post on TurboQuant](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/)
 
-- [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) — KV cache compression theory
-- [QJL](https://arxiv.org/abs/2406.03482) (AAAI 2025) — Quantized JL transform
-- [PolarQuant](https://arxiv.org/abs/2502.02617) (AISTATS 2026) — Polar coordinate quantization
+If you use quant.cpp in academic work, please cite both the underlying paper(s) and this repository.
 
 ---
 