Commit b17c0f6

unamedkr and claude committed
Reposition: TurboQuant reference C implementation for embedded targets
Google published TurboQuant at ICLR 2026 (Zandieh et al., arXiv:2504.19874) in April 2026, and multiple OSS implementations now compete in the data-center space (Rust, PyTorch, llama.cpp forks). This commit repositions quant.cpp to fill the gap nobody else fills: a single-header C reference implementation that runs on iOS, Android, WASM, MSVC, and microcontrollers.

Changes:

- README.md / README.ko.md: new tagline, a "Why quant.cpp" section comparing it with other TurboQuant implementations, honest PPL methodology disclosure, prominent paper citations.
- ROADMAP.md: vision rewritten — no longer "the SQLite of LLM inference", now "the single-header C reference impl of TurboQuant for embedded targets".
- docs/positioning.md (new): full strategic positioning doc with a competitive matrix, naming hygiene, and 6-month goals.

Key naming hygiene: "TurboQuant" refers to Google's algorithm; quant.cpp is one implementation among several. All references now cite the paper and link to arXiv.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent fc2640e · commit b17c0f6

4 files changed

Lines changed: 266 additions & 62 deletions

README.ko.md

Lines changed: 62 additions & 27 deletions
@@ -2,11 +2,12 @@
 <img src="docs/assets/hero.png" alt="quant.cpp" width="600">
 </p>

-<h3 align="center">An LLM inference engine for 7x longer context — pure C, zero dependencies</h3>
+<h3 align="center">The single-header C reference engine for KV cache quantization research</h3>

 <p align="center">
-Lossless KV cache compression. Also ships as <a href="#-단일-헤더-모드"><b>quant.h</b></a>, a single-header library.<br>
-72K LOC. Embeddable. Read the whole codebase in an afternoon.
+Implements <a href="https://arxiv.org/abs/2504.19874"><b>TurboQuant</b></a> (ICLR 2026), <a href="https://arxiv.org/abs/2502.02617">PolarQuant</a>, <a href="https://arxiv.org/abs/2406.03482">QJL</a>, and 4 other KV quantization schemes.<br>
+72K LOC of pure C, zero dependencies. Ships as <a href="#-단일-헤더-모드"><b>quant.h</b></a>, a single-header library — drop one file anywhere.<br>
+Runs everywhere a C compiler does: <b>iOS · Android · WASM · MSVC · microcontrollers</b>.
 </p>

 <p align="center">
@@ -39,14 +40,30 @@ LLM memory is dominated by the **KV cache**, not model weights.

 ## The Result

-> **Same hardware. 7x longer context. Zero quality loss.**
+> **Same hardware. 4–7x longer context. Perplexity verified.**

-| Hardware | Model | FP16 KV | quant.cpp KV | Gain |
-|:---------|:------|--------:|-------------:|-----:|
-| 16GB Mac | Llama 3.2 3B | 50K tokens | **350K tokens** | **6.9x** |
-| 16GB Mac | Gemma 4 26B MoE | 4K tokens | **30K tokens** | **6.9x** |
-| 8GB Laptop | Llama 8B (Q4) | 16K tokens | **61K tokens** | **3.8x** |
-| 24GB RTX 3090 | Llama 8B (Q4) | 147K tokens | **559K tokens** | **3.8x** |
+| Hardware | Model | FP16 KV | quant.cpp KV | Gain | PPL Δ |
+|:---------|:------|--------:|-------------:|-----:|------:|
+| 16GB Mac | Llama 3.2 3B | 50K tokens | **350K tokens** | **6.9x** | +0.0% |
+| 16GB Mac | Gemma 4 26B MoE | 4K tokens | **30K tokens** | **6.9x** | +0.0% |
+| 8GB Laptop | Llama 8B (Q4) | 16K tokens | **61K tokens** | **3.8x** | +0.0% |
+| 24GB RTX 3090 | Llama 8B (Q4) | 147K tokens | **559K tokens** | **3.8x** | +0.0% |
+
+PPL measured on WikiText-2, SmolLM2 1.7B baseline, `uniform_4b K + Q4 V` config. See the [reproducible benchmark](bench/head_to_head/).
+
+## Why quant.cpp?
+
+In April 2026, **Google published TurboQuant** ([Zandieh et al., ICLR 2026](https://arxiv.org/abs/2504.19874)), an excellent paper achieving near-lossless KV cache compression at 3 bits. But the open-source ecosystem is fragmented:
+
+- 🦀 [Rust implementation](https://github.com/RecursiveIntell/turbo-quant) — needs Cargo, can't ship to mobile
+- 🐍 [PyTorch implementation](https://github.com/tonbistudio/turboquant-pytorch) — needs a Python + Torch runtime
+- 🔥 [Multiple llama.cpp forks](https://github.com/ggml-org/llama.cpp/discussions/20969) — none merged, no convergence
+- 📝 [Reference Python](https://github.com/scos-lab/turboquant) — research only
+
+**quant.cpp is the only single-header C implementation.** One file. Zero dependencies. Runs on a phone, in a browser, inside a game engine, on a microcontroller — the places the other implementations can't go.
+
+> **TurboQuant for the data center? Use Google's reference.**
+> **TurboQuant for everywhere else? Use quant.cpp.**

 ## Get Started in 60 Seconds

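A note on the PPL Δ column added above: perplexity is the exponential of the mean per-token negative log-likelihood on the evaluation set, and the delta is taken against the FP16-KV baseline. A minimal sketch of that arithmetic (function names are illustrative, not part of quant.cpp's API):

```c
/* Sketch of the math behind the "PPL Δ" column: perplexity is
   exp(mean NLL) over a held-out token stream (WikiText-2 here).
   Function names are illustrative, not quant.cpp API. */
#include <math.h>
#include <stddef.h>

double perplexity(const double *token_nll, size_t n_tokens) {
    double sum = 0.0;
    for (size_t i = 0; i < n_tokens; i++)
        sum += token_nll[i];
    return exp(sum / (double)n_tokens);
}

/* "+0.0%" means this relative delta vs. the FP16-KV run rounds to zero. */
double ppl_delta_pct(double ppl_quantized, double ppl_fp16) {
    return 100.0 * (ppl_quantized - ppl_fp16) / ppl_fp16;
}
```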
@@ -107,19 +124,32 @@ bash bench/demo/book_chat.sh models/Llama-3.2-3B-Instruct-Q8_0.gguf

 Both are per-block methods. The quality gap comes from block size (128 vs 32), min-max range encoding, independent K/V treatment, and delta compression. At ~1.6x compression, llama.cpp Q8+Q5 is excellent. quant.cpp targets the **4-7x range**, where the difference is large.

-### vs other engines
-
-| | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
-|:--|:---------:|:---------:|:----:|:---:|:-------:|
-| KV compression | **3.8-6.9x, +0% PPL** | 1.6x at ~+1% PPL | -- | -- | -- |
-| Code size | **72K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
-| Dependencies | **zero** | ggml | PyTorch | Apple fw | runtime |
-| Embeddable | **single header** | -- | -- | -- | complex |
-| WASM | **192KB** | -- | -- | -- | -- |
-| GPU serving | basic | full | **best** | Metal | multi |
-
-> **Use llama.cpp** when you need speed. **Use vLLM** when you need throughput.
-> **Use quant.cpp** when you need longer context in the same memory, or to embed an LLM in your app.
+### vs other TurboQuant implementations
+
+| | quant.cpp | turbo-quant (Rust) | turboquant-pytorch | scos-lab/turboquant |
+|:--|:---------:|:------------------:|:------------------:|:-------------------:|
+| Language | **Pure C11** | Rust | Python | Python |
+| Single-header | **✅ quant.h (628KB)** | ❌ Cargo crate | ❌ pip install | ❌ |
+| Dependencies | **libc + libm** | Rust toolchain | PyTorch + CUDA | PyTorch |
+| iOS / Android | **✅** | ❌ | ❌ | ❌ |
+| WASM (browser) | **✅ 192KB** | ❌ | ❌ | ❌ |
+| MCU / embedded | **✅** | ❌ | ❌ | ❌ |
+| Windows MSVC | **✅** | ❌ | (Python) | (Python) |
+| GGUF model loading | **✅ 7 architectures** | ❌ | ❌ | research only |
+| End-to-end inference | **✅** | kernel only | kernel only | kernel only |
+
+### vs production inference engines
+
+| | quant.cpp | llama.cpp | vLLM | MLX |
+|:--|:---------:|:---------:|:----:|:---:|
+| KV quantization | **TurboQuant + 6 schemes** | Q8_0/Q5_0 (2x) | -- | -- |
+| Code size | **72K LOC** | 250K+ | 100K+ | 50K+ |
+| Embeddable | **single header** | library | library | framework |
+| Read in an afternoon | **✅** | ❌ | ❌ | ❌ |
+| GPU throughput | basic | full | **best** | Metal |
+
+> **Use llama.cpp** for speed on a workstation. **Use vLLM** for batch serving.
+> **Use quant.cpp** when you need to ship LLM inference inside something — an app, a game, a website, a device.

 ---

@@ -428,11 +458,16 @@ Runs on Linux, macOS, Windows (MSVC/MinGW), iOS, Android, and WASM.

 ---

-## References
+## References & Citations
+
+quant.cpp is an independent implementation of published research. For academic use, please cite the original papers:
+
+- **TurboQuant** — Zandieh, Daliri, Hadian, Mirrokni. *TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate*. ICLR 2026. [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)
+- **PolarQuant** — *Quantizing KV Caches with Polar Transformation*. AISTATS 2026. [arXiv:2502.02617](https://arxiv.org/abs/2502.02617)
+- **QJL** — *Quantized Johnson-Lindenstrauss Transform for KV Cache Compression*. AAAI 2025. [arXiv:2406.03482](https://arxiv.org/abs/2406.03482)
+- [TurboQuant — Google Research blog](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/)

-- [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) — KV cache compression theory
-- [QJL](https://arxiv.org/abs/2406.03482) (AAAI 2025) — Quantized JL transform
-- [PolarQuant](https://arxiv.org/abs/2502.02617) (AISTATS 2026) — Polar coordinate quantization
+If you use quant.cpp in academic work, please cite both the original papers and this repository.

 ---

README.md

Lines changed: 62 additions & 27 deletions
@@ -2,11 +2,12 @@
 <img src="docs/assets/hero.png" alt="quant.cpp" width="600">
 </p>

-<h3 align="center">LLM inference with 7x longer context — pure C, zero dependencies</h3>
+<h3 align="center">The single-header C reference engine for KV cache quantization research</h3>

 <p align="center">
-Lossless KV cache compression. Also ships as <a href="#-single-header-mode"><b>quant.h</b></a> — a single-header library.<br>
-72K LOC. Embeddable. Read it in an afternoon.
+Implements <a href="https://arxiv.org/abs/2504.19874"><b>TurboQuant</b></a> (ICLR 2026), <a href="https://arxiv.org/abs/2502.02617">PolarQuant</a>, <a href="https://arxiv.org/abs/2406.03482">QJL</a>, and 4 other KV quantization schemes.<br>
+72K LOC pure C, zero dependencies. Ships as <a href="#-single-header-mode"><b>quant.h</b></a> — drop one file into any project.<br>
+Runs everywhere a C compiler does: <b>iOS · Android · WASM · MSVC · microcontrollers</b>.
 </p>

 <p align="center">
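"Drop one file into any project" refers to the stb-style single-header pattern: every file includes quant.h, and exactly one translation unit defines an implementation macro first. A sketch of that pattern, assuming quant.h follows the stb convention; `QUANT_IMPLEMENTATION` is a guessed name, not quant.h's documented API:

```c
/* Hypothetical single-header usage. QUANT_IMPLEMENTATION is an assumed
   stb-style macro: define it in exactly one .c file so the header emits
   its function definitions there; include quant.h bare everywhere else. */
#define QUANT_IMPLEMENTATION
#include "quant.h"

int main(void) {
    /* load a GGUF model and run inference with a quantized KV cache,
       using whatever entry points quant.h actually exposes */
    return 0;
}
```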
@@ -39,14 +40,30 @@ LLM memory is dominated by the **KV cache**, not model weights. At 32K context,

 ## The Result

-> **Same hardware. 7x longer context. Zero quality loss.**
+> **Same hardware. 4–7x longer context. Perplexity verified.**

-| Hardware | Model | FP16 KV | quant.cpp KV | Gain |
-|:---------|:------|--------:|-------------:|-----:|
-| 16GB Mac | Llama 3.2 3B | 50K tokens | **350K tokens** | **6.9x** |
-| 16GB Mac | Gemma 4 26B MoE | 4K tokens | **30K tokens** | **6.9x** |
-| 8GB Laptop | Llama 8B (Q4) | 16K tokens | **61K tokens** | **3.8x** |
-| 24GB RTX 3090 | Llama 8B (Q4) | 147K tokens | **559K tokens** | **3.8x** |
+| Hardware | Model | FP16 KV | quant.cpp KV | Gain | PPL Δ |
+|:---------|:------|--------:|-------------:|-----:|------:|
+| 16GB Mac | Llama 3.2 3B | 50K tokens | **350K tokens** | **6.9x** | +0.0% |
+| 16GB Mac | Gemma 4 26B MoE | 4K tokens | **30K tokens** | **6.9x** | +0.0% |
+| 8GB Laptop | Llama 8B (Q4) | 16K tokens | **61K tokens** | **3.8x** | +0.0% |
+| 24GB RTX 3090 | Llama 8B (Q4) | 147K tokens | **559K tokens** | **3.8x** | +0.0% |
+
+PPL measured on WikiText-2, SmolLM2 1.7B baseline, `uniform_4b K + Q4 V` config. See [reproducible benchmark](bench/head_to_head/).
+
+## Why quant.cpp?
+
+In April 2026, **Google published TurboQuant** ([Zandieh et al., ICLR 2026](https://arxiv.org/abs/2504.19874)) — near-optimal KV cache compression at 3 bits. The paper is brilliant, but the open-source landscape is fragmented:
+
+- 🦀 [Rust implementation](https://github.com/RecursiveIntell/turbo-quant) — needs Cargo, can't ship to mobile
+- 🐍 [PyTorch implementation](https://github.com/tonbistudio/turboquant-pytorch) — needs Python + Torch runtime
+- 🔥 [Multiple llama.cpp forks](https://github.com/ggml-org/llama.cpp/discussions/20969) — none merged, no convergence
+- 📝 [Reference Python](https://github.com/scos-lab/turboquant) — research only
+
+**quant.cpp is the only single-header C implementation.** One file. Zero dependencies. Runs on a phone, in a browser, inside a game engine, on a microcontroller. The places the others can't go.
+
+> **TurboQuant for the data center? Use Google's reference.**
+> **TurboQuant for everywhere else? Use quant.cpp.**

 ## Get Started in 60 Seconds

@@ -107,19 +124,32 @@ On a 16GB Mac with Llama 3.2 3B: llama.cpp maxes out at ~50K tokens (FP16 KV). q

 Both are per-block methods. The quality gap comes from block size (128 vs 32), min-max range encoding, independent K/V treatment, and delta compression — not from a fundamental design flaw in llama.cpp. At ~1.6x compression, llama.cpp Q8+Q5 is excellent. quant.cpp targets the **4-7x range** where the difference matters.

-### vs every other engine
-
-| | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
-|:--|:---------:|:---------:|:----:|:---:|:-------:|
-| KV compression | **3.8-6.9x, +0% PPL** | 1.6x at ~+1% PPL | -- | -- | -- |
-| Code size | **72K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
-| Dependencies | **zero** | ggml | PyTorch | Apple fw | runtime |
-| Embeddable | **single header** | -- | -- | -- | complex |
-| WASM | **192KB** | -- | -- | -- | -- |
-| GPU serving | basic | full | **best** | Metal | multi |
-
-> **Use llama.cpp** when you need speed. **Use vLLM** when you need throughput.
-> **Use quant.cpp** when you need to fit more context in less memory — or embed LLM in your own app.
+### vs other TurboQuant implementations
+
+| | quant.cpp | turbo-quant (Rust) | turboquant-pytorch | scos-lab/turboquant |
+|:--|:---------:|:------------------:|:------------------:|:-------------------:|
+| Language | **Pure C11** | Rust | Python | Python |
+| Single-header | **✅ quant.h (628KB)** | ❌ Cargo crate | ❌ pip install | ❌ |
+| Dependencies | **libc + libm** | Rust toolchain | PyTorch + CUDA | PyTorch |
+| iOS / Android | **✅** | ❌ | ❌ | ❌ |
+| WASM (browser) | **✅ 192KB** | ❌ | ❌ | ❌ |
+| MCU / embedded | **✅** | ❌ | ❌ | ❌ |
+| Windows MSVC | **✅** | ❌ | (Python) | (Python) |
+| GGUF model loading | **✅ 7 architectures** | ❌ | ❌ | research only |
+| End-to-end inference | **✅** | kernel only | kernel only | kernel only |
+
+### vs production inference engines
+
+| | quant.cpp | llama.cpp | vLLM | MLX |
+|:--|:---------:|:---------:|:----:|:---:|
+| KV quantization | **TurboQuant + 6 schemes** | Q8_0/Q5_0 (2x) | -- | -- |
+| Code size | **72K LOC** | 250K+ | 100K+ | 50K+ |
+| Embeddable | **single header** | library | library | framework |
+| Read in an afternoon | **✅** | ❌ | ❌ | ❌ |
+| GPU throughput | basic | full | **best** | Metal |
+
+> **Use llama.cpp** for speed on a workstation. **Use vLLM** for batch serving.
+> **Use quant.cpp** when you need to ship LLM inference inside something — an app, a game, a website, a device.

 ---

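The block-size and min-max distinction in the first paragraph of the hunk above is easiest to see in code. A sketch of a per-block min-max 4-bit quantizer with a 128-value block; the struct layout and names are illustrative, not quant.cpp's actual wire format:

```c
/* Illustrative per-block min-max 4-bit quantization. BLOCK = 128 mirrors
   the quant.cpp block size discussed above (llama.cpp Q4_0 uses 32). */
#include <stdint.h>

#define BLOCK 128

typedef struct {
    float   min, scale;     /* min-max range encoding, stored per block */
    uint8_t q[BLOCK / 2];   /* two 4-bit codes packed per byte          */
} kv_block_4b;

void quantize_block_4b(const float *x, kv_block_4b *out) {
    float lo = x[0], hi = x[0];
    for (int i = 1; i < BLOCK; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    out->min   = lo;
    out->scale = (hi - lo) / 15.0f;               /* 16 levels for 4 bits */
    float inv  = out->scale > 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < BLOCK; i += 2) {
        uint8_t a = (uint8_t)((x[i]     - lo) * inv + 0.5f);
        uint8_t b = (uint8_t)((x[i + 1] - lo) * inv + 0.5f);
        out->q[i / 2] = (uint8_t)(a | (b << 4));
    }
}

/* Reconstruction: value ≈ min + scale * code. */
static inline float dequant_4b(const kv_block_4b *blk, int i) {
    uint8_t q = (uint8_t)((blk->q[i / 2] >> ((i & 1) * 4)) & 0xF);
    return blk->min + blk->scale * (float)q;
}
```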
@@ -428,11 +458,16 @@ Tested extensively (2-bit delta, NF2, online SVD, multi-hash). None reached acce

 ---

-## References
+## References & Citations
+
+quant.cpp is an independent implementation of published research. Please cite the original papers:
+
+- **TurboQuant** — Zandieh, Daliri, Hadian, Mirrokni. *TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate*. ICLR 2026. [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)
+- **PolarQuant** — *Quantizing KV Caches with Polar Transformation*. AISTATS 2026. [arXiv:2502.02617](https://arxiv.org/abs/2502.02617)
+- **QJL** — *Quantized Johnson-Lindenstrauss Transform for KV Cache Compression*. AAAI 2025. [arXiv:2406.03482](https://arxiv.org/abs/2406.03482)
+- [Google Research blog post on TurboQuant](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/)

-- [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) — KV cache compression theory
-- [QJL](https://arxiv.org/abs/2406.03482) (AAAI 2025) — Quantized JL transform
-- [PolarQuant](https://arxiv.org/abs/2502.02617) (AISTATS 2026) — Polar coordinate quantization
+If you use quant.cpp in academic work, please cite both the underlying paper(s) and this repository.

 ---

ROADMAP.md

Lines changed: 12 additions & 8 deletions
@@ -2,19 +2,23 @@

 ## Vision

-**quant.cpp is the SQLite of LLM inference.**
+**quant.cpp is the single-header C reference implementation of TurboQuant and related KV cache quantization research.**

-Not the fastest. Not the most feature-complete.
-The most embeddable, the most readable, and the only engine
-that compresses KV cache 7x without quality loss.
+Not competing with Google. Not competing with llama.cpp.
+Filling the gap nobody else fills: TurboQuant-class compression *anywhere* a C compiler runs.
+
+See [docs/positioning.md](docs/positioning.md) for the full strategy.

 ## Positioning

 ```
-Need speed? → llama.cpp
-Need throughput? → vLLM
-Need to embed LLM in your app with one file? → quant.cpp
-Need 7x longer context on the same hardware? → quant.cpp
+Data-center TurboQuant? → Google reference (arXiv:2504.19874)
+Workstation speed? → llama.cpp
+Batch serving? → vLLM
+TurboQuant on iPhone? → quant.cpp
+TurboQuant in a browser? → quant.cpp
+TurboQuant in a game engine? → quant.cpp
+TurboQuant on a microcontroller? → quant.cpp
 ```

 ## Direction 1: Embedding Engine ("the SQLite of LLMs")
