 <img src="docs/assets/hero.png" alt="quant.cpp" width="600">
 </p>
 
-<h3 align="center">LLM inference with 7x longer context — pure C, zero dependencies</h3>
+<h3 align="center">The single-header C reference engine for KV cache quantization research</h3>
 
 <p align="center">
-  Lossless KV cache compression. Also ships as <a href="#-single-header-mode"><b>quant.h</b></a> — a single-header library.<br>
-  72K LOC. Embeddable. Read it in an afternoon.
+  Implements <a href="https://arxiv.org/abs/2504.19874"><b>TurboQuant</b></a> (ICLR 2026), <a href="https://arxiv.org/abs/2502.02617">PolarQuant</a>, <a href="https://arxiv.org/abs/2406.03482">QJL</a>, and 4 other KV quantization schemes.<br>
+  72K LOC pure C, zero dependencies. Ships as <a href="#-single-header-mode"><b>quant.h</b></a> — drop one file into any project.<br>
+  Runs everywhere a C compiler does: <b>iOS · Android · WASM · MSVC · microcontrollers</b>.
 </p>
 
 <p align="center">
@@ -39,14 +40,30 @@ LLM memory is dominated by the **KV cache**, not model weights. At 32K context, |
 
 ## The Result
 
-> **Same hardware. 7x longer context. Zero quality loss.**
+> **Same hardware. 4–7x longer context. Perplexity verified at +0.0%.**
 
-| Hardware | Model | FP16 KV | quant.cpp KV | Gain |
-|:---------|:------|--------:|-------------:|-----:|
-| 16GB Mac | Llama 3.2 3B | 50K tokens | **350K tokens** | **6.9x** |
-| 16GB Mac | Gemma 4 26B MoE | 4K tokens | **30K tokens** | **6.9x** |
-| 8GB Laptop | Llama 8B (Q4) | 16K tokens | **61K tokens** | **3.8x** |
-| 24GB RTX 3090 | Llama 8B (Q4) | 147K tokens | **559K tokens** | **3.8x** |
+| Hardware | Model | FP16 KV | quant.cpp KV | Gain | PPL Δ |
+|:---------|:------|--------:|-------------:|-----:|------:|
+| 16GB Mac | Llama 3.2 3B | 50K tokens | **350K tokens** | **6.9x** | +0.0% |
+| 16GB Mac | Gemma 4 26B MoE | 4K tokens | **30K tokens** | **6.9x** | +0.0% |
+| 8GB Laptop | Llama 8B (Q4) | 16K tokens | **61K tokens** | **3.8x** | +0.0% |
+| 24GB RTX 3090 | Llama 8B (Q4) | 147K tokens | **559K tokens** | **3.8x** | +0.0% |
+
+PPL measured on WikiText-2 with SmolLM2 1.7B as the base model, using the `uniform_4b K + Q4 V` config. See the [reproducible benchmark](bench/head_to_head/).
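+Here PPL Δ is read as the standard relative change, PPL_quant / PPL_fp16 − 1 (an assumption about the table's convention); +0.0% means the quantized cache matches the FP16 baseline at the reported precision.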
+
+## Why quant.cpp?
+
+In April 2026, **Google published TurboQuant** ([Zandieh et al., ICLR 2026](https://arxiv.org/abs/2504.19874)) — near-optimal KV cache compression at 3 bits. The paper is brilliant, but the open-source landscape is fragmented:
+
+- 🦀 [Rust implementation](https://github.com/RecursiveIntell/turbo-quant) — needs Cargo, can't ship to mobile
+- 🐍 [PyTorch implementation](https://github.com/tonbistudio/turboquant-pytorch) — needs Python + Torch runtime
+- 🔥 [Multiple llama.cpp forks](https://github.com/ggml-org/llama.cpp/discussions/20969) — none merged, no convergence
+- 📝 [Reference Python](https://github.com/scos-lab/turboquant) — research only
+
+**quant.cpp is the only single-header C implementation.** One file. Zero dependencies. Runs on a phone, in a browser, inside a game engine, on a microcontroller. The places the others can't go.
+
+> **TurboQuant for the data center? Use Google's reference.**
+> **TurboQuant for everywhere else? Use quant.cpp.**
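+
+"Single-header" means the usual stb-style pattern: exactly one translation unit defines an implementation macro before including quant.h, and every other file just includes the header. A minimal sketch, assuming the conventional macro name (`QUANT_IMPLEMENTATION` is an illustrative guess, not confirmed from quant.h):
+
+```c
+/* Hypothetical usage sketch of the stb-style single-header pattern.
+   QUANT_IMPLEMENTATION and the workflow in the comments are
+   illustrative assumptions, not quant.h's confirmed API. */
+#define QUANT_IMPLEMENTATION   /* define in exactly one .c file */
+#include "quant.h"
+
+int main(void) {
+    /* every other translation unit just does: #include "quant.h" */
+    /* ... load a GGUF model, run inference with a quantized KV cache ... */
+    return 0;
+}
+```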
 
 ## Get Started in 60 Seconds
 
@@ -107,19 +124,32 @@ On a 16GB Mac with Llama 3.2 3B: llama.cpp maxes out at ~50K tokens (FP16 KV). q |
 
 Both are per-block methods. The quality gap comes from block size (128 vs 32), min-max range encoding, independent K/V treatment, and delta compression — not from a fundamental design flaw in llama.cpp. At ~1.6x compression, llama.cpp Q8+Q5 is excellent. quant.cpp targets the **4-7x range** where the difference matters.
 
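+For intuition, per-block min-max range encoding at a 128-value block size looks roughly like this. A generic C sketch of the technique, with illustrative names, not quant.cpp's actual kernel:
+
+```c
+#include <stdint.h>
+
+#define BLOCK 128
+
+typedef struct {
+    float   min, scale;        /* per-block min-max range encoding */
+    uint8_t codes[BLOCK / 2];  /* two 4-bit codes packed per byte  */
+} block_q4;
+
+static void quantize_block_q4(const float *x, block_q4 *out) {
+    float lo = x[0], hi = x[0];
+    for (int i = 1; i < BLOCK; i++) {        /* find the block's range */
+        if (x[i] < lo) lo = x[i];
+        if (x[i] > hi) hi = x[i];
+    }
+    out->min   = lo;
+    out->scale = (hi - lo) / 15.0f;          /* 4 bits -> 16 levels */
+    const float inv = out->scale > 0.0f ? 1.0f / out->scale : 0.0f;
+    for (int i = 0; i < BLOCK; i += 2) {     /* round and pack pairs */
+        uint8_t a = (uint8_t)((x[i]     - lo) * inv + 0.5f);
+        uint8_t b = (uint8_t)((x[i + 1] - lo) * inv + 0.5f);
+        out->codes[i / 2] = (uint8_t)(a | (b << 4));
+    }
+}
+/* dequantize: x[i] ~= min + code * scale */
+```
+
+Under this layout a 128-value block costs 8 header bytes plus 64 code bytes (72 bytes vs 256 bytes in FP16, ~3.6x); a 32-value block pays the same 8-byte header four times as often, which is the block-size effect described above.
+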
-### vs every other engine
-
-| | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
-|:--|:---------:|:---------:|:----:|:---:|:-------:|
-| KV compression | **3.8-6.9x, +0% PPL** | 1.6x at ~+1% PPL | -- | -- | -- |
-| Code size | **72K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
-| Dependencies | **zero** | ggml | PyTorch | Apple fw | runtime |
-| Embeddable | **single header** | -- | -- | -- | complex |
-| WASM | **192KB** | -- | -- | -- | -- |
-| GPU serving | basic | full | **best** | Metal | multi |
-
-> **Use llama.cpp** when you need speed. **Use vLLM** when you need throughput.
-> **Use quant.cpp** when you need to fit more context in less memory — or embed LLM in your own app.
+### vs other TurboQuant implementations
+
+| | quant.cpp | turbo-quant (Rust) | turboquant-pytorch | scos-lab/turboquant |
+|:--|:---------:|:------------------:|:------------------:|:-------------------:|
+| Language | **Pure C11** | Rust | Python | Python |
+| Single-header | **✅ quant.h (628KB)** | ❌ Cargo crate | ❌ pip install | ❌ |
+| Dependencies | **libc + libm** | Rust toolchain | PyTorch + CUDA | PyTorch |
+| iOS / Android | **✅** | ❌ | ❌ | ❌ |
+| WASM (browser) | **✅ 192KB** | ❌ | ❌ | ❌ |
+| MCU / embedded | **✅** | ❌ | ❌ | ❌ |
+| Windows MSVC | **✅** | ✅ | (Python) | (Python) |
+| GGUF model loading | **✅ 7 architectures** | ❌ | ❌ | research only |
+| End-to-end inference | **✅** | kernel only | kernel only | kernel only |
+
+### vs production inference engines
+
+| | quant.cpp | llama.cpp | vLLM | MLX |
+|:--|:---------:|:---------:|:----:|:---:|
+| KV quantization | **TurboQuant + 6 schemes** | Q8_0/Q5_0 (2x) | -- | -- |
+| Code size | **72K LOC** | 250K+ | 100K+ | 50K+ |
+| Embeddable | **single header** | library | library | framework |
+| Read in an afternoon | **✅** | ❌ | ❌ | ❌ |
+| GPU throughput | basic | full | **best** | Metal |
+
+> **Use llama.cpp** for speed on a workstation. **Use vLLM** for batch serving.
+> **Use quant.cpp** when you need to ship LLM inference inside something — an app, a game, a website, a device.
 
 ---
 
@@ -428,11 +458,16 @@ Tested extensively (2-bit delta, NF2, online SVD, multi-hash). None reached acce |
 
 ---
 
-## References
+## References & Citations
+
+quant.cpp is an independent implementation of published research. Please cite the original papers:
+
+- **TurboQuant** — Zandieh, Daliri, Hadian, Mirrokni. *TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate*. ICLR 2026. [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)
+- **PolarQuant** — *Quantizing KV Caches with Polar Transformation*. AISTATS 2026. [arXiv:2502.02617](https://arxiv.org/abs/2502.02617)
+- **QJL** — *QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead*. AAAI 2025. [arXiv:2406.03482](https://arxiv.org/abs/2406.03482)
+- [Google Research blog post on TurboQuant](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/)
 
-- [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) — KV cache compression theory
-- [QJL](https://arxiv.org/abs/2406.03482) (AAAI 2025) — Quantized JL transform
-- [PolarQuant](https://arxiv.org/abs/2502.02617) (AISTATS 2026) — Polar coordinate quantization
+If you use quant.cpp in academic work, please cite both the underlying paper(s) and this repository.
 
 ---
 