# Show HN: quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac)

**URL**: https://github.com/quantumaikr/quant.cpp

## Title (≤80 chars)
Show HN: quant.cpp – 7x longer LLM context via KV cache compression, pure C

## Post

I built a minimal LLM inference engine in pure C (67K LOC, zero dependencies) with one goal: **extend context length without adding hardware**.

**The key insight**: at long context, LLM inference memory is dominated by the KV cache, not the model weights. Compressing the KV cache to 4-bit keys + Q4 values gives **6.9x memory reduction** with negligible quality loss.

**Real numbers on a 16GB Mac (M1 Pro)**:

| Model | FP16 KV (llama.cpp) | Compressed KV (quant.cpp) | Gain |
|-------|---------------------|---------------------------|------|
| Llama 3.2 3B | ~50K tokens | **~350K tokens** | 6.9x |
| Gemma 4 26B-A4B (MoE) | ~4K tokens | **~30K tokens** | 6.9x |

**How it works** (rough sketch below):
- Keys: uniform 4-bit min-max quantization per 128-element block
- Values: Q4 nibble quantization with per-block scales
- Delta mode: store key[t] - key[t-1] instead of absolute keys (like video P-frames), enabling 3-bit at +1.3% PPL
- QK-norm aware: models like Gemma 4 automatically use FP32 keys + Q4 values (sparse key distributions break low-bit quantization)

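To make the key path concrete, here is a minimal sketch of per-block 4-bit min-max quantization as described above. It illustrates the technique, not the quant.cpp source: the struct layout and names (`kblock4`, `quantize_key_block`) are hypothetical, and values get a similar nibble-packed treatment with their own per-block scales.

```c
#include <math.h>
#include <stdint.h>

#define BLOCK 128   /* elements per quantization block, per the list above */

/* Hypothetical block layout: 4-bit codes plus a per-block min and scale. */
typedef struct {
    float   min;               /* block minimum (zero point) */
    float   scale;             /* (max - min) / 15 */
    uint8_t codes[BLOCK / 2];  /* two 4-bit codes packed per byte */
} kblock4;

/* Uniform min-max quantization of one 128-element key block to 4 bits. */
static void quantize_key_block(const float *x, kblock4 *out)
{
    float lo = x[0], hi = x[0];
    for (int i = 1; i < BLOCK; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    out->min   = lo;
    out->scale = (hi > lo) ? (hi - lo) / 15.0f : 1.0f;
    for (int i = 0; i < BLOCK; i += 2) {
        uint8_t a = (uint8_t)lroundf((x[i]     - lo) / out->scale);  /* 0..15 */
        uint8_t b = (uint8_t)lroundf((x[i + 1] - lo) / out->scale);
        out->codes[i / 2] = (uint8_t)(a | (b << 4));
    }
}

/* Dequantize element i of a block during attention. */
static inline float dequant_key(const kblock4 *b, int i)
{
    uint8_t byte = b->codes[i / 2];
    uint8_t code = (i & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0F);
    return b->min + b->scale * (float)code;
}
```

The per-block min and scale are what make 4 bits workable here: 8 bytes of metadata amortized over 128 elements.
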
**Quality** (WikiText-2 PPL, SmolLM2 1.7B):
- FP32 baseline: 14.63
- 4-bit K + Q4 V: 14.57 (-0.4% vs baseline, effectively no loss)
- Delta 3-bit K + Q4 V: 14.82 (+1.3%)

**vs llama.cpp Q4 KV**: llama.cpp's Q4_0 KV cache costs +10.6% PPL; quant.cpp stays at baseline. Same bit budget, an order of magnitude less degradation.

**Code philosophy**: 67K lines of C11. No frameworks, no CUDA required. The full forward pass fits in one file. Ships as a single-header `quant.h` (15K LOC) you can drop into any C project.

Supported models: Llama 3.2, Qwen 3.5, Gemma 3/4, MoE (128 experts).

```bash
./quant model.gguf -p "hello" -k uniform_4b -v q4 # that's it
```

Feedback welcome. Particularly interested in: (1) what context length you'd need for your use case, (2) which models to prioritize next.

## Talking Points for Comments

- **"Why not just use llama.cpp?"** — llama.cpp is fast. quant.cpp goes further on memory. Same model, same hardware: llama.cpp runs out of KV-cache memory around 4K context on the 26B MoE, while quant.cpp keeps going to ~30K. Different tools for different problems.

- **"How does delta compression work?"** — Adjacent key vectors differ by ~30% of their range. Instead of storing absolute 3-bit keys (PPL +62%), we store 3-bit deltas (PPL +1.3%). Every 64 tokens, an FP32 I-frame prevents drift. Same idea as video compression; there's a small self-contained sketch at the end of this section.

- **"What about GPU?"** — Metal shaders included and working. But the bigger win is memory: GPU VRAM is even more constrained than system RAM, making KV compression even more valuable there.

- **"Single header?"** — Yes, `quant.h` is 15K lines. `#define QUANT_IMPLEMENTATION` in one .c file, compile with `cc app.c -lm -lpthread`. Full GGUF loading, tokenization, and inference. No cmake, no build system.
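
For anyone unfamiliar with the pattern, a minimal `app.c` looks roughly like this (the specific API calls are omitted rather than guessed; they're declared in `quant.h`):

```c
/* app.c - single-header usage, stb-style.
 * Exactly one translation unit defines QUANT_IMPLEMENTATION so the function
 * bodies in quant.h are compiled here; every other file that includes the
 * header only sees declarations. */
#define QUANT_IMPLEMENTATION
#include "quant.h"

int main(int argc, char **argv) {
    (void)argc; (void)argv;
    /* ...load the GGUF, pick the KV-cache modes (e.g. 4-bit keys + Q4 values),
     * and run generation using the functions declared in quant.h... */
    return 0;
}
```

Then `cc app.c -lm -lpthread` and you're done.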
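
And here is the delta-mode sketch promised above: a toy, self-contained version of the P-frame/I-frame idea, assuming an FP32 I-frame every 64 tokens as in the post. The struct layout and names are illustrative (the 3-bit codes are even left unpacked, one byte each, for readability); this is not the quant.cpp implementation.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

#define KEY_DIM      128   /* per-head key dimension (illustrative) */
#define IFRAME_EVERY  64   /* FP32 key every 64 tokens, per the post */

/* One stored key: either a full FP32 "I-frame" or a 3-bit delta "P-frame"
 * against the previously reconstructed key. */
typedef struct {
    int     is_iframe;
    float   iframe[KEY_DIM];   /* valid when is_iframe */
    float   dmin, dscale;      /* delta min / scale when !is_iframe */
    uint8_t dcode[KEY_DIM];    /* 3-bit codes, left unpacked for clarity */
} key_frame;

/* Encode the key for token t. `prev` carries the last *reconstructed* key so
 * that quantization error cannot drift past the next I-frame. */
static void encode_key(int t, const float *key, float *prev, key_frame *out)
{
    if (t % IFRAME_EVERY == 0) {             /* I-frame: store exactly */
        out->is_iframe = 1;
        memcpy(out->iframe, key, sizeof out->iframe);
        memcpy(prev, key, sizeof(float) * KEY_DIM);
        return;
    }
    out->is_iframe = 0;                      /* P-frame: quantize key - prev */
    float lo = key[0] - prev[0], hi = lo;
    for (int i = 1; i < KEY_DIM; i++) {
        float d = key[i] - prev[i];
        if (d < lo) lo = d;
        if (d > hi) hi = d;
    }
    out->dmin   = lo;
    out->dscale = (hi > lo) ? (hi - lo) / 7.0f : 1.0f;   /* 3 bits: codes 0..7 */
    for (int i = 0; i < KEY_DIM; i++) {
        float   d = key[i] - prev[i];
        uint8_t c = (uint8_t)lroundf((d - lo) / out->dscale);
        out->dcode[i] = c;
        prev[i] += lo + out->dscale * (float)c;   /* keep prev = reconstruction */
    }
}
```

The important detail is that `prev` tracks the reconstructed key rather than the true one, so the encoder and the attention-time decoder see exactly the same values and error cannot accumulate past the next I-frame.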