@@ -14,13 +14,14 @@ Embeddable LLM inference in pure C. Also ships as [**quant.h**](#single-header-mode)
 
 ## What quant.cpp does
 
-**~4x longer context on the same hardware.** KV cache compression reduces per-token memory by 3.8x, extending context proportionally.
+**~7x longer context on the same hardware.** KV cache compression reduces per-token memory by up to 6.9x, extending context proportionally.
 
 | Hardware | Model | FP16 KV | Compressed KV | Gain |
 |----------|-------|---------|---------------|------|
+| 16GB Mac | Llama 3.2 3B (Q8) | ~50K tokens | **~350K tokens** | **6.9x** |
+| 16GB Mac | Gemma 4 26B MoE | ~4K tokens | **~30K tokens** | **6.9x** |
 | 8GB Laptop | Llama 8B (Q4) | ~16K tokens | ~61K tokens | 3.8x |
 | 16GB Mac Air | SmolLM2 1.7B | ~78K tokens | ~298K tokens | 3.8x |
-| **16GB Mac** | **Gemma 4 26B-A4B** | **~8K tokens** | **~20K tokens** | **3.5x** |
 | 24GB RTX 3090 | Llama 8B (Q4) | ~147K tokens | ~559K tokens | 3.8x |
 
 *Estimates based on KV memory reduction. Actual context depends on available memory after model weights.*
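The table's figures follow from simple arithmetic: FP16 KV cache costs 2 (K and V) × n_layers × n_kv_heads × head_dim × 2 bytes per token, and compression divides that per-token cost. Below is a minimal standalone sketch of that estimate. It is not part of the quant.cpp API; the function names are hypothetical, and the Llama 3 8B GQA dimensions (32 layers, 8 KV heads, head_dim 128) and the ~2 GiB memory budget left after Q4 weights on an 8GB laptop are illustrative assumptions.

```c
#include <stdio.h>
#include <stdint.h>

/* FP16 KV cache cost per token: K and V (factor of 2), one pair per layer,
 * each holding n_kv_heads * head_dim half-precision (2-byte) values. */
static uint64_t kv_bytes_per_token_fp16(uint64_t n_layers, uint64_t n_kv_heads,
                                        uint64_t head_dim) {
    return 2 * n_layers * n_kv_heads * head_dim * 2;
}

/* Tokens that fit once the per-token KV cost is divided by `compression`. */
static uint64_t max_context_tokens(uint64_t free_bytes, uint64_t bytes_per_token,
                                   double compression) {
    return (uint64_t)((double)free_bytes * compression / (double)bytes_per_token);
}

int main(void) {
    /* Assumption: Llama 3 8B (GQA) -> 2*32*8*128*2 = 128 KiB of FP16 KV per token. */
    uint64_t per_tok = kv_bytes_per_token_fp16(32, 8, 128);
    /* Assumption: ~2 GiB free for KV cache after model weights. */
    uint64_t budget = 2ULL * 1024 * 1024 * 1024;

    printf("FP16 KV:        ~%lluK tokens\n",
           (unsigned long long)(max_context_tokens(budget, per_tok, 1.0) / 1000));
    printf("3.8x compressed: ~%lluK tokens\n",
           (unsigned long long)(max_context_tokens(budget, per_tok, 3.8) / 1000));
    return 0;
}
```

Under these assumptions the program reproduces the 8GB-laptop row: ~16K tokens at FP16, and ×3.8 ≈ 61-62K tokens compressed (the table rounds from 16K). The other rows scale the same way with their respective memory budgets and model dimensions.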