Commit 2765d5e

unamedkr and claude committed
README: update context length numbers from real benchmarks
Measured on 16GB Mac M1 Pro:
- Llama 3.2 3B: 50K → 350K tokens (6.9x with uniform_4b + Q4V)
- Gemma 4 26B MoE: 4K → 30K tokens (6.9x)
- Hero message: ~4x → ~7x longer context

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent: 528582a

1 file changed: README.md (3 additions, 2 deletions)
@@ -14,13 +14,14 @@ Embeddable LLM inference in pure C. Also ships as [**quant.h**](#single-header-m
 
 ## What quant.cpp does
 
-**~4x longer context on the same hardware.** KV cache compression reduces per-token memory by 3.8x, extending context proportionally.
+**~7x longer context on the same hardware.** KV cache compression reduces per-token memory by up to 6.9x, extending context proportionally.
 
 | Hardware | Model | FP16 KV | Compressed KV | Gain |
 |----------|-------|---------|---------------|------|
+| 16GB Mac | Llama 3.2 3B (Q8) | ~50K tokens | **~350K tokens** | **6.9x** |
+| 16GB Mac | Gemma 4 26B MoE | ~4K tokens | **~30K tokens** | **6.9x** |
 | 8GB Laptop | Llama 8B (Q4) | ~16K tokens | ~61K tokens | 3.8x |
 | 16GB Mac Air | SmolLM2 1.7B | ~78K tokens | ~298K tokens | 3.8x |
-| **16GB Mac** | **Gemma 4 26B-A4B** | **~8K tokens** | **~20K tokens** | **3.5x** |
 | 24GB RTX 3090 | Llama 8B (Q4) | ~147K tokens | ~559K tokens | 3.8x |
 
 *Estimates based on KV memory reduction. Actual context depends on available memory after model weights.*
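
For intuition, here is a minimal back-of-the-envelope sketch (in C, matching the project's language, but not code from the repository) of how per-token KV cache size translates into maximum context length. The Llama 3.2 3B geometry (28 layers, 8 KV heads, head_dim 128) and the ~5.7 GB KV memory budget are assumptions chosen to reproduce the ~50K-token FP16 row above; the 6.9x factor is the compression reported in the commit.

```c
#include <stdio.h>

int main(void) {
    /* Assumed Llama 3.2 3B geometry (not taken from quant.cpp):
     * 28 layers, 8 KV heads (GQA), head_dim 128, FP16 = 2 bytes/element. */
    const double layers = 28, kv_heads = 8, head_dim = 128;
    const double kv_per_token_fp16 = 2.0 /* K and V */ * layers * kv_heads
                                   * head_dim * 2.0;   /* ~112 KiB/token */

    const double kv_budget   = 5.7e9; /* bytes left for KV after weights (assumed) */
    const double compression = 6.9;   /* reduction reported in the commit */

    printf("FP16 KV per token:        ~%.0f KiB\n", kv_per_token_fp16 / 1024);
    printf("Max context @ FP16:       ~%.0fK tokens\n",
           kv_budget / kv_per_token_fp16 / 1e3);
    printf("Max context @ compressed: ~%.0fK tokens\n",
           kv_budget * compression / kv_per_token_fp16 / 1e3);
    return 0;
}
```

With these assumed numbers this prints roughly ~50K tokens at FP16 and ~343K tokens compressed, consistent with the table: a 6.9x smaller per-token KV footprint scales the token budget by the same factor, which is all the "extending context proportionally" claim says.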
