Commit 0fa8cec

unamedkr and claude committed
docs: update README with measured 128K context data (M1 Pro 16GB)
Replace theoretical context table with REAL measured RSS:

- Llama 3.2 3B + 128K context = 9.5 GB (6.4x compression)
- FP32 at 128K would need ~30 GB → OOM on 16GB Mac
- Generation speed: 6.6 tok/s at 16K (same as FP32)
- Hero stat: "128K context on 16GB Mac"

All numbers measured with /usr/bin/time -l on Apple M1 Pro.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 5c79c8d commit 0fa8cec
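The commit message's "FP32 at 128K would need ~30 GB" figure can be sanity-checked from the model's shape. A minimal sketch, assuming the public Llama 3.2 3B configuration (28 layers, 8 grouped-query KV heads of dimension 128 — these architecture numbers come from the model card, not from this commit):

```python
# Back-of-envelope check of the "FP32 at 128K would need ~30 GB" claim.
# Architecture values assumed from the public Llama 3.2 3B config, not this repo:
LAYERS, KV_HEADS, HEAD_DIM = 28, 8, 128     # GQA: 8 KV heads per layer
FP32_BYTES = 4
CTX = 131072                                # 128K tokens

# Both K and V tensors are cached for every layer, for every token:
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * FP32_BYTES   # 229,376 B
fp32_kv_gb = bytes_per_token * CTX / 1e9
print(f"FP32 KV at 128K: {fp32_kv_gb:.1f} GB")    # ~30 GB, > 16 GB RAM -> OOM
print(f"At 6.4x compression: {fp32_kv_gb / 6.4:.1f} GB of KV")    # ~4.7 GB
```

The ~4.7 GB of compressed KV plus the 3B model's weights is consistent with the 9.5 GB total RSS the commit reports.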

1 file changed: README.md (12 additions & 12 deletions)
````diff
@@ -14,7 +14,7 @@
 <tr>
 <td align="center"><b>3x less memory</b><br>same quality</td>
 <td align="center"><b>13% faster</b><br>than uncompressed</td>
-<td align="center"><b>32K context</b><br>on 8GB Mac</td>
+<td align="center"><b>128K context</b><br>on 16GB Mac</td>
 <td align="center"><b>16K LOC</b><br>zero deps</td>
 </tr>
 </table>
@@ -82,21 +82,21 @@ m = Model("model.gguf", progressive=True) # ← FP32 quality, 3x less memory, 1
 
 ---
 
-## Your 8GB Mac Just Got 32K Context
+## 128K Context on 16GB Mac — Measured
 
-KV compression isn't just smaller — it's **13% faster** (NEON `vqtbl1q_s8` table-lookup attention).
+Llama 3.2 3B with 6.4x KV compression. **Real RSS measured on M1 Pro 16GB:**
 
-| Context | FP32 KV (8GB Mac) | KV compressed (8GB Mac) |
-|---:|---|---|
-| 4K | OK | **OK (+13% faster)** |
-| 16K | borderline | **OK** |
-| **32K** | **OOM** | **5.5 GB — fits** |
-| 64K | OOM | 16GB Mac OK |
-| 128K | OOM | 16GB Mac OK |
+| Context | FP32 KV | **quant.cpp 6.4x** | Savings | Speed |
+|---:|---:|---:|---:|---:|
+| 16K | 8.5 GB | **6.5 GB** | **-2.0 GB** | 6.6 tok/s |
+| 32K | 9.6 GB | **8.2 GB** | **-1.4 GB** | 4.9 tok/s |
+| 65K | | **8.5 GB** | | 1.6 tok/s |
+| **128K** | **OOM** | **9.5 GB** | | 0.8 tok/s |
+
+128K context with a 3B model in 9.5 GB. Generation speed is the same as FP32 (6.6 vs 6.5 tok/s at 16K).
 
 ```python
-m = Model("llama-3b.gguf", context_length=32768) # 3x compressed KV
-m = Model("llama-3b.gguf", context_length=32768, progressive=True) # + FP32 quality
+m = Model("llama-3b.gguf", aggressive=True, context_length=131072) # 128K in 9.5 GB
 ```
 
 ---
````
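The diff cites a 6.4x compression ratio but does not show quant.cpp's actual KV scheme. As a hedged illustration of the general technique, here is a per-block symmetric int8 quantizer of the kind commonly used for KV caches (hypothetical code, not from the repo):

```python
def quantize_block(xs, qmax=127):
    """Symmetric int8 quantization: one FP32 scale per block of values.
    Illustrative only -- not quant.cpp's actual layout."""
    scale = max(abs(v) for v in xs) / qmax
    if scale == 0:
        scale = 1.0                       # all-zero block: any scale works
    return [round(v / scale) for v in xs], scale

def dequantize_block(qs, scale):
    return [q * scale for q in qs]

# One 32-value block: 32 x 4 B of FP32 becomes 32 x 1 B of int8 + 4 B scale,
# i.e. 128 B -> 36 B (~3.6x). Smaller code widths (e.g. 4-bit) and shared
# scales are how schemes reach ratios like the 6.4x in the table above.
block = [0.25 * i - 2.0 for i in range(32)]
qs, scale = quantize_block(block)
restored = dequantize_block(qs, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))   # <= scale / 2
```

The worst-case reconstruction error is half a quantization step (`scale / 2`), which is why per-block scales matter: one outlier only coarsens its own block, not the whole cache.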
