
Commit 5bf3e8e

unamedkr and claude committed
docs: honest speed comparison with llama.cpp
Add a benchmark table to 'How It Compares' section that states measured speeds vs llama.cpp on 4 representative models, and honestly names where we match (Q8_0 on 3B+) and where we lag (Q4_K_M, tiny models).

Rationale: the README previously only compared KV compression quality, implying we were competitive on inference speed too. In reality:

- We match llama.cpp on Llama 3.2 3B Q8_0 (105%)
- We're at 52% on Phi-3.5 Q8_0
- We're at 18% on Phi-3.5 Q4_K_M (Q3_K still scalar)
- We're at 13% on Llama 3.2 1B (llama.cpp is extremely optimized for tiny models)

Also corrects '16K LOC' to accurate '17.6K LOC'.

Being honest about limitations strengthens the "read it end-to-end" value proposition rather than claiming parity on every axis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent db0af26 commit 5bf3e8e

1 file changed: README.md (19 additions, 2 deletions)
@@ -8,15 +8,15 @@
 <p align="center">
 Chunking was a workaround for small context windows. We just made it unnecessary.<br>
 6.4× KV compression brings full-document understanding to consumer hardware.<br>
-<code>pip install quantcpp</code> — 16K lines of C, zero dependencies.
+<code>pip install quantcpp</code> — 17.6K lines of C, zero dependencies.
 </p>

 <table align="center">
 <tr>
 <td align="center"><b>7/7 vs 0/7</b><br>Beyond RAG measured</td>
 <td align="center"><b>6.4x compression</b><br>+3% PPL</td>
 <td align="center"><b>128K context</b><br>on 16GB Mac</td>
-<td align="center"><b>16K LOC</b><br>zero deps</td>
+<td align="center"><b>17.6K LOC</b><br>zero deps</td>
 </tr>
 </table>

@@ -401,6 +401,23 @@ On a 16GB Mac with Llama 3.2 3B: llama.cpp maxes out at ~50K tokens (FP16 KV). q

 Both are per-block methods. The quality gap comes from block size (128 vs 32), min-max range encoding, independent K/V treatment, and delta compression — not from a fundamental design flaw in llama.cpp. At ~1.6x compression, llama.cpp Q8+Q5 is excellent. quant.cpp targets the **4-7x range** where the difference matters.
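The block-level min-max scheme described in that paragraph is simple enough to sketch in a few lines of C. The following is an illustration of the idea at block size 128, not quant.cpp's actual layout or API; the `kv_block_t` struct and the function names are invented for this example.

```c
#include <math.h>
#include <stdint.h>

#define KV_BLOCK 128                /* one scale/offset pair per 128 values */

typedef struct {
    float   min;                    /* block minimum (range offset) */
    float   scale;                  /* (max - min) / 255 */
    uint8_t q[KV_BLOCK];            /* 8-bit codes */
} kv_block_t;                       /* hypothetical layout, for illustration only */

/* Quantize one block of floats with min-max range encoding. */
static void kv_block_quantize(const float *x, kv_block_t *b) {
    float lo = x[0], hi = x[0];
    for (int i = 1; i < KV_BLOCK; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    b->min   = lo;
    b->scale = (hi - lo) / 255.0f;
    float inv = b->scale > 0.0f ? 1.0f / b->scale : 0.0f;
    for (int i = 0; i < KV_BLOCK; i++) {
        long v = lroundf((x[i] - lo) * inv);
        b->q[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}

/* Dequantize a single element: x ≈ min + scale * q[i]. */
static float kv_block_get(const kv_block_t *b, int i) {
    return b->min + b->scale * (float)b->q[i];
}
```

Independent K/V treatment and delta compression would sit on top of a primitive like this; the sketch shows only the range encoding itself.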

+### vs llama.cpp: Inference speed (honest numbers)
+
+Generation throughput, 50 tokens, 4 threads, CPU-only, Apple M1 Pro:
+
+| Model | quant.cpp (tok/s) | llama.cpp (tok/s) | Ratio |
+|:--|:-:|:-:|:-:|
+| Llama 3.2 3B Q8_0 | **13.3** | 12.6 | **105%** |
+| Phi-3.5-mini Q8_0 | 4.0 | 7.7 | 52% |
+| Phi-3.5-mini Q4_K_M | 2.9 | 16.0 | 18% |
+| Llama 3.2 1B Q8_0 | 38.3 | 289.7 | 13% |
+
+**Where we match**: Q8_0 on 3B-class models — our NEON int8×int8 fused dot product is competitive with llama.cpp's hand-tuned assembly.
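The NEON int8×int8 fused dot is the kernel that decides Q8_0 throughput. As a rough picture of what such a kernel involves, here is a minimal AArch64 NEON dot product over int8 codes; the real path would also fold the per-block scales into the accumulation, and `dot_i8_neon` is a hypothetical name.

```c
#include <arm_neon.h>
#include <stdint.h>

/* Dot product of two int8 vectors of length n (n a multiple of 16),
 * accumulated in int32: the core of a Q8_0 x Q8_0 matmul tile. */
static int32_t dot_i8_neon(const int8_t *a, const int8_t *b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        /* widen to int16 products, then pairwise-accumulate into int32 lanes */
        int16x8_t lo = vmull_s8(vget_low_s8(va),  vget_low_s8(vb));
        int16x8_t hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
        acc = vpadalq_s16(acc, lo);
        acc = vpadalq_s16(acc, hi);
    }
    return vaddvq_s32(acc);  /* horizontal sum of the 4 lanes */
}
```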
+
+**Where we lag**: Q4_K_M (mixed Q2_K/Q3_K/Q4_K) — llama.cpp has years of assembly tuning on these K-quant types. We have NEON for Q4_K and Q2_K but not Q3_K, and our implementations are ~2-5× slower than llama.cpp's tuned code.
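For contrast, a type without a NEON kernel falls back to scalar element-wise work. The sketch below uses a generic nibble-packed 4-bit layout with one scale and offset per block, not ggml's actual Q3_K/Q4_K super-block format, purely to show the shape of such a scalar fallback.

```c
#include <stdint.h>

/* Scalar dequant of a nibble-packed 4-bit block: two codes per byte,
 * one float scale and offset per block. Generic layout for illustration,
 * not the real K-quant super-block format. */
static void dequant_q4_scalar(const uint8_t *packed, float scale, float min,
                              float *out, int n) {
    for (int i = 0; i < n / 2; i++) {
        uint8_t byte = packed[i];
        out[2 * i]     = min + scale * (float)(byte & 0x0F);  /* low nibble  */
        out[2 * i + 1] = min + scale * (float)(byte >> 4);    /* high nibble */
    }
}
```

A vectorized version processes 16 or 32 codes per iteration instead of one, which is where a multiple-times gap against tuned kernels can open up.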
+
+**Why we still exist**: llama.cpp is ~500K LOC (C++/CUDA/Metal/Vulkan). quant.cpp is ~17.6K LOC of C with zero dependencies. If you want the fastest inference, use llama.cpp. If you want something you can read end-to-end, embed in a single `.c` file, or use as a research platform for new KV compression methods — we're the alternative.
+
 ### vs other TurboQuant implementations

 | | quant.cpp | turbo-quant (Rust) | turboquant-pytorch | scos-lab/turboquant |
