
Commit fdf99ad

unamedkr and claude committed
docs(bench): expand speed comparison with Gemma 4 E4B and correctness notes
Updated speed comparison table with:
- Gemma 4 E4B Q8_0: 14% of llama.cpp (newly unblocked after PLE buffer fix)
- Gemma 4 E4B Q4_0: 10% of llama.cpp
- Gemma 4 E2B Q8_0: 9% (large-vocab lm_head is a consistent bottleneck)
- Updated existing numbers to match current measurements

Added 'Known correctness issues' section:
- Qwen3.5-4B (qwen35 architecture, DeltaNet hybrid): forward pass runs without error but produces whitespace-only tokens. Separate debug needed.

Added 'Large vocab models' to the 'Where we lag' section — Gemma 4's 262K vocab makes the lm_head matmul dominate, and llama.cpp's tiled Q8_0 matmul has a significant advantage there.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d91a8b2 commit fdf99ad

1 file changed


File tree

README.md

Lines changed: 17 additions & 9 deletions
```diff
@@ -403,18 +403,26 @@ Both are per-block methods. The quality gap comes from block size (128 vs 32), m

 ### vs llama.cpp: Inference speed (honest numbers)

-Generation throughput, 50 tokens, 4 threads, CPU-only, Apple M1 Pro:
+Generation throughput, 30 tokens, 4 threads, CPU-only, Apple M1 Pro:

 | Model | quant.cpp | llama.cpp | Ratio |
 |:--|:-:|:-:|:-:|
-| Llama 3.2 3B Q8_0 | **13.3 tok/s** | 12.6 tok/s | **105%**|
-| Phi-3.5-mini Q8_0 | 4.0 tok/s | 7.7 tok/s | 52% |
-| Phi-3.5-mini Q4_K_M | 2.9 tok/s | 16.0 tok/s | 18% |
-| Llama 3.2 1B Q8_0 | 38.3 tok/s | 289.7 tok/s | 13% |
-
-**Where we match**: Q8_0 on 3B-class models — our NEON int8×int8 fused dot is competitive with llama.cpp's hand-tuned assembly.
-
-**Where we lag**: Q4_K_M (mixed Q2_K/Q3_K/Q4_K) — llama.cpp has years of assembly tuning on these K-quant types. We have NEON for Q4_K and Q2_K but not Q3_K, and our implementations are ~2-5× slower than llama.cpp's tuned code.
+| Llama 3.2 3B Q8_0 | **10.2 tok/s** | 13.5 tok/s | **75%**|
+| Phi-3.5-mini Q8_0 | 5.0 tok/s | 9.8 tok/s | 51% |
+| Phi-3.5-mini Q4_K_M | 2.7 tok/s | 17.5 tok/s | 15% |
+| Gemma 4 E2B Q8_0 | 16.3 tok/s | 175.8 tok/s | 9% |
+| Gemma 4 E4B Q8_0 | 4.8 tok/s | 34.8 tok/s | 14% |
+| Gemma 4 E4B Q4_0 | 4.8 tok/s | 50.5 tok/s | 10% |
+
+**Where we're competitive**: Q8_0 on mid-size (3B-class) models — our NEON int8×int8 fused dot path approaches llama.cpp's hand-tuned assembly (75% on Llama 3.2 3B).
+
+**Where we lag (2-10×)**:
+- **Q4_K_M** (mixed Q2_K/Q3_K/Q4_K) — llama.cpp has years of assembly tuning on K-quant types. We have NEON for Q4_K and Q2_K but not Q3_K.
+- **Large vocab models** (Gemma 4, 262K vocab) — the lm_head matmul alone is 2560×262144. llama.cpp's Q8_0 matmul benefits disproportionately from CPU-specific tiling we haven't implemented.
+- **Tiny models** (<1B) — llama.cpp's scheduling and batch overhead is optimized for cases where per-matmul work is small.
+
+**Known correctness issues**:
+- **Qwen3.5-4B** (`architecture = qwen35`, DeltaNet hybrid) — forward pass runs without errors but produces whitespace-only output. llama.cpp produces `<think>...` reasoning. Root cause unidentified.

 **Why we still exist**: llama.cpp is ~500K LOC (C++/CUDA/Metal/Vulkan). quant.cpp is ~17.6K LOC of C with zero dependencies. If you want the fastest inference, use llama.cpp. If you want something you can read end-to-end, embed in a single `.c` file, or use as a research platform for new KV compression methods — we're the alternative.

```
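
For readers curious what the "NEON int8×int8 fused dot" referenced in the table above means in practice, here is a minimal, hypothetical sketch of a fused Q8_0-style block dot product using ARM NEON intrinsics. The struct layout, names, and loop structure are illustrative assumptions for this note, not quant.cpp's (or llama.cpp's) actual implementation.

```c
/* Hypothetical sketch: fused int8×int8 dot product over one pair of
 * Q8_0-style blocks (32 int8 values + a float scale).
 * Illustrative only; not quant.cpp's actual code. */
#include <arm_neon.h>
#include <stdint.h>

typedef struct {
    float  scale;   /* per-block dequantization scale (assumed layout) */
    int8_t q[32];   /* 32 quantized int8 weights or activations */
} block_q8_sketch;

static inline float q8_block_dot(const block_q8_sketch *a,
                                 const block_q8_sketch *b) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < 32; i += 16) {
        int8x16_t va = vld1q_s8(a->q + i);
        int8x16_t vb = vld1q_s8(b->q + i);
        /* int8 x int8 -> int16 products, pairwise-accumulated into int32 lanes */
        int16x8_t lo = vmull_s8(vget_low_s8(va),  vget_low_s8(vb));
        int16x8_t hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
        acc = vpadalq_s16(acc, lo);
        acc = vpadalq_s16(acc, hi);
    }
    /* "fused": the whole block stays in integer registers; only one float
     * multiply (the combined scales) happens per 32-element block. */
    return (float)vaddvq_s32(acc) * a->scale * b->scale;
}
```

The point of such a path is that dequantization never materializes per-element floats, which is why a small C implementation can stay within striking distance of hand-tuned assembly on Q8_0.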
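
On the large-vocab point, a rough back-of-envelope estimate (using the 2560×262144 lm_head size quoted above and Q8_0's 34 bytes per 32-weight block) shows why that single matmul dominates per-token cost:

```latex
% Per generated token, the lm_head projection alone:
\[
  2560 \times 262144 \approx 6.7 \times 10^{8} \ \text{multiply-adds},
  \qquad
  6.7 \times 10^{8} \times \tfrac{34}{32} \ \text{bytes} \approx 7.1 \times 10^{8}
  \ \text{bytes} \approx 0.7\ \text{GB of Q8\_0 weights streamed}.
\]
```

At batch size 1 this is effectively one enormous matrix-vector product read straight from memory on every token, so the quality of the Q8_0 unpack/accumulate inner loop (and any cache-aware tiling around it) sets the throughput ceiling.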
