Skip to content

Commit 0b3b958

Browse files
unamedkrclaude
andcommitted
HONEST: re-baseline all Llama numbers without Metal (Issue #16)
P3 (Metal compute graph) investigation revealed that the existing Metal backend is currently NET NEGATIVE on every model size we tested. The CMake default is TQ_BUILD_METAL=OFF, so end users were always getting the fast path — but our internal benchmarks built with -DTQ_BUILD_METAL=ON have been understating speed by 14-22% for the last several releases. Measurements (3 runs each, Llama 3.2 3B Instruct PPL eval): Build | KV type | tok/s ------------ | ------------ | ------ Metal ON | fp32 | 15.07 Metal OFF | fp32 | 17.87 (+19%) Metal ON | turbo_kv_4b | 14.17 Metal OFF | turbo_kv_4b | 16.53 (+17%) Metal ON | turbo_kv_5b | 13.43 Metal OFF | turbo_kv_5b | 15.33 (+14%) Cross-model: Model | Metal-OFF win -------------- | ------------- SmolLM2 135M | neutral Llama 3.2 1B | +13-17% Llama 3.2 3B | +14-22% Gemma 4 26B | +40% The Metal backend's per-matmul dispatch + commit + waitUntilCompleted pattern has overhead that exceeds the GPU benefit at batch-1 inference, even on the largest model we tested. This is the same dispatch-overhead issue that killed our previous full-compute-graph attempts. Updated: - README.md / README.ko.md: re-baseline tables and ASCII charts with the honest CPU-only numbers (Llama 3.2 3B FP32: 14.83 → 18.13 tok/s, turbo_kv_4b: 13.57 → 16.60 tok/s) - Cross-size validation table includes the no-Metal speed numbers - Build note explains the Metal trade-off and links to issue #16 - The relative gap (−7~−9% vs fp32) stays the same — both paths got the same +20% boost, so the conclusions about Pareto rankings are unchanged Filed Issue #16 documenting the investigation, action items (profile dispatch overhead, find threshold or remove), and out-of-scope items (don't add new Metal kernels until existing path is fixed). Reference: this is the THIRD honest correction in the v0.6.x series: 1. v0.6.0 'lossless 7x' → '+6.3% PPL' 2. v0.6.4 'turbo_kv beats fp32' → '-7% vs fp32 NEON' 3. v0.6.5 'measurements with Metal' → 'measurements without Metal' Each correction was caught before publishing widely. Validation works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e21cd43 commit 0b3b958

2 files changed

Lines changed: 31 additions & 29 deletions

File tree

README.ko.md

Lines changed: 9 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -43,19 +43,20 @@ LLM 메모리의 병목은 모델 가중치가 아니라 **KV 캐시**입니다.
4343

4444
> **같은 하드웨어. 4–7배 긴 컨텍스트. PPL 측정 + 공개.**
4545
46-
### Llama 3.2 3B Instruct — PPL × 속도 (FP32 KV NEON = 13.56 PPL @ 14.8 tok/s)
46+
### Llama 3.2 3B Instruct — PPL × 속도 (FP32 KV = 13.56 PPL @ 18.13 tok/s)
4747

4848
> 9 라운드 Karpathy 루프로 quant-KV vs FP32-KV 속도 격차를 **−45%에서 −8%로** 줄였습니다. 5.8–7.1× 메모리 압축. fp32 raw 속도를 능가하지는 못하지만 **8% 이내까지 따라잡음.**
4949
5050
| KV 설정 | 블록 바이트 | 압축 | PPL | Δ vs FP32 | tok/s | vs FP32 속도 |
5151
|:--------|----:|----:|----:|----:|----:|----:|
52-
| FP32 reference (NEON) ||| 13.56 || 14.83 | baseline |
53-
| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **13.13** | **−11.5%** |
54-
| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 12.7 | −14% |
55-
| **`turbo_kv_4b`** ⭐ 기본 | **72** | **7.1×** | **14.33** | **+5.7%** | **13.67** | **−7.8%** |
56-
| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 13.4 | −9.6% |
57-
| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.17 | +4.5% | 9.3 | −37% |
58-
| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 11.7 | −21% |
52+
| FP32 reference ||| 13.56 || **18.13** | baseline |
53+
| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **15.43** | **−14.9%** |
54+
| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 15.20 | −16.2% |
55+
| **`turbo_kv_4b`** ⭐ 기본 | **72** | **7.1×** | **14.33** | **+5.7%** | **16.60** | **−8.4%** |
56+
| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 15.77 | −13.0% |
57+
| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 13.27 | −26.8% |
58+
59+
**빌드 노트**: 위 숫자는 CMake 기본값 `TQ_BUILD_METAL=OFF` (CPU-only)에서 측정. Metal 활성화 시 14-22% 더 느립니다 — batch-1 추론에서 dispatch overhead가 GPU 이득을 능가. CMake 기본값이 OFF이므로 사용자는 자동으로 빠른 path를 받습니다. [Issue #16](https://github.com/quantumaikr/quant.cpp/issues/16) 참고.
5960

6061
`turbo_kv_4b` (기본)와 `turbo_kv_5b` (quality)가 Pareto 추천: **5.8–7.1× 메모리 압축 + FP32 KV 92% 속도.** 전체 Karpathy 루프 이력: [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
6162

README.md

Lines changed: 22 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -43,46 +43,47 @@ LLM memory is dominated by the **KV cache**, not model weights. At 32K context,
4343

4444
> **Same hardware. 4–7x longer context. PPL measured and disclosed.**
4545
46-
### Llama 3.2 3B Instruct — PPL × Speed (FP32 KV NEON = 13.56 PPL @ 14.8 tok/s)
46+
### Llama 3.2 3B Instruct — PPL × Speed (FP32 KV = 13.56 PPL @ 18.13 tok/s)
4747

4848
> 9 rounds of Karpathy iteration closed the quant-KV speed gap to FP32 KV from **−45% to −8%**, while delivering 5.8–7.1× memory compression. We do not (yet) beat fp32 in raw speed, but we get within 8% of it for ~7× less memory.
4949
5050
| KV Config | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
5151
|:----------|------------:|------------:|----:|----------:|------:|--------------:|
52-
| FP32 reference (NEON) ||| 13.56 || 14.83 | baseline |
53-
| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **13.13** | **−11.5%** |
54-
| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 12.7 | −14% |
55-
| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.33** | **+5.7%** | **13.67** | **−7.8%** |
56-
| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 13.4 | −9.6% |
57-
| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.17 | +4.5% | 9.3 | −37% |
58-
| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 11.7 | −21% |
52+
| FP32 reference ||| 13.56 || **18.13** | baseline |
53+
| **`turbo_kv_5b`** 🏆 quality | 88 | 5.8× | **13.65** | **+0.7%** | **15.43** | **−14.9%** |
54+
| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.90 | +2.5% | 15.20 | −16.2% |
55+
| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | **14.33** | **+5.7%** | **16.60** | **−8.4%** |
56+
| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 15.77 | −13.0% |
57+
| `uniform_4b` | 68 | 7.5× | 14.60 | +7.7% | 13.27 | −26.8% |
5958
| llama.cpp `q4_0` KV (lit.) | ~70 | ~7.3× | ~14.99 | +10.6% |||
6059

60+
**Build note**: Numbers above are with CMake default `TQ_BUILD_METAL=OFF` (CPU-only). We previously published numbers with Metal enabled (commits before `2026-04-08`); those numbers were 14–22% slower on this hardware because the existing Metal matmul dispatch path has per-op overhead that exceeds the GPU benefit at batch-1 inference. CMake default is `OFF` — users get the fast CPU-only path automatically. See [issue #16](https://github.com/quantumaikr/quant.cpp/issues/16) for the Metal investigation.
61+
6162
```
6263
PPL Degradation vs FP32 Speed vs FP32 KV
6364
(lower is better) (higher is better)
6465
65-
turbo_kv_5b │█ +0.7% █████████ −11.5%
66-
turbo_kv_4bo │██▌ +2.5% ████████ −14%
67-
turbo_kv_4b ⭐ │█████ +5.7% ██████████ −7.8%
68-
turbo_kv_3b │█████████████ +13.3% █████████ −9.6%
69-
uniform_4b │██████ +7.7% ███████ −21%
66+
turbo_kv_5b │█ +0.7% █████████ −14.9%
67+
turbo_kv_4bo │██▌ +2.5% ████████ −16.2%
68+
turbo_kv_4b ⭐ │█████ +5.7% ██████████ −8.4%
69+
turbo_kv_3b │█████████████ +13.3% █████████ −13.0%
70+
uniform_4b │██████ +7.7% ███████ −26.8%
7071
llama.cpp q4_0 │██████████ +10.6% — (not measured)
71-
FP32 reference │ ← 0% 14.83 tok/s ←
72+
FP32 reference │ ← 0% 18.13 tok/s ←
7273
0% +5% +10% 0 25% 50% 75% 100%
7374
```
7475

7576
`turbo_kv_4b` (default) and `turbo_kv_5b` (quality) are the Pareto-optimal recommendations: **5.8–7.1× memory compression at 92% of FP32 KV speed.** Full Karpathy-loop history (9 rounds across 3 sessions) in [bench/results/turboquant_reproduction.md](bench/results/turboquant_reproduction.md).
7677

77-
### Cross-size validation (3 Llama-family models)
78+
### Cross-size validation (3 Llama-family models, all measured CPU-only)
7879

79-
| Model | turbo_kv_5b PPL Δ | turbo_kv_4b PPL Δ |
80-
|---|---:|---:|
81-
| SmolLM2 135M Instruct | +1.7% | +5.8% |
82-
| Llama 3.2 1B Instruct | **+0.7%** | +7.3% |
83-
| Llama 3.2 3B Instruct | **+0.7%** | +5.7% |
80+
| Model | FP32 baseline | turbo_kv_5b PPL Δ | turbo_kv_4b PPL Δ | turbo_kv_4b tok/s | vs FP32 speed |
81+
|---|---:|---:|---:|---:|---:|
82+
| SmolLM2 135M | 18.62 PPL @ 70.4 t/s | +1.7% | +5.8% | 60.2 | −14.5% |
83+
| Llama 3.2 1B | 16.88 PPL @ 41.1 t/s | +0.7% | +7.3% | 34.4 | −16.3% |
84+
| Llama 3.2 3B | 13.56 PPL @ 18.13 t/s | +0.7% | +5.7% | 16.60 | −8.4% |
8485

85-
`turbo_kv_5b` is consistently near-lossless across model sizes (~1% PPL Δ). `turbo_kv_4b` stays in the 5–8% range. **Recommendation**: use `turbo_kv_3b` only on models ≥ 3B parameters (the 8-level codebook is too coarse for small models — +61% PPL on Llama 3.2 1B).
86+
`turbo_kv_5b` is consistently near-lossless across model sizes (~1% PPL Δ). `turbo_kv_4b` stays in the 5–8% PPL range and runs at 84–92% of FP32 KV speed. **Recommendation**: use `turbo_kv_3b` only on models ≥ 3B parameters (the 8-level codebook is too coarse for small models — +61% PPL on Llama 3.2 1B).
8687

8788
> **About this comparison**: We previously published v0.6.3 release notes claiming `turbo_kv` beats `fp32` KV speed. That was an artifact of the fp32 attention path being unoptimized scalar — once we added NEON to the fp32 path (commit `4490c83`), the honest gap is `−7%` to `−12%`, not `+5%` to `+10%`. We've corrected the README and the v0.6.3 release notes.
8889

0 commit comments

Comments
 (0)