Commit 0b3b958
HONEST: re-baseline all Llama numbers without Metal (Issue #16)
P3 (Metal compute graph) investigation revealed that the existing
Metal backend is currently NET NEGATIVE on every model size we
tested. The CMake default is TQ_BUILD_METAL=OFF, so end users were
always getting the fast path — but our internal benchmarks built with
-DTQ_BUILD_METAL=ON have been understating speed by 14-22% for the
last several releases.
Measurements (3 runs each, Llama 3.2 3B Instruct PPL eval):
Build     | KV type     | tok/s
--------- | ----------- | ------------
Metal ON  | fp32        | 15.07
Metal OFF | fp32        | 17.87 (+19%)
Metal ON  | turbo_kv_4b | 14.17
Metal OFF | turbo_kv_4b | 16.53 (+17%)
Metal ON  | turbo_kv_5b | 13.43
Metal OFF | turbo_kv_5b | 15.33 (+14%)
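For reference, the percentages in the table can be reproduced from the raw tok/s figures. The snippet below is only a small arithmetic check; the dictionary layout and the helper name `metal_off_gain` are illustrative and not part of the repository.

```python
# tok/s values copied from the measurement table: (Metal ON, Metal OFF)
measurements = {
    "fp32": (15.07, 17.87),
    "turbo_kv_4b": (14.17, 16.53),
    "turbo_kv_5b": (13.43, 15.33),
}


def metal_off_gain(on_tok_s, off_tok_s):
    """Percent throughput gain of the Metal-OFF build relative to Metal-ON."""
    return (off_tok_s / on_tok_s - 1.0) * 100.0


# Round to whole percent, matching the precision used in the table.
gains = {
    kv: round(metal_off_gain(on, off))
    for kv, (on, off) in measurements.items()
}

for kv, gain in gains.items():
    print(f"{kv}: +{gain}%")
```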
Cross-model:
Model        | Metal-OFF win
------------ | -------------
SmolLM2 135M | neutral
Llama 3.2 1B | +13-17%
Llama 3.2 3B | +14-22%
Gemma 4 26B  | +40%
The Metal backend's per-matmul dispatch + commit + waitUntilCompleted
pattern has overhead that exceeds the GPU benefit at batch-1
inference, even on the largest model we tested. This is the same
dispatch-overhead issue that killed our previous full-compute-graph
attempts.
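A toy cost model may help show why a faster GPU matmul can still lose once every dispatch pays a commit + wait. This is a sketch under stated assumptions: every number below is hypothetical, chosen only to illustrate the shape of the trade-off, and none of them were measured in this investigation. The function names are illustrative as well.

```python
def per_token_time(n_matmuls, compute_s, sync_overhead_s=0.0):
    """Time for one decoded token if each matmul pays a fixed sync cost."""
    return n_matmuls * (compute_s + sync_overhead_s)


def break_even_overhead(cpu_compute_s, gpu_compute_s):
    """Largest per-dispatch overhead at which the GPU path still ties the CPU."""
    return cpu_compute_s - gpu_compute_s


# HYPOTHETICAL values, not measurements from this project.
N_MATMULS = 200          # matmuls per token
CPU_MATMUL_S = 250e-6    # CPU time per matmul
GPU_MATMUL_S = 100e-6    # GPU time per matmul (faster in isolation)
SYNC_OVERHEAD_S = 200e-6 # dispatch + commit + wait cost per matmul

cpu_token_s = per_token_time(N_MATMULS, CPU_MATMUL_S)
gpu_token_s = per_token_time(N_MATMULS, GPU_MATMUL_S, SYNC_OVERHEAD_S)
limit_s = break_even_overhead(CPU_MATMUL_S, GPU_MATMUL_S)

print(f"CPU: {1.0 / cpu_token_s:.1f} tok/s, GPU: {1.0 / gpu_token_s:.1f} tok/s")
print(f"GPU wins only if per-dispatch overhead < {limit_s * 1e6:.0f} us")
```

In this model the GPU path is slower whenever the per-dispatch overhead exceeds the per-matmul compute saving, which is the threshold the Issue #16 action items propose to profile.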
Updated:
- README.md / README.ko.md: re-baseline tables and ASCII charts with
the honest CPU-only numbers (Llama 3.2 3B FP32: 14.83 → 18.13 tok/s,
turbo_kv_4b: 13.57 → 16.60 tok/s)
- Cross-size validation table includes the no-Metal speed numbers
- Build note explains the Metal trade-off and links to issue #16
- The relative gap (−7% to −9% vs fp32) stays the same: both paths got
roughly the same boost (about +22% by the README numbers above), so the
conclusions about Pareto rankings are unchanged
Filed Issue #16 documenting the investigation, action items
(profile dispatch overhead, find threshold or remove), and out-of-scope
items (don't add new Metal kernels until existing path is fixed).
Reference: this is the THIRD honest correction in the v0.6.x series:
1. v0.6.0 'lossless 7x' → '+6.3% PPL'
2. v0.6.4 'turbo_kv beats fp32' → '-7% vs fp32 NEON'
3. v0.6.5 'measurements with Metal' → 'measurements without Metal'
Each correction was caught before publishing widely. Validation works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e21cd43 commit 0b3b958
2 files changed
Lines changed: 31 additions & 29 deletions