Commit 75719f5
PERF: revert prefetch attempt — no measurable benefit, restore clean gather

Round 8 (prefetch in gather loop) gave no measurable improvement on
Apple M1 Pro: results were within noise across 3 runs. The L1 prefetcher
already handles the strided pattern, so explicit __builtin_prefetch
hints add no value here. Reverting to keep the code clean.
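
For readers unfamiliar with the experiment being reverted, the following is a minimal sketch of a strided gather loop with an optional prefetch hint. It is not the code from this commit: `gather_rows`, its parameters, and the `prefetch_dist` knob are hypothetical names chosen for illustration. Only `__builtin_prefetch` itself is taken from the message above, and it is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical gather: copy `count` rows of `dim` floats, spaced `stride`
 * floats apart in `src`, into the contiguous buffer `dst`.
 * `prefetch_dist` is an assumed tuning knob: 0 disables the hint, N > 0
 * requests the row N iterations ahead. The hint never changes results. */
static void gather_rows(float *dst, const float *src,
                        size_t count, size_t dim, size_t stride,
                        size_t prefetch_dist)
{
    assert(stride >= dim); /* rows must not overlap */

    for (size_t i = 0; i < count; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (prefetch_dist != 0 && i + prefetch_dist < count) {
            /* rw = 0 (read), locality = 3 (keep in all cache levels) */
            __builtin_prefetch(src + (i + prefetch_dist) * stride, 0, 3);
        }
#endif
        const float *row = src + i * stride;
        for (size_t d = 0; d < dim; ++d) {
            dst[i * dim + d] = row[d];
        }
    }
}
```

Because the hint is purely advisory, both variants produce identical output. The only way to tell whether it helps is to benchmark both on the target machine, which is what the three runs above did.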

Round 9 (strided per-position attention without gather) was considered
but rejected: it would require either repeated query rotation per
position (slower) or a new traits ABI for a pre-rotated single-block
dot product (invasive). The current gather + bulk attention is the
local optimum for this attention path on Apple Silicon.

Final honest numbers (3 runs, Llama 3.2 3B PPL eval):

  fp32          14.63 t/s   baseline
  turbo_kv_4b   13.57 t/s   −7.2%    ⭐ default
  turbo_kv_3b   13.13 t/s   −10.2%
  turbo_kv_5b   12.90 t/s   −11.8%   🏆 quality

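
The percentage column reads as throughput relative to the fp32 baseline. A small sketch of that arithmetic follows. `relative_change_pct` is a hypothetical helper, not part of the repository, and values recomputed from the rounded t/s figures can differ from the reported ones in the last digit.

```c
#include <assert.h>

/* Hypothetical helper: percentage change of `variant_tps` relative to
 * `baseline_tps`. Negative values mean the variant is slower. */
static double relative_change_pct(double baseline_tps, double variant_tps)
{
    assert(baseline_tps > 0.0);
    return (variant_tps - baseline_tps) / baseline_tps * 100.0;
}
```

For example, `relative_change_pct(14.63, 13.57)` gives the slowdown of turbo_kv_4b against fp32 as a negative percentage.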
35/35 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 33b6315 commit 75719f5
1 file changed
Lines changed: 9 additions & 4 deletions
[Diff body not captured. Two hunks: original lines 1650-1660 became 1650-1665, and original lines 1667-1673 became 1672-1678.]