Commit 81c5f20
Apple AMX acceleration via Accelerate cblas_sgemv
Use the Apple Accelerate framework's cblas_sgemv for FP32 matrix-vector
multiply on Apple Silicon. Accelerate runs BLAS operations on the AMX
coprocessor, ~40% faster than hand-written NEON for large matrices.
SmolLM2 1.7B on Apple M3 (4 threads, Q4 weights):
Before (NEON only): 25.3 tok/s
After (AMX/cblas): 35.1 tok/s (+39%)
Applied to tq_matmul (FP32 path) for matrices n>=64, d>=256.
Smaller matrices still use NEON (lower dispatch overhead).
Q4/Q8 fused matmul paths unchanged (already NEON-optimized).
34/34 tests passing.
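The size-based dispatch described above can be sketched roughly as follows. This is not the code from the commit, whose diff body is not shown here. The names `matvec_f32_sketch`, `matvec_f32_fallback`, and `sketch_use_blas` are hypothetical, the scalar loop stands in for the NEON kernel, and the layout (row-major `w` with `n` output rows and `d` input columns) is an assumption about `tq_matmul`. Only the thresholds `n>=64`, `d>=256` and the use of `cblas_sgemv` come from the commit message.

```c
#include <stddef.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h> /* link with -framework Accelerate */
#define SKETCH_HAVE_ACCELERATE 1
#else
#define SKETCH_HAVE_ACCELERATE 0
#endif

/* Thresholds quoted in the commit message. */
enum { SKETCH_MIN_N = 64, SKETCH_MIN_D = 256 };

/* Portable scalar fallback; a stand-in for the NEON kernel (hypothetical). */
static void matvec_f32_fallback(float *out, const float *x, const float *w,
                                int n, int d) {
    for (int i = 0; i < n; i++) {
        const float *row = w + (size_t)i * (size_t)d;
        float acc = 0.0f;
        for (int j = 0; j < d; j++) {
            acc += row[j] * x[j];
        }
        out[i] = acc;
    }
}

/* Returns 1 when the BLAS path would be taken for an n x d matrix. */
static int sketch_use_blas(int n, int d) {
    return SKETCH_HAVE_ACCELERATE && n >= SKETCH_MIN_N && d >= SKETCH_MIN_D;
}

/* out[n] = w[n x d] * x[d], with w assumed row-major (assumption). */
static void matvec_f32_sketch(float *out, const float *x, const float *w,
                              int n, int d) {
#if SKETCH_HAVE_ACCELERATE
    if (sketch_use_blas(n, d)) {
        /* y = 1.0 * A * x + 0.0 * y, A is n x d with leading dimension d. */
        cblas_sgemv(CblasRowMajor, CblasNoTrans, n, d,
                    1.0f, w, d, x, 1, 0.0f, out, 1);
        return;
    }
#endif
    matvec_f32_fallback(out, x, w, n, d);
}
```

On non-Apple platforms the sketch always takes the fallback path, so the threshold check only matters where Accelerate is available.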
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 3d76424 commit 81c5f20
1 file changed
Lines changed: 18 additions & 0 deletions
(Diff body not captured. Added lines, in new-file numbering: 20-23, 238-240, and 243-253.)