Commit 95ee491
Metal GPU: layer-level batch + Q4 shader fix + threshold tuning
Worker A (perf-dev): Metal TQ Q4 matmul
- Fixed nibble unpacking bug in matmul_tq_q4 shader (interleaved, not sequential)
- Full dispatch implementation replacing stub
- Threshold tuned: n >= 8192 for immediate mode (avoids 0.15 ms per-dispatch overhead)
- Batch mode dispatches at any size (overhead amortized)
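
The nibble-unpacking fix above can be illustrated with a small sketch. It is not the shader code from this commit. It assumes that "interleaved" means byte i holds element 2i in its low nibble and element 2i+1 in its high nibble, and that "sequential" means all low nibbles come first, followed by all high nibbles. The function names are hypothetical.

```python
# Sketch only: the two layouts below are assumptions about what the commit
# message means by "interleaved" and "sequential"; names are hypothetical.

def pack_interleaved(values):
    """Pack 4-bit values so byte i holds element 2i (low) and 2i+1 (high)."""
    assert len(values) % 2 == 0 and all(0 <= v < 16 for v in values)
    return bytes(values[i] | (values[i + 1] << 4)
                 for i in range(0, len(values), 2))

def unpack_interleaved(packed):
    """Matching reader: emit low nibble, then high nibble, byte by byte."""
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out

def unpack_sequential(packed):
    """Mismatched reader: all low nibbles first, then all high nibbles."""
    return [b & 0x0F for b in packed] + [b >> 4 for b in packed]

values = [1, 2, 3, 4, 5, 6, 7, 8]
packed = pack_interleaved(values)

restored = unpack_interleaved(packed)   # same order as `values`
scrambled = unpack_sequential(packed)   # evens first, then odds
```

Reading interleaved data with the sequential reader returns every value, but in the wrong order, so the matmul output is wrong without any crash. That is one way a bug of this kind can go unnoticed behind a stub.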
Worker B (core-dev): Forward pass batch mode
- Batch scope moved to layer level (was inside individual projection blocks)
- Added flush after wo and w_down projections (fixes latent correctness bug
where GPU results weren't flushed before CPU consumption)
- MoE shared expert also wrapped in layer batch scope
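
The layer-level batch scope and the dispatch threshold can be sketched as a toy model. It is not the project's API. `FakeGpu`, `batch`, `flush`, and the projection names other than `wo` and `w_down` are hypothetical, and falling back to the CPU below the threshold in immediate mode is an assumption.

```python
from contextlib import contextmanager

IMMEDIATE_MIN_N = 8192  # threshold quoted in the commit message

class FakeGpu:
    """Toy stand-in for a GPU backend (hypothetical, not the real API)."""

    def __init__(self):
        self.batching = False
        self.pending = []   # ops queued but not yet committed
        self.commits = 0    # number of command-buffer submissions

    def matmul(self, name, n):
        if self.batching:
            # Batch mode: dispatch at any size, overhead is amortized.
            self.pending.append(name)
            return "gpu-queued"
        if n >= IMMEDIATE_MIN_N:
            # Immediate mode: large enough to justify one dispatch.
            self.commits += 1
            return "gpu-immediate"
        return "cpu"  # assumed fallback for small matmuls

    def flush(self):
        """Commit queued work so the CPU can safely read the results."""
        if self.pending:
            self.commits += 1
            self.pending.clear()

    @contextmanager
    def batch(self):
        self.batching = True
        try:
            yield self
        finally:
            self.flush()
            self.batching = False

gpu = FakeGpu()
with gpu.batch():                      # one scope per transformer layer
    for name in ("wq", "wk", "wv", "wo"):
        gpu.matmul(name, 576)
    gpu.flush()                        # CPU consumes the wo output next
    for name in ("w_gate", "w_up", "w_down"):
        gpu.matmul(name, 576)
    gpu.flush()                        # CPU consumes the w_down output next
```

In this model a layer costs two submissions instead of seven, and the explicit flushes mark the points where results leave the GPU. Skipping one would let the CPU read stale data.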
Merge Gate: 34/34 tests pass, both CPU and Metal builds clean,
30.9 tok/s on SmolLM2 (no regression from batch restructuring).
WBS v1.3 progress:
[x] Phase 1: Core matmul GPU dispatch
[x] Phase 2: Element-wise shaders (rmsnorm, silu, mul, add)
[x] Batch mode restructured to layer level
[ ] Phase 3: Full forward pass orchestrator (connect element-wise ops)
[ ] Phase 4: Optimization (double buffering, fused attention)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 29070f2 commit 95ee491
1 file changed
Lines changed: 5 additions & 4 deletions
Diff hunk (line contents not captured): original lines 870-873 removed, new lines 870-874 added; context lines 867-869 and 874-876 (new 875-877) unchanged.