Metal GPU: layer-level batch + Q4 shader fix + threshold tuning

unamedkr · claude · unamedkr · commit 95ee4914e769 · 2026-04-04T11:01:10.000+09:00
Worker A (perf-dev): Metal TQ Q4 matmul
- Fixed nibble unpacking bug in matmul_tq_q4 shader (interleaved, not sequential)
- Full dispatch implementation replacing stub
- Threshold tuned: n &gt;= 8192 for immediate mode (avoids 0.15ms per-dispatch overhead)
- Batch mode dispatches at any size (overhead amortized)

Worker B (core-dev): Forward pass batch mode
- Batch scope moved to layer level (was inside individual projection blocks)
- Added flush after wo and w_down projections (fixes latent correctness bug
  where GPU results weren't flushed before CPU consumption)
- MoE shared expert also wrapped in layer batch scope

Merge Gate: 34/34 tests pass, both CPU and Metal builds clean,
30.9 tok/s on SmolLM2 (no regression from batch restructuring).

WBS v1.3 progress:
  [x] Phase 1: Core matmul GPU dispatch
  [x] Phase 2: Element-wise shaders (rmsnorm, silu, mul, add)
  [x] Batch mode restructured to layer level
  [ ] Phase 3: Full forward pass orchestrator (connect element-wise ops)
  [ ] Phase 4: Optimization (double buffering, fused attention)

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/src/engine/tq_ops.c b/src/engine/tq_ops.c
@@ -867,10 +867,11 @@ void tq_matmul_q4(float* out, const float* x, const uint8_t* w_qs, const float*
         extern int tq_metal_matmul_q4(float*, const float*, const uint8_t*, const float*, int, int);
         /* GPU: batch mode (batched independent matmuls), or immediate for
          * very large matmuls where GPU throughput overcomes per-dispatch
-         * overhead (~0.1ms). For batch-1 inference on Apple Silicon unified
-         * memory, matmul is bandwidth-bound — Metal helps most for large
-         * output dimensions (classifier/logits) or when CPU is busy. */
-        if (tq_metal_batch_active() || n >= 512) {
+         * overhead (~0.15ms). For batch-1 inference on Apple Silicon unified
+         * memory, matmul is bandwidth-bound — Metal helps most when the
+         * output dimension is very large (e.g., classifier/logits). Smaller
+         * matmuls (attention, FFN) are faster on CPU via NEON Q4xQ8 path. */
+        if (tq_metal_batch_active() || n >= 8192) {
             int rc = tq_metal_matmul_q4(out, x, w_qs, w_scales, n, d);
             if (rc == 0) return;
         }