Commit 7dbd4ad

unamedkr and claude committed
Metal GPU matmul dispatch connected to forward pass
WBS v1.3 Phase 1 progress:
- tq_matmul_gguf() now dispatches to the Metal GPU for supported types (IQ2_XXS, IQ2_S, Q8_0, Q4_K) when Metal is available
- Batch mode: GPU dispatch when the transformer wraps ops in a batch
- Immediate mode: GPU for out_dim >= 512, CPU for smaller matmuls
- CPU fallback: transparent when Metal reports an unsupported type
- Both CPU and Metal builds compile clean; 34/34 tests pass

Current limitation: Q4 weights converted at load time use an internal format that does not match the GGUF Metal shaders. Native GGUF weights (IQ2, Q8_0, Q4_K_M without conversion) will trigger the GPU path.

Next: extend batch mode to cover the entire forward pass per layer, reducing per-token overhead from ~30 dispatches to 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0f6f78c commit 7dbd4ad

1 file changed: src/engine/tq_gguf_quants.c (32 additions, 3 deletions)
@@ -1691,9 +1691,38 @@ void tq_matmul_gguf(float* out, const float* x,
                     const void* weight, tq_ggml_dtype weight_type,
                     int out_dim, int in_dim)
 {
-    /* Per-matmul Metal dispatch DISABLED — slower than CPU fused dot
-     * due to dispatch overhead. MoE uses tq_metal_moe_forward() instead
-     * which batches all experts in 3 dispatches per layer. */
+    /* Metal GPU dispatch for supported GGUF types.
+     *
+     * Two modes:
+     *   1. Batch mode: when tq_metal_batch_begin_if_available() was called
+     *      by the transformer, all matmuls encode into one command buffer.
+     *      Dispatch overhead is amortized across the batch.
+     *   2. Immediate mode: for large matmuls (out_dim >= 512), the GPU
+     *      throughput beats CPU even with per-call dispatch overhead.
+     *      Small matmuls stay on CPU where fused dot is faster.
+     *
+     * Returns 0 on success, -1 if type unsupported or Metal unavailable.
+     * On -1, falls through to CPU path below. */
+#ifdef TQ_HAS_METAL
+    {
+        extern int tq_metal_available(void);
+        extern int tq_metal_matmul_gguf(float*, const float*, const void*,
+                                        tq_ggml_dtype, int, int);
+        extern int tq_metal_batch_active(void);
+
+        if (tq_metal_available()) {
+            /* In batch mode, always dispatch to GPU (overhead is amortized).
+             * In immediate mode, only for large matrices where GPU wins. */
+            int use_gpu = tq_metal_batch_active() || (out_dim >= 512);
+            if (use_gpu) {
+                int rc = tq_metal_matmul_gguf(out, x, weight, weight_type,
+                                              out_dim, in_dim);
+                if (rc == 0) return; /* GPU handled it */
+                /* rc == -1: unsupported type, fall through to CPU */
+            }
+        }
+    }
+#endif
 
     const size_t block_bytes = tq_ggml_type_size(weight_type);
     const int block_elems = tq_ggml_type_blck(weight_type);
