Commit 7dbd4ad

unamedkr and claude committed
Metal GPU matmul dispatch connected to forward pass
WBS v1.3 Phase 1 progress:
- tq_matmul_gguf() now dispatches to the Metal GPU for supported types (IQ2_XXS, IQ2_S, Q8_0, Q4_K) when Metal is available
- Batch mode: GPU dispatch when the transformer wraps ops in a batch
- Immediate mode: GPU for out_dim >= 512, CPU for smaller matmuls
- CPU fallback: transparent when Metal reports an unsupported type
- Both CPU and Metal builds compile clean; 34/34 tests pass

Current limitation: Q4 weights converted at load time use an internal format that does not match the GGUF Metal shaders. Native GGUF weights (IQ2, Q8_0, Q4_K_M without conversion) will trigger the GPU path.

Next: extend batch mode to cover the entire forward pass per layer, reducing per-token overhead from ~30 dispatches to 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0f6f78c commit 7dbd4ad

1 file changed: src/engine/tq_gguf_quants.c (32 additions, 3 deletions)
@@ -1691,9 +1691,38 @@ void tq_matmul_gguf(float* out, const float* x,
                     const void* weight, tq_ggml_dtype weight_type,
                     int out_dim, int in_dim)
 {
-    /* Per-matmul Metal dispatch DISABLED — slower than CPU fused dot
-     * due to dispatch overhead. MoE uses tq_metal_moe_forward() instead
-     * which batches all experts in 3 dispatches per layer. */
+    /* Metal GPU dispatch for supported GGUF types.
+     *
+     * Two modes:
+     *   1. Batch mode: when tq_metal_batch_begin_if_available() was called
+     *      by the transformer, all matmuls encode into one command buffer.
+     *      Dispatch overhead is amortized across the batch.
+     *   2. Immediate mode: for large matmuls (out_dim >= 512), the GPU
+     *      throughput beats CPU even with per-call dispatch overhead.
+     *      Small matmuls stay on CPU where fused dot is faster.
+     *
+     * Returns 0 on success, -1 if type unsupported or Metal unavailable.
+     * On -1, falls through to CPU path below. */
+#ifdef TQ_HAS_METAL
+    {
+        extern int tq_metal_available(void);
+        extern int tq_metal_matmul_gguf(float*, const float*, const void*,
+                                        tq_ggml_dtype, int, int);
+        extern int tq_metal_batch_active(void);
+
+        if (tq_metal_available()) {
+            /* In batch mode, always dispatch to GPU (overhead is amortized).
+             * In immediate mode, only for large matrices where GPU wins. */
+            int use_gpu = tq_metal_batch_active() || (out_dim >= 512);
+            if (use_gpu) {
+                int rc = tq_metal_matmul_gguf(out, x, weight, weight_type,
+                                              out_dim, in_dim);
+                if (rc == 0) return; /* GPU handled it */
+                /* rc == -1: unsupported type, fall through to CPU */
+            }
+        }
+    }
+#endif
 
     const size_t block_bytes = tq_ggml_type_size(weight_type);
     const int block_elems = tq_ggml_type_blck(weight_type);
