
Commit 0f6f78c

unamedkr and claude committed
Add v1.3 plan: Full Metal GPU offload for Apple Silicon

PRD v1.3: target 80+ tok/s on SmolLM2 (vs current 35 tok/s CPU)
WBS v1.3: 4 phases — core matmul → element-wise → full forward → optimize

Key architecture decisions:
- Single command buffer per token (minimal sync)
- Zero-copy weights via unified memory (no upload needed)
- CPU fallback always available (GPU is optional acceleration)
- n >= 256 threshold for GPU dispatch (small matmuls stay on CPU)

Existing Metal shaders: matmul_q4_k, matmul_q8_0, matmul_iq2_xxs, matmul_iq2_s
Missing: connecting these to the forward pass dispatch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 81c5f20 commit 0f6f78c

2 files changed

Lines changed: 195 additions & 0 deletions

docs/plan/prd/prd_v1.3.md

Lines changed: 105 additions & 0 deletions
# PRD v1.3 — Full GPU Offload (Metal/Apple Silicon)

## Overview

Inference in quant.cpp currently runs on the CPU (with AMX acceleration: 35 tok/s on M3).
ollama+MLX runs the entire forward pass on the Apple GPU, reaching 50-100+ tok/s.

The goal of v1.3: **run the entire transformer forward pass on the Metal GPU.**

## Target Performance

| Metric | Current (CPU+AMX) | Target (Metal GPU) | Reference (ollama+MLX) |
|--------|-------------------|--------------------|------------------------|
| SmolLM2 1.7B tok/s | 35 | **80+** | ~100 |
| Qwen3.5 4B tok/s | 5.4 | **20+** | ~40 |
| Latency per token | 28 ms | **<15 ms** | ~10 ms |
| GPU utilization | 0% | **>80%** | ~90% |

## Why This Is Achievable

Apple Silicon's **unified memory** is the key advantage:

- CPU and GPU share the same physical memory — no data copying needed
- mmap'd model weights can be read directly by the GPU
- Same approach as llama.cpp's Metal backend
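
Concretely, the zero-copy claim rests on `MTLBuffer`'s ability to wrap existing page-aligned memory. Below is a minimal sketch of what this could look like in `tq_metal_compute.m`; `wrap_mmap_weights` is an illustrative name, not an existing function, and the caller is assumed to keep ownership of the mapping.

```
// Sketch: expose an mmap'd weight region to the GPU without copying.
#import <Metal/Metal.h>
#include <unistd.h>

id<MTLBuffer> wrap_mmap_weights(id<MTLDevice> device, void *mapped, size_t len) {
    // newBufferWithBytesNoCopy: requires a page-aligned pointer and a
    // page-multiple length. mmap guarantees the alignment; the length is
    // rounded up, staying within the page-granular mapping.
    size_t page = (size_t)getpagesize();
    size_t aligned = (len + page - 1) & ~(page - 1);
    return [device newBufferWithBytesNoCopy:mapped
                                     length:aligned
                                    options:MTLResourceStorageModeShared
                                deallocator:nil]; // caller munmaps; Metal never frees
}
```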

## Architecture

```
Current:
token → [CPU] embed → [CPU] attn_norm → [CPU] QKV matmul → [CPU] attention
      → [CPU] FFN matmul → [CPU] output_proj → logits

Target:
token → [GPU] embed → [GPU] attn_norm → [GPU] QKV matmul → [GPU] attention
      → [GPU] FFN matmul → [GPU] output_proj → [CPU] sampling → next token
```

### Metal Compute Shaders Needed

| Shader | Input | Output | Priority |
|--------|-------|--------|----------|
| `matmul_q4_f32` | Q4 weights + FP32 vec | FP32 vec | P0 (90% of compute) |
| `matmul_f32` | FP32 weights + FP32 vec | FP32 vec | P0 |
| `rmsnorm` | FP32 vec + FP32 weights | FP32 vec | P1 |
| `rope` | FP32 Q/K + position | FP32 Q/K | P1 |
| `silu_elementwise` | FP32 gate + FP32 up | FP32 | P1 |
| `softmax` | FP32 scores | FP32 probs | P1 |
| `attention_fwd` | Q, K cache, V cache | FP32 output | P2 (fused) |
| `add_residual` | FP32 + FP32 | FP32 | P2 |

### Pipeline Design

```
One command buffer per token (minimal synchronization):

encoder.setComputePipelineState(matmul_q4_pipeline)
encoder.setBuffer(weights_q, 0)    // Q projection weights (mmap)
encoder.setBuffer(input, 1)        // normalized input
encoder.setBuffer(output_q, 2)     // Q output
encoder.dispatchThreadgroups(...)

// ... K, V projection, RoPE, attention, FFN ...

commandBuffer.commit()
commandBuffer.waitUntilCompleted() // only once, per token
```

## Key Design Decisions

1. **Single command buffer per token** — minimizes synchronization between shaders
2. **Weights stay mmap'd** — unified memory lets the GPU access them directly
3. **KV cache in GPU buffers** — `MTLBuffer` with `storageModeShared`
4. **Only sampling on CPU** — top-p sampling is inefficient on the GPU
5. **Q4 dequant on the GPU** — fused with matmul to save bandwidth

## Scope & Non-Goals

### In Scope

- Metal compute shaders for all forward pass ops
- Apple Silicon (M1-M5) support
- Q4_K_M and Q8_0 weight formats
- Single-sequence inference (batch=1)

### Out of Scope (v1.3)

- CUDA/Vulkan GPU offload (separate release)
- Batched inference
- Flash Attention
- Continuous batching
- Speculative decoding

## Risk & Mitigation

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Per-dispatch overhead > compute gain (small models) | Medium | Enable GPU only for larger models (dim >= 2048) |
| Q4 dequant shader accuracy | Low | Reference the llama.cpp Metal shaders |
| Command buffer sync bottleneck | Medium | Double buffering, async commit |

## Success Criteria

1. **60+ tok/s** on SmolLM2 1.7B (currently 35)
2. **15+ tok/s** on Qwen3.5 4B (currently 5.4)
3. No PPL change (GPU numerical accuracy matches CPU)
4. Existing CPU path preserved (fallback for environments without a GPU)
5. No impact on quant.h (GPU is full-build only)

docs/plan/wbs/wbs_v1.3.md

Lines changed: 90 additions & 0 deletions
# WBS v1.3 — Full GPU Offload (Metal)

## Phase 1: Core Metal Matmul (P0)

Move matmul, which accounts for 90% of inference time, to the GPU.

- [ ] **1.1** Metal matmul shader: FP32 weight × FP32 vector (see sketch after this list)
  - `kernel void matmul_f32(device float* w, device float* x, device float* out, uint n, uint d)`
  - Threadgroups: one threadgroup computes each output element
  - Cache the input vector in threadgroup shared memory
  - File: `src/backend/metal/shaders/matmul.metal`

- [ ] **1.2** Metal matmul shader: Q4 weight × FP32 vector, fused dequant (see sketch after this list)
  - Q4_K_M block → FP32 dequant → dot product, fused inside the shader
  - Reference: llama.cpp `ggml-metal.metal`
  - File: `src/backend/metal/shaders/matmul_q4.metal`

- [ ] **1.3** Metal dispatch wrapper for matmul
  - `tq_metal_matmul(out, x, w, n, d)` — automatic CPU/GPU selection
  - GPU buffer management (weights mmap shared, activations managed)
  - File: `src/backend/metal/tq_metal_compute.m`

- [ ] **1.4** Wire the matmul GPU path into tq_ops.c (see sketch after this list)
  - `tq_matmul()`, `tq_matmul_q4()` → conditional Metal dispatch
  - GPU only when dim >= 1024 (small matmuls are faster on CPU)

- [ ] **1.5** Benchmark: matmul-only GPU vs CPU
  - SmolLM2 1.7B: compare matmul time
  - Goal: 2x+ matmul speedup
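
A minimal sketch of the 1.1 kernel, assuming one threadgroup per output row with a partial-sum reduction in threadgroup memory (the input-vector caching noted above is omitted for brevity). `TG_SIZE`, the buffer indices, and the row-major weight layout are assumptions; the `d` dimension from the signature above is implied by the threadgroup count here.

```
#include <metal_stdlib>
using namespace metal;

#define TG_SIZE 256  // threads per threadgroup; dispatch d threadgroups

kernel void matmul_f32(device const float* w   [[buffer(0)]], // d x n, row-major
                       device const float* x   [[buffer(1)]], // input, length n
                       device float*       out [[buffer(2)]], // output, length d
                       constant uint&      n   [[buffer(3)]],
                       uint row  [[threadgroup_position_in_grid]],
                       uint lane [[thread_position_in_threadgroup]]) {
    threadgroup float partial[TG_SIZE];

    // Each thread accumulates a strided slice of the dot product.
    float acc = 0.0f;
    for (uint j = lane; j < n; j += TG_SIZE) {
        acc += w[row * n + j] * x[j];
    }
    partial[lane] = acc;

    // Tree reduction across the threadgroup.
    for (uint stride = TG_SIZE / 2; stride > 0; stride >>= 1) {
        threadgroup_barrier(mem_flags::mem_threadgroup);
        if (lane < stride) partial[lane] += partial[lane + stride];
    }
    if (lane == 0) out[row] = partial[0];
}
```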
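
For 1.2, a sketch of the fused dequant + dot product idea. To stay readable it uses a simplified Q4_0-style block (32 weights, one scale, nibbles stored with an offset of 8) and one thread per output row; the actual Q4_K_M super-block layout is more involved, see `ggml-metal.metal`.

```
#include <metal_stdlib>
using namespace metal;

struct block_q4 {      // simplified block: 32 4-bit weights + 1 scale
    half  scale;
    uchar qs[16];      // two 4-bit values per byte
};

kernel void matmul_q4(device const block_q4* w [[buffer(0)]],
                      device const float*    x [[buffer(1)]],
                      device float*        out [[buffer(2)]],
                      constant uint&         n [[buffer(3)]], // row length, multiple of 32
                      constant uint&         d [[buffer(4)]], // number of rows
                      uint row [[thread_position_in_grid]]) {
    if (row >= d) return;
    const uint blocks_per_row = n / 32;
    float acc = 0.0f;
    for (uint b = 0; b < blocks_per_row; ++b) {
        device const block_q4* blk = w + row * blocks_per_row + b;
        const float scale = (float)blk->scale;
        for (uint i = 0; i < 16; ++i) {
            // Dequantize two nibbles and consume them immediately:
            // no intermediate FP32 weight buffer touches memory.
            const uchar q = blk->qs[i];
            const float w0 = (float)((int)(q & 0x0F) - 8) * scale;
            const float w1 = (float)((int)(q >> 4)   - 8) * scale;
            acc += w0 * x[b * 32 + i] + w1 * x[b * 32 + i + 16];
        }
    }
    out[row] = acc;
}
```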
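
And for 1.4, the tq_ops.c hookup could look roughly like this. `tq_matmul` and `tq_metal_matmul` appear in the plan above; `TQ_METAL`, `tq_metal_available()`, and `tq_matmul_cpu()` are hypothetical names used only for illustration.

```
#ifdef TQ_METAL
int  tq_metal_available(void);                    /* hypothetical GPU probe */
void tq_metal_matmul(float *out, const float *x,
                     const float *w, int n, int d);
#endif
void tq_matmul_cpu(float *out, const float *x,
                   const float *w, int n, int d);  /* existing AMX path (name illustrative) */

void tq_matmul(float *out, const float *x, const float *w, int n, int d) {
#ifdef TQ_METAL
    /* Large matmuls amortize dispatch overhead; small ones stay on the CPU. */
    if (tq_metal_available() && n >= 1024) {
        tq_metal_matmul(out, x, w, n, d);
        return;
    }
#endif
    tq_matmul_cpu(out, x, w, n, d);
}
```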

## Phase 2: Element-wise Ops (P1)

Run the ops between matmuls on the GPU to eliminate CPU↔GPU synchronization.

- [ ] **2.1** RMSNorm Metal shader
  - Compute the L2 norm (reduction) + elementwise scale
  - Atomic or parallel reduction

- [ ] **2.2** RoPE Metal shader
  - Per-head rotation: cos/sin computation + complex multiply
  - Pass the position encoding in a uniform buffer

- [ ] **2.3** SiLU/GELU activation Metal shader (see sketch after this list)
  - Elementwise: `silu(x) = x * sigmoid(x)`
  - Applied to the Gate × Up projection result

- [ ] **2.4** Softmax Metal shader (see sketch after this list)
  - Reduction for max → subtract → exp → reduction for sum → divide
  - Applied to attention scores

- [ ] **2.5** Add/Residual Metal shader
  - Elementwise add (trivial but needed to stay on GPU)
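
A sketch of the 2.3 kernel, one thread per element; the length guard assumes the grid may be padded to a threadgroup multiple, and the buffer slots are illustrative.

```
#include <metal_stdlib>
using namespace metal;

kernel void silu_elementwise(device const float* gate [[buffer(0)]],
                             device const float* up   [[buffer(1)]],
                             device float*       out  [[buffer(2)]],
                             constant uint&      len  [[buffer(3)]],
                             uint i [[thread_position_in_grid]]) {
    if (i >= len) return;
    // silu(g) = g * sigmoid(g), then gate the up projection.
    const float g = gate[i];
    out[i] = (g / (1.0f + exp(-g))) * up[i];
}
```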
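
And a sketch of 2.4 for a single attention row, assuming the whole row is handled by one threadgroup (reasonable at batch=1 with modest context lengths); `TG_SIZE` and the in-place layout are assumptions.

```
#include <metal_stdlib>
using namespace metal;

#define TG_SIZE 256  // dispatch exactly one threadgroup of TG_SIZE threads

kernel void softmax(device float*  scores [[buffer(0)]], // in place, length len
                    constant uint& len    [[buffer(1)]],
                    uint lane [[thread_position_in_threadgroup]]) {
    threadgroup float shared[TG_SIZE];

    // 1) Reduction for the max (numerical stability).
    float m = -INFINITY;
    for (uint i = lane; i < len; i += TG_SIZE) m = max(m, scores[i]);
    shared[lane] = m;
    for (uint s = TG_SIZE / 2; s > 0; s >>= 1) {
        threadgroup_barrier(mem_flags::mem_threadgroup);
        if (lane < s) shared[lane] = max(shared[lane], shared[lane + s]);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
    const float row_max = shared[0];

    // 2) Subtract, exponentiate, and reduce for the sum.
    float sum = 0.0f;
    for (uint i = lane; i < len; i += TG_SIZE) {
        const float e = exp(scores[i] - row_max);
        scores[i] = e;
        sum += e;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup); // all reads of shared[0] done
    shared[lane] = sum;
    for (uint s = TG_SIZE / 2; s > 0; s >>= 1) {
        threadgroup_barrier(mem_flags::mem_threadgroup);
        if (lane < s) shared[lane] += shared[lane + s];
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // 3) Divide.
    const float inv = 1.0f / shared[0];
    for (uint i = lane; i < len; i += TG_SIZE) scores[i] *= inv;
}
```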

## Phase 3: Full Forward Pass on GPU (P2)

Chain all ops together: one command buffer per token.

- [ ] **3.1** GPU-side KV cache
  - Allocate the KV cache as `MTLBuffer` (storageModeShared)
  - Key/Value stores and attention lookups both on GPU

- [ ] **3.2** Forward pass orchestrator
  - `tq_forward_metal()` — encodes every op into one command buffer
  - CPU fallback: automatic CPU path where Metal is unavailable

- [ ] **3.3** Embedding lookup on GPU
  - Token ID → embedding vector (GPU-side gather)

- [ ] **3.4** Output projection + sampling handoff
  - Compute logits on the GPU → transfer the result to the CPU → sampling

- [ ] **3.5** Integrated benchmark
  - E2E tok/s: SmolLM2 1.7B, Qwen3.5 4B
  - GPU utilization monitoring
  - PPL verification (must match the CPU path)

## Phase 4: Optimization (P3)

- [ ] **4.1** Double buffering — prepare the next token while the previous one is in flight
- [ ] **4.2** Fused attention kernel — QK matmul + softmax + weighted V sum in one shader
- [ ] **4.3** Batched embedding dequant — dequantize multiple rows at once

## Milestone Definitions

| Milestone | Goal | Criterion |
|-----------|------|-----------|
| M1 (Phase 1) | matmul runs on GPU | SmolLM2 matmul 2x faster |
| M2 (Phase 2) | all ops on GPU | 0 CPU↔GPU transitions per token |
| M3 (Phase 3) | E2E GPU forward | 60+ tok/s on SmolLM2 |
| M4 (Phase 4) | optimization | 80+ tok/s on SmolLM2 |
