
Commit f969ee5

unamedkr and claude committed
perf: switch to Phi-3.5-Q8_0 — 2x faster than Q4_K_M on NEON
Q8_0 (3.8GB): 3.0 tok/s — simple int8 dequant, NEON-friendly
Q4_K_M (2.2GB): 1.5 tok/s — complex super-block dequant overhead

Both produce identical quality output. Q8_0 is the better choice for Apple Silicon NEON where dequant cost dominates bandwidth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d06d0bc commit f969ee5

2 files changed

Lines changed: 6 additions & 3 deletions


bench/rlv/stages/_llm.py

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@
 # that produced garbage for Phi-3.5/SmolLM2. The unified server compiles
 # quant.h as a single translation unit — no sync issues.
 # Phi-3.5: ~1.15 tok/s (CPU NEON), ~6.5 tok/s reported in PR #79.
+# Q8_0 is 2x faster than Q4_K_M on NEON (simpler dequant, 3.0 vs 1.5 tok/s).
 DEFAULT_MODEL = REPO / "models" / "Phi-3.5-mini-instruct-Q8_0.gguf"
 DEFAULT_SERVER_BINARY = REPO / "build_metal" / "quant-server-unified"
 DEFAULT_SERVER_HOST = "127.0.0.1"
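The new comment attributes the speedup to dequantization cost. A minimal Python sketch of why Q8_0 dequant is cheap, assuming ggml's documented Q8_0 layout (blocks of 32 signed int8 weights sharing one fp16 scale); the function name is illustrative, not part of this repo:

```python
QK8_0 = 32  # weights per Q8_0 block in ggml's format

def dequant_q8_0_block(d, qs):
    """Dequantize one Q8_0 block: a single multiply per weight.

    d:  the block's scale (stored as fp16 on disk)
    qs: 32 signed int8 quantized weights

    One widening multiply per lane maps directly onto NEON int8 SIMD,
    which is why the commit calls Q8_0 "NEON-friendly". Q4_K instead
    stores 256-weight super-blocks whose packed 6-bit sub-scales must
    be unpacked before any multiply can happen.
    """
    assert len(qs) == QK8_0
    return [d * q for q in qs]
```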

docs/plan/rlv_day5_m3_plan.md

Lines changed: 5 additions & 3 deletions
@@ -70,9 +70,11 @@ Remove the TQ_NO_METAL environment variable code (no longer needed):
 
 ### 2.3 Download the Phi-3.5 model
 ```bash
-curl -L -o models/Phi-3.5-mini-instruct-Q4_K_M.gguf \
-  "https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF/resolve/main/Phi-3.5-mini-instruct-Q4_K_M.gguf"
-# ~2.2GB
+# Use Q8_0 (2x faster than Q4_K_M: 3.0 vs 1.5 tok/s on NEON)
+# Q8_0 is simple int8 dequant, efficient with NEON SIMD; Q4_K_M has a complex super-block layout
+curl -L -o models/Phi-3.5-mini-instruct-Q8_0.gguf \
+  "https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF/resolve/main/Phi-3.5-mini-instruct-Q8_0.gguf"
+# ~3.8GB
 ```
 
### 2.4 Build the unified server
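A truncated download or an HTML error page saved by `curl` would fail at server startup; every valid GGUF file begins with the four ASCII bytes `GGUF`, so a quick magic-byte check can catch this early. A hypothetical helper (the function name and check are illustrative, not part of the plan):

```python
GGUF_MAGIC = b"GGUF"  # first four bytes of every valid GGUF file

def looks_like_gguf(path):
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC

# Example, using the curl target from section 2.3:
# looks_like_gguf("models/Phi-3.5-mini-instruct-Q8_0.gguf")
```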
