
Commit 41a8441

unamedkr and claude committed
docs: update pivot plan with P0 profiling results
P0 bottleneck identified and fixed:
- Root cause: tokenizer re-parsed from GGUF (32K tokens) + KV state double-allocated on every HTTP request
- Fix: context reuse across requests (commit 6e39e64)
- Result: 2.0 → 4.5 tok/s (2.3x on warm requests)
- Remaining: tq_generate internal state recreation (separate PR)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 6e39e64 commit 41a8441

1 file changed

Lines changed: 6 additions & 5 deletions


docs/plan/pivot_plan.md

@@ -24,11 +24,12 @@
 Goal: 10+ tok/s on the same hardware (by the DFlash baseline, Phi-3.5 can reach 6.5 tok/s)
 ```
 
-### Bottleneck candidates
-1. Tokenizer reloading (is `tq_load_tokenizer_from_gguf` called on every request?)
-2. KV state reallocation
-3. Thread pool not used (is a pthread created on every request?)
-4. Unnecessary memory copies
+### Bottleneck diagnosis + fix results
+1. **Tokenizer reloading**: 32K tokens were parsed on every request → eliminated by reusing the context (commit 6e39e64)
+2. ⚠️ **KV state double reallocation**: both the server and quant_generate recreated it → only the server side is fixed; the internal tq_generate is left for a separate PR
+3. Not applicable (the thread pool is already reused)
+
+**Result: ~2.0 tok/s → ~4.5 tok/s (warm, 2.3× improvement)**
 
 ## P1: Demonstrating KV compression
 
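The fix described in this commit, reusing one context across requests instead of rebuilding the tokenizer and KV state each time, can be sketched roughly as below. This is only an illustration of the pattern, not the code from commit 6e39e64: `server_ctx_t`, `load_tokenizer`, `alloc_kv`, `handle_request`, `serve_requests`, and `free_ctx` are hypothetical names, and the vocabulary size (32000) and KV buffer size (1024 floats) are placeholder values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; none of these types exist in the real code base. */
typedef struct { int vocab_size; } tokenizer_t;
typedef struct { float *data; size_t n_floats; } kv_state_t;

typedef struct {
    tokenizer_t *tokenizer;   /* parsed once, shared by all requests   */
    kv_state_t  *kv;          /* allocated once, cleared per request   */
    int tokenizer_loads;      /* counters only make the effect visible */
    int kv_allocs;
} server_ctx_t;

/* Stand-in for the expensive vocabulary parse from the model file. */
static tokenizer_t *load_tokenizer(server_ctx_t *ctx) {
    tokenizer_t *t = malloc(sizeof *t);
    if (t == NULL) return NULL;
    t->vocab_size = 32000;    /* placeholder value */
    ctx->tokenizer_loads++;
    return t;
}

/* Stand-in for allocating the KV state. */
static kv_state_t *alloc_kv(server_ctx_t *ctx, size_t n_floats) {
    kv_state_t *kv = malloc(sizeof *kv);
    if (kv == NULL) return NULL;
    kv->data = calloc(n_floats, sizeof *kv->data);
    if (kv->data == NULL) { free(kv); return NULL; }
    kv->n_floats = n_floats;
    ctx->kv_allocs++;
    return kv;
}

/* Handle one request: build heavy state lazily, then reuse it. */
static int handle_request(server_ctx_t *ctx) {
    assert(ctx != NULL);
    if (ctx->tokenizer == NULL) {              /* cold path, first request only */
        ctx->tokenizer = load_tokenizer(ctx);
        if (ctx->tokenizer == NULL) return -1;
    }
    if (ctx->kv == NULL) {
        ctx->kv = alloc_kv(ctx, 1024);         /* placeholder size */
        if (ctx->kv == NULL) return -1;
    }
    /* Warm path: clear per-request contents without reallocating. */
    memset(ctx->kv->data, 0, ctx->kv->n_floats * sizeof *ctx->kv->data);
    /* ... token generation would run here ... */
    return 0;
}

/* Serve up to n requests on one context; returns how many succeeded. */
int serve_requests(server_ctx_t *ctx, int n) {
    int served = 0;
    for (int i = 0; i < n; i++) {
        if (handle_request(ctx) != 0) break;
        served++;
    }
    return served;
}

/* Release everything owned by the context. */
void free_ctx(server_ctx_t *ctx) {
    if (ctx->kv != NULL) { free(ctx->kv->data); free(ctx->kv); }
    free(ctx->tokenizer);
    ctx->kv = NULL;
    ctx->tokenizer = NULL;
}
```

With a zero-initialized `server_ctx_t`, calling `serve_requests(&ctx, 5)` loads the tokenizer and allocates the KV state once, on the first request; the remaining four requests take the warm path. The sketch covers only the server-side reuse; state recreated inside the generation call itself, which the commit leaves for a separate PR, is not modeled.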
