Commit 6a403cf

unamedkr and claude committed

QA fixes: docker-compose server service, docker docs, KO README links

- docker-compose.yml: add server service with quant-server entrypoint, remove unused env vars, fix misleading port mapping on CLI service
- docs/docker.md: add quant-server section, fix binary size claim
- README.ko.md: add H2H benchmark + KV landscape links to docs table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 1ac6402 commit 6a403cf

3 files changed

Lines changed: 44 additions & 16 deletions

README.ko.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -419,6 +419,8 @@ Linux, macOS, Windows (MSVC/MinGW), iOS, Android, WASM에서 동작합니다.
 |:-----|:-----|
 | **[API 레퍼런스](docs/api.md)** | quant.h + libturboquant 전체 C API (730줄) |
 | **[커스텀 양자화 가이드](docs/custom-quantization.md)** | 함수 3개로 새 KV 양자화 타입 추가 |
+| **[H2H 벤치마크](bench/head_to_head/)** | 재현 가능한 quant.cpp vs llama.cpp 비교 |
+| **[KV 압축 랜드스케이프](docs/blog/kv-cache-landscape.md)** | Eviction vs Architecture vs Compression 가이드 |
 | **[로드맵](ROADMAP.md)** | 프로젝트 방향과 계획 |
 | **[변경 이력](CHANGELOG.md)** | 버전별 릴리스 노트 |
 | **[기술 리포트](docs/papers/quant_cpp_tech_report.md)** | 아키텍처와 벤치마크 (Arxiv 초안) |
```

docker-compose.yml

Lines changed: 20 additions & 9 deletions
```diff
@@ -1,18 +1,10 @@
 services:
+  # CLI inference (one-shot)
   inference:
     build: .
     image: quant.cpp:latest
     volumes:
       - ./models:/models
-    environment:
-      # KV cache compression settings (passed as CLI args below)
-      - TQ_KV_TYPE=uniform_4b
-      - TQ_VALUE_QUANT=q4
-      - TQ_THREADS=4
-    ports:
-      - "8080:8080"
-    # Default: run model with KV compression
-    # Override command to change model path, prompt, or options
     command:
       - /models/model.gguf
       - -k
@@ -23,3 +15,22 @@ services:
       - "4"
       - -p
       - "Hello, world"
+
+  # OpenAI-compatible server (persistent)
+  server:
+    build: .
+    image: quant.cpp:latest
+    entrypoint: ["quant-server"]
+    volumes:
+      - ./models:/models
+    ports:
+      - "8080:8080"
+    command:
+      - /models/model.gguf
+      - -p
+      - "8080"
+      - -k
+      - uniform_4b
+      - -j
+      - "4"
+    restart: unless-stopped
```
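The new `server` service pins `uniform_4b` and four threads in its `command`. As a sketch (not part of this commit), a standard `docker-compose.override.yml` could swap those flags without editing the committed file; `polar_3b` here is just an alternative KV type mentioned in docs/docker.md, and the values are illustrative:

```yaml
# docker-compose.override.yml — merged automatically by `docker compose up`.
# Illustrative values only; the flag layout mirrors the committed server service.
services:
  server:
    command:
      - /models/model.gguf
      - -p
      - "8080"
      - -k
      - polar_3b    # alternative KV compression type
      - -j
      - "8"         # more worker threads
```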

docs/docker.md

Lines changed: 22 additions & 7 deletions
````diff
@@ -45,19 +45,34 @@ docker run -v ./models:/models -v ./data:/data quant.cpp \
   /models/model.gguf --ppl /data/wikitext.txt -k polar_3b -v q4
 ```
 
+## OpenAI-compatible Server
+
+Run `quant-server` in Docker for a persistent API endpoint:
+
+```bash
+docker run -v ./models:/models -p 8080:8080 \
+  --entrypoint quant-server quant.cpp \
+  /models/model.gguf -p 8080 -k uniform_4b -j 4
+
+# Test
+curl http://localhost:8080/v1/chat/completions \
+  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
+```
+
 ## Docker Compose
 
-The included `docker-compose.yml` provides a preconfigured inference service:
+The `docker-compose.yml` provides two services:
 
 ```bash
-# Place your model at ./models/model.gguf, then:
-docker compose up
+# One-shot inference
+docker compose run inference /models/model.gguf -p "Hello" -k uniform_4b -v q4
 
-# Override the prompt:
-docker compose run inference /models/model.gguf -p "Your prompt here" -k turbo_3b -v q4
+# Persistent OpenAI-compatible server
+docker compose up server
+# → http://localhost:8080/v1/chat/completions
 ```
 
-Edit `docker-compose.yml` to change the default model path, KV compression type,
+Edit `docker-compose.yml` to change the model path, KV compression type,
 or thread count.
 
 ## KV Compression Options
@@ -79,4 +94,4 @@ Models are not baked into the image. Mount them at runtime:
 
 The final image is approximately 10MB:
 - Alpine base: ~7MB
-- quant binary: ~500KB (statically linked, zero dependencies)
+- quant + quant-server binaries: ~1MB total (statically linked, zero dependencies)
````
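The curl test added to the docs can also be driven from Python with only the standard library. A minimal sketch: `build_request` mirrors the payload shown in the docs, while the response parsing assumes the usual OpenAI-style `choices[0].message.content` schema, which this commit's docs do not spell out:

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 64) -> dict:
    """Build the chat-completions payload shown in docs/docker.md."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, host: str = "http://localhost:8080") -> str:
    """POST to quant-server's OpenAI-compatible endpoint, return the reply text."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumption: standard OpenAI response shape (choices[0].message.content).
    return body["choices"][0]["message"]["content"]
```

This is equivalent to the `curl` call above once the compose `server` service is up.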
