Commit 6a403cf

unamedkr and claude committed

QA fixes: docker-compose server service, docker docs, KO README links

- docker-compose.yml: add server service with quant-server entrypoint, remove unused env vars, fix misleading port mapping on CLI service
- docs/docker.md: add quant-server section, fix binary size claim
- README.ko.md: add H2H benchmark + KV landscape links to docs table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 1ac6402 commit 6a403cf

3 files changed

Lines changed: 44 additions & 16 deletions

README.ko.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -419,6 +419,8 @@ Linux, macOS, Windows (MSVC/MinGW), iOS, Android, WASM에서 동작합니다.
 |:-----|:-----|
 | **[API 레퍼런스](docs/api.md)** | quant.h + libturboquant 전체 C API (730줄) |
 | **[커스텀 양자화 가이드](docs/custom-quantization.md)** | 함수 3개로 새 KV 양자화 타입 추가 |
+| **[H2H 벤치마크](bench/head_to_head/)** | 재현 가능한 quant.cpp vs llama.cpp 비교 |
+| **[KV 압축 랜드스케이프](docs/blog/kv-cache-landscape.md)** | Eviction vs Architecture vs Compression 가이드 |
 | **[로드맵](ROADMAP.md)** | 프로젝트 방향과 계획 |
 | **[변경 이력](CHANGELOG.md)** | 버전별 릴리스 노트 |
 | **[기술 리포트](docs/papers/quant_cpp_tech_report.md)** | 아키텍처와 벤치마크 (Arxiv 초안) |
```

docker-compose.yml

Lines changed: 20 additions & 9 deletions
```diff
@@ -1,18 +1,10 @@
 services:
+  # CLI inference (one-shot)
   inference:
     build: .
     image: quant.cpp:latest
     volumes:
       - ./models:/models
-    environment:
-      # KV cache compression settings (passed as CLI args below)
-      - TQ_KV_TYPE=uniform_4b
-      - TQ_VALUE_QUANT=q4
-      - TQ_THREADS=4
-    ports:
-      - "8080:8080"
-    # Default: run model with KV compression
-    # Override command to change model path, prompt, or options
     command:
       - /models/model.gguf
       - -k
@@ -23,3 +15,22 @@ services:
       - "4"
       - -p
       - "Hello, world"
+
+  # OpenAI-compatible server (persistent)
+  server:
+    build: .
+    image: quant.cpp:latest
+    entrypoint: ["quant-server"]
+    volumes:
+      - ./models:/models
+    ports:
+      - "8080:8080"
+    command:
+      - /models/model.gguf
+      - -p
+      - "8080"
+      - -k
+      - uniform_4b
+      - -j
+      - "4"
+    restart: unless-stopped
```
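The new `server` service pins `uniform_4b` and four threads in its `command`. As a sketch (not part of this commit), a standard `docker-compose.override.yml` could swap those flags without editing the committed file; `polar_3b` here is just an alternative KV type mentioned in docs/docker.md, and the values are illustrative:

```yaml
# docker-compose.override.yml — merged automatically by `docker compose up`.
# Illustrative values only; the flag layout mirrors the committed server service.
services:
  server:
    command:
      - /models/model.gguf
      - -p
      - "8080"
      - -k
      - polar_3b    # alternative KV compression type
      - -j
      - "8"         # more worker threads
```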

docs/docker.md

Lines changed: 22 additions & 7 deletions
````diff
@@ -45,19 +45,34 @@ docker run -v ./models:/models -v ./data:/data quant.cpp \
   /models/model.gguf --ppl /data/wikitext.txt -k polar_3b -v q4
 ```
 
+## OpenAI-compatible Server
+
+Run `quant-server` in Docker for a persistent API endpoint:
+
+```bash
+docker run -v ./models:/models -p 8080:8080 \
+  --entrypoint quant-server quant.cpp \
+  /models/model.gguf -p 8080 -k uniform_4b -j 4
+
+# Test
+curl http://localhost:8080/v1/chat/completions \
+  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
+```
+
 ## Docker Compose
 
-The included `docker-compose.yml` provides a preconfigured inference service:
+The `docker-compose.yml` provides two services:
 
 ```bash
-# Place your model at ./models/model.gguf, then:
-docker compose up
+# One-shot inference
+docker compose run inference /models/model.gguf -p "Hello" -k uniform_4b -v q4
 
-# Override the prompt:
-docker compose run inference /models/model.gguf -p "Your prompt here" -k turbo_3b -v q4
+# Persistent OpenAI-compatible server
+docker compose up server
+# → http://localhost:8080/v1/chat/completions
 ```
 
-Edit `docker-compose.yml` to change the default model path, KV compression type,
+Edit `docker-compose.yml` to change the model path, KV compression type,
 or thread count.
 
 ## KV Compression Options
@@ -79,4 +94,4 @@ Models are not baked into the image. Mount them at runtime:
 
 The final image is approximately 10MB:
 - Alpine base: ~7MB
-- quant binary: ~500KB (statically linked, zero dependencies)
+- quant + quant-server binaries: ~1MB total (statically linked, zero dependencies)
````
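The curl test added to the docs can also be driven from Python with only the standard library. A minimal sketch: `build_request` mirrors the payload shown in the docs, while the response parsing assumes the usual OpenAI-style `choices[0].message.content` schema, which this commit's docs do not spell out:

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 64) -> dict:
    """Build the chat-completions payload shown in docs/docker.md."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, host: str = "http://localhost:8080") -> str:
    """POST to quant-server's OpenAI-compatible endpoint, return the reply text."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumption: standard OpenAI response shape (choices[0].message.content).
    return body["choices"][0]["message"]["content"]
```

This is equivalent to the `curl` call above once the compose `server` service is up.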
