Fix LOC count (55K), server security, Dockerfile, gitignore

unamedkr · claude · unamedkr · commit f49e70de9e1b · 2026-04-05T11:13:50.000+09:00
- Correct LOC from 67K/33K to actual 55K across README/CLAUDE.md
- Server: socket timeout (slow-loris), connection limit, safe Content-Length
- Dockerfile: build quant-server, fix stale comment
- Gitignore: exclude generated benchmark results

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/.gitignore b/.gitignore
@@ -45,6 +45,12 @@ models/.claude/worktrees/
 *.gguf
 models/
 
+# Benchmark results (machine-generated)
+bench/ablation_results/
+bench/kv_quality_results/
+bench/data/long_context_ppl_*.csv
+bench/data/long_context_ppl_*.json
+
 # Makefile build artifacts
 quant
 tq_convert
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -3,7 +3,7 @@
 ## Project Overview
 
 quant.cpp is a minimal C inference engine for local LLM with KV cache compression.
-33K LOC, pure C, zero dependencies. Supports 5 architectures via GGUF.
+55K LOC, pure C, zero dependencies. Supports 5 architectures via GGUF.
 Killer feature: delta KV compression — 3-bit keys with PPL -3.2% vs FP32.
 
 ## Architecture
diff --git a/Dockerfile b/Dockerfile
@@ -15,7 +15,9 @@ RUN cmake -B build \
         -DCMAKE_EXE_LINKER_FLAGS="-static" \
         -DTQ_BUILD_TESTS=OFF \
         -DTQ_BUILD_BENCH=OFF \
-    && cmake --build build -j$(nproc) --target quant
+        -DTQ_BUILD_SERVER=ON \
+    && cmake --build build -j$(nproc) --target quant \
+    && cmake --build build -j$(nproc) --target quant-server
 
 # ---- Runtime stage ----
 FROM alpine:3.20
@@ -25,13 +27,14 @@ LABEL org.opencontainers.image.title="quant.cpp" \
       org.opencontainers.image.description="LLM inference with 7x longer context — pure C, zero dependencies" \
       org.opencontainers.image.source="https://github.com/quantumaikr/quant.cpp"
 
-# Copy only the binary
+# Copy binaries
 COPY --from=builder /src/build/quant /usr/local/bin/quant
+COPY --from=builder /src/build/quant-server /usr/local/bin/quant-server
 
 # Create model mount point
 RUN mkdir -p /models
 
-# Future server mode
+# OpenAI-compatible server port
 EXPOSE 8080
 
 # Volume for GGUF model files
diff --git a/README.ko.md b/README.ko.md
@@ -6,7 +6,7 @@
 
 <p align="center">
   무손실 KV 캐시 압축. <a href="#-단일-헤더-모드"><b>quant.h</b></a> 단일 헤더 라이브러리로도 제공됩니다.<br>
-  67K LOC. 임베딩 가능. 오후 한나절이면 전체 코드를 읽을 수 있습니다.
+  55K LOC. 임베딩 가능. 오후 한나절이면 전체 코드를 읽을 수 있습니다.
 </p>
 
 <p align="center">
@@ -76,7 +76,7 @@ LLM 메모리의 병목은 모델 가중치가 아니라 **KV 캐시**입니다.
 |  | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
 |:--|:---------:|:---------:|:----:|:---:|:-------:|
 | KV 압축 | **7x, +0% PPL** | +10.6% PPL | -- | -- | -- |
-| 코드 크기 | **67K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
+| 코드 크기 | **55K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
 | 의존성 | **제로** | ggml | PyTorch | Apple fw | 런타임 |
 | 임베더블 | **단일 헤더** | -- | -- | -- | 복잡 |
 | WASM | **192KB** | -- | -- | -- | -- |
@@ -278,7 +278,7 @@ python3 -m http.server 8080       # 로컬 서버 시작
 <details>
 <summary><b>llama.cpp와 뭐가 다른가요?</b></summary>
 
-llama.cpp는 전체 기능을 갖춘 추론 프레임워크 (250K+ LOC). quant.cpp는 읽고, 수정하고, 임베딩할 수 있는 미니멀 엔진 (67K LOC). 다른 문제를 위한 다른 도구입니다: llama.cpp는 속도를, quant.cpp는 메모리(KV 압축)와 임베더빌리티(단일 헤더)를 최적화합니다.
+llama.cpp는 전체 기능을 갖춘 추론 프레임워크 (250K+ LOC). quant.cpp는 읽고, 수정하고, 임베딩할 수 있는 미니멀 엔진 (55K LOC). 다른 문제를 위한 다른 도구입니다: llama.cpp는 속도를, quant.cpp는 메모리(KV 압축)와 임베더빌리티(단일 헤더)를 최적화합니다.
 
 </details>
 
diff --git a/README.md b/README.md
@@ -6,7 +6,7 @@
 
 <p align="center">
   Lossless KV cache compression. Also ships as <a href="#-single-header-mode"><b>quant.h</b></a> — a single-header library.<br>
-  67K LOC. Embeddable. Read it in an afternoon.
+  55K LOC. Embeddable. Read it in an afternoon.
 </p>
 
 <p align="center">
@@ -76,7 +76,7 @@ LLM memory is dominated by the **KV cache**, not model weights. At 32K context,
 |  | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
 |:--|:---------:|:---------:|:----:|:---:|:-------:|
 | KV compression | **7x, +0% PPL** | +10.6% PPL | -- | -- | -- |
-| Code size | **67K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
+| Code size | **55K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
 | Dependencies | **zero** | ggml | PyTorch | Apple fw | runtime |
 | Embeddable | **single header** | -- | -- | -- | complex |
 | WASM | **192KB** | -- | -- | -- | -- |
@@ -278,7 +278,7 @@ Everything runs client-side. Nothing is uploaded. KV compression active by defau
 <details>
 <summary><b>How is this different from llama.cpp?</b></summary>
 
-llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (67K LOC) you can read, modify, and embed. Different tools for different problems: llama.cpp optimizes speed, quant.cpp optimizes memory (KV compression) and embeddability (single header).
+llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (55K LOC) you can read, modify, and embed. Different tools for different problems: llama.cpp optimizes speed, quant.cpp optimizes memory (KV compression) and embeddability (single header).
 
 </details>