Commit 6ff7c1a

unamedkr and claude committed
README: fix overstated claims found by audit
- "Zero dependencies" → "No external libraries" (pthreads is a system dep)
- "5 architectures" → accurate description (3 code paths: Llama/Qwen3.5 share model_type=0, Gemma 3/4 share model_type=1, Qwen2-MoE)
- "4x longer context" → "~4x", with a footnote that the numbers are estimates based on KV memory reduction, not actual measurements
- Dependencies table: "Zero (libc only)" → "libc + pthreads only"
- Apply the same fixes to README.ko.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent: 6342379

2 files changed: 12 additions & 10 deletions


README.ko.md

2 additions & 2 deletions

```diff
@@ -2,7 +2,7 @@
 
 ![quant.cpp Hero](docs/assets/hero.png)
 
-로컬 LLM을 위한 미니멀 C 추론 엔진. 33K LOC. 외부 의존성 없음.
+로컬 LLM을 위한 미니멀 C 추론 엔진. 33K LOC. 외부 라이브러리 없음.
 
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
 [![CI](https://img.shields.io/github/actions/workflow/status/quantumaikr/quant.cpp/ci.yml?label=CI)]()
@@ -111,7 +111,7 @@ cmake --build build -j$(nproc)
 | Gemma 3 270M | Gemma 3 | 270M | 4-bit K verified |
 | Gemma 4 E2B | Gemma 4 | 2B | WIP |
 
-5개 아키텍처: Llama, Gemma 3, Gemma 4, Qwen3.5 (DeltaNet), Qwen2-MoE.
+아키텍처: Llama/Qwen3.5 (공유 경로), Gemma 3/4 (sliding + full attention), Qwen2-MoE.
 
 ---
 
```

In English, the Korean tagline changes from "Minimal C inference engine for local LLMs. 33K LOC. No external dependencies." to "… No external libraries.", and "5 architectures: Llama, Gemma 3, Gemma 4, Qwen3.5 (DeltaNet), Qwen2-MoE." becomes "Architectures: Llama/Qwen3.5 (shared path), Gemma 3/4 (sliding + full attention), Qwen2-MoE."

README.md

10 additions & 8 deletions

```diff
@@ -4,7 +4,7 @@
 
 Embeddable LLM inference in pure C.
 
-33K LOC. Zero dependencies. Read it in an afternoon.
+33K LOC. No external libraries. Read it in an afternoon.
 
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
 [![CI](https://img.shields.io/github/actions/workflow/status/quantumaikr/quant.cpp/ci.yml?label=CI)]()
```
````diff
@@ -14,13 +14,15 @@ Embeddable LLM inference in pure C.
 
 ## What quant.cpp does
 
-**4x longer context on the same hardware.** Delta KV compression fits more tokens into your available memory with no quality loss.
+**~4x longer context on the same hardware.** KV cache compression reduces per-token memory by 3.8x, extending context proportionally.
 
-| Hardware | Model | Without | With quant.cpp | Gain |
+| Hardware | Model | FP16 KV | 4-bit K + Q4 V | Gain |
 |----------|-------|---------|----------------|------|
-| 8GB Laptop | Llama 8B (Q4) | 16K tokens | 61K tokens | 3.8x |
-| 16GB Mac Air | SmolLM2 1.7B | 78K tokens | 298K tokens | 3.8x |
-| 24GB RTX 3090 | Llama 8B (Q4) | 147K tokens | 559K tokens | 3.8x |
+| 8GB Laptop | Llama 8B (Q4) | ~16K tokens | ~61K tokens | 3.8x |
+| 16GB Mac Air | SmolLM2 1.7B | ~78K tokens | ~298K tokens | 3.8x |
+| 24GB RTX 3090 | Llama 8B (Q4) | ~147K tokens | ~559K tokens | 3.8x |
+
+*Estimates based on KV memory reduction. Actual context depends on available memory after model weights.*
 
 ```bash
 ./quant model.gguf -p "hello"
````
```diff
@@ -34,7 +36,7 @@ Embeddable LLM inference in pure C.
 |--|-----------|-----------|
 | Code | **33K LOC**, pure C | 250K+ LOC, C++ |
 | Design | Read, modify, embed | Feature-complete |
-| Dependencies | **Zero** (libc only) | ggml framework |
+| Dependencies | libc + pthreads only | ggml framework |
 | KV compression | PPL **-3.2%** (better than FP32) | PPL +10.6% |
 
 quant.cpp is not a fork. It's a standalone engine built from scratch for one goal: **LLM inference you can understand, customize, and ship inside your own product.**
```
```diff
@@ -111,7 +113,7 @@ Cross-model: SmolLM2 1.7B (-1.6%), Qwen3.5 0.8B (+0.9%), Qwen3.5 4B (+0.6%).
 | Gemma 3 270M | Gemma 3 | 270M | Working |
 | Gemma 4 E2B | Gemma 4 | 2B | WIP |
 
-5 architectures: Llama, Gemma 3/4, Qwen3.5 (DeltaNet hybrid), Qwen2-MoE.
+Architectures: Llama/Qwen3.5 (shared path), Gemma 3/4 (sliding + full attention), Qwen2-MoE.
 
 GGUF format. Load any llama.cpp-compatible model file.
 
```
