PR #43 added pull/list/run/serve subcommands to the quantcpp CLI
(now on PyPI as v0.12.0). Update user-facing documentation to lead
with the new commands instead of the old single-shot Python pattern.
Changes:
- README.md Quick Start: lead with `quantcpp pull/run/serve/list`,
show short aliases (smollm2:135m, qwen3.5:0.8b, llama3.2:1b),
keep Python API as the secondary path
- README.ko.md: same restructure in Korean ("빠른 시작")
- site/index.html (guide): CTA section now shows CLI commands and
Python API side-by-side; new i18n keys cta.label.cli/python in
both EN and KO dictionaries
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
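For reviewers, a minimal client sketch for the new `serve` subcommand. Only the endpoint path (`POST /v1/chat/completions`) and default port 8080 are confirmed by the README text in this PR; the request and response field names below assume the standard OpenAI chat-completions shape and are not verified against the server implementation.

```python
# Hypothetical client for `quantcpp serve llama3.2:1b -p 8080`.
# Assumes the standard OpenAI chat-completions request/response schema;
# only the path and port are stated in this PR's README text.
import json
import urllib.request

payload = {
    "model": "llama3.2:1b",
    "messages": [{"role": "user", "content": "What is gravity?"}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# choices[0].message.content is the OpenAI response shape (assumed here).
print(body["choices"][0]["message"]["content"])
```

If the endpoint is OpenAI-compatible as claimed, any OpenAI-style client pointed at `http://localhost:8080` should work the same way.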
README.ko.md: 17 additions, 4 deletions
@@ -22,20 +22,33 @@
 
 ---
 
-## 3줄로 시작하기
+## 빠른 시작
 
+**Ollama 스타일 CLI (v0.12.0+):**
 ```bash
 pip install quantcpp
+
+quantcpp pull llama3.2:1b            # HuggingFace에서 다운로드
+quantcpp run llama3.2:1b             # 대화형 채팅
+quantcpp serve llama3.2:1b -p 8080   # OpenAI 호환 HTTP 서버
+quantcpp list                        # 캐시된 모델 목록
+```
+
+짧은 별칭: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. `run`/`serve` 첫 실행 시 자동 다운로드. `serve`는 OpenAI 호환 `POST /v1/chat/completions` 엔드포인트를 8080 포트에 제공합니다.
+
+**한 줄 질문:**
+```bash
+quantcpp run llama3.2:1b "중력이란 무엇인가요?"
 ```
 
+**Python API (3줄):**
 ```python
 from quantcpp import Model
-
-m = Model.from_pretrained("Llama-3.2-1B")  # 모델 자동 다운로드 (~750 MB)
+m = Model.from_pretrained("Llama-3.2-1B")
 print(m.ask("중력이란 무엇인가요?"))
 ```
 
-API 키 없음. GPU 없음. 설정 없음. [브라우저에서 바로 체험 →](https://quantumaikr.github.io/quant.cpp/) · [**작동 원리 가이드 →**](https://quantumaikr.github.io/quant.cpp/guide/)
+API 키 없음. GPU 없음. 설정 없음. 모델은 `~/.cache/quantcpp/`에 캐시됩니다. [브라우저에서 바로 체험 →](https://quantumaikr.github.io/quant.cpp/) · [**작동 원리 가이드 →**](https://quantumaikr.github.io/quant.cpp/guide/)
README.md: 15 additions, 11 deletions
@@ -37,27 +37,31 @@
 
 ## Quick Start
 
-**Terminal (one command):**
+**Ollama-style CLI (v0.12.0+):**
 ```bash
 pip install quantcpp
-quantcpp "What is gravity?"
+
+quantcpp pull llama3.2:1b            # download from HuggingFace
+quantcpp run llama3.2:1b             # interactive chat
+quantcpp serve llama3.2:1b -p 8080   # OpenAI-compatible HTTP server
+quantcpp list                        # show cached models
+```
+
+Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-pulls on first `run`/`serve`. The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080.
+
+**One-shot question:**
+```bash
+quantcpp run llama3.2:1b "What is gravity?"
 ```
 
-**Python (3 lines):**
+**Python API (3 lines):**
 ```python
 from quantcpp import Model
 m = Model.from_pretrained("Llama-3.2-1B")
 print(m.ask("What is gravity?"))
 ```
 
-**Interactive chat:**
-```bash
-quantcpp
-# You: What is gravity?
-# AI: Gravity is a fundamental force...
-```
-
-Downloads Llama-3.2-1B (~750 MB) on first use, cached locally. No API key, no GPU. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**How it works — Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
+Downloads on first use, cached at `~/.cache/quantcpp/`. No API key, no GPU. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
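The new README text pins the cache location to `~/.cache/quantcpp/`. As a rough, hypothetical equivalent of `quantcpp list`, this sketch walks that directory; the file layout inside the cache is not specified in this PR and is assumed here.

```python
# Hypothetical stand-in for `quantcpp list`: walk the documented cache dir.
# Only the path ~/.cache/quantcpp/ is stated in the README; the layout of
# files inside it is an assumption.
from pathlib import Path

cache = Path.home() / ".cache" / "quantcpp"
if cache.is_dir():
    for f in sorted(cache.rglob("*")):
        if f.is_file():
            print(f"{f.relative_to(cache)}  {f.stat().st_size / 1e6:.1f} MB")
else:
    print("no cached models yet")
```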
site/index.html:

 <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
-<p style="color:var(--text2);margin-bottom:2rem;max-width:500px;margin-left:auto;margin-right:auto" data-i18n="cta.desc">Three lines of Python. No GPU, no API key, no setup.</p>
+<p style="color:var(--text2);margin-bottom:2rem;max-width:560px;margin-left:auto;margin-right:auto" data-i18n="cta.desc">Ollama-style CLI. No GPU, no API key, no setup.</p>
@@ -896,7 +909,9 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
 "glossary.gguf.term": "GGUF",
 "glossary.gguf.def": "The standard file format for quantized LLM model weights, created by the llama.cpp project. quant.cpp loads GGUF models directly.",
 "cta.title": "Try It Yourself",
-"cta.desc": "Three lines of Python. No GPU, no API key, no setup.",
+"cta.desc": "Ollama-style CLI. No GPU, no API key, no setup.",
+"cta.label.cli": "CLI (v0.12.0+)",
+"cta.label.python": "Python API",
 "rag.label": "Movement",
 "rag.title": "Beyond RAG",
 "rag.intro": "Traditional RAG splits documents into 512-token chunks, embeds them in a vector database, and retrieves fragments. This was a reasonable engineering compromise when LLMs had 2K context windows. <strong>Now they have 128K. The compromise should have started disappearing.</strong>",
@@ -1083,7 +1098,9 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
-"cta.desc": "Python 3\uC904. GPU\uB3C4, API \uD0A4\uB3C4, \uC124\uCE58\uB3C4 \uD544\uC694 \uC5C6\uC2B5\uB2C8\uB2E4.",
+"cta.desc": "Ollama \uC2A4\uD0C0\uC77C CLI. GPU\uB3C4, API \uD0A4\uB3C4, \uC124\uCE58\uB3C4 \uD544\uC694 \uC5C6\uC2B5\uB2C8\uB2E4.",
+"cta.label.cli": "CLI (v0.12.0+)",
+"cta.label.python": "Python API",
 "rag.label": "운동",
 "rag.title": "Beyond RAG",
 "rag.intro": "전통적인 RAG는 문서를 512토큰 청크로 나누고, 벡터 DB에 임베딩하고, 조각을 검색합니다. 이것은 LLM이 2K 컨텍스트만 가졌을 때 합리적인 엔지니어링 타협이었습니다. <strong>지금은 128K입니다. 그 타협은 사라지기 시작했어야 합니다.</strong>",