PR #43 added pull/list/run/serve subcommands to the quantcpp CLI
(now on PyPI as v0.12.0). Update user-facing documentation to lead
with the new commands instead of the old single-shot Python pattern.
Changes:
- README.md Quick Start: lead with `quantcpp pull/run/serve/list`,
show short aliases (smollm2:135m, qwen3.5:0.8b, llama3.2:1b),
keep Python API as the secondary path
- README.ko.md: same restructure in Korean ("빠른 시작")
- site/index.html (guide): CTA section now shows CLI commands and
Python API side-by-side; new i18n keys cta.label.cli/python in
both EN and KO dictionaries
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
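For reviewers, a minimal client sketch for the new `serve` subcommand. Only the endpoint path (`POST /v1/chat/completions`) and default port 8080 are confirmed by the README text in this PR; the request and response field names below assume the standard OpenAI chat-completions shape and are not verified against the server implementation.

```python
# Hypothetical client for `quantcpp serve llama3.2:1b -p 8080`.
# Assumes the standard OpenAI chat-completions request/response schema;
# only the path and port are stated in this PR's README text.
import json
import urllib.request

payload = {
    "model": "llama3.2:1b",
    "messages": [{"role": "user", "content": "What is gravity?"}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# choices[0].message.content is the OpenAI response shape (assumed here).
print(body["choices"][0]["message"]["content"])
```

If the endpoint is OpenAI-compatible as claimed, any OpenAI-style client pointed at `http://localhost:8080` should work the same way.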
README.ko.md: 17 additions, 4 deletions
@@ -22,20 +22,33 @@
 
 ---
 
-## 3줄로 시작하기
+## 빠른 시작
 
+**Ollama 스타일 CLI (v0.12.0+):**
 ```bash
 pip install quantcpp
+
+quantcpp pull llama3.2:1b            # HuggingFace에서 다운로드
+quantcpp run llama3.2:1b             # 대화형 채팅
+quantcpp serve llama3.2:1b -p 8080   # OpenAI 호환 HTTP 서버
+quantcpp list                        # 캐시된 모델 목록
+```
+
+짧은 별칭: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. `run`/`serve` 첫 실행 시 자동 다운로드. `serve`는 OpenAI 호환 `POST /v1/chat/completions` 엔드포인트를 8080 포트에 제공합니다.
+
+**한 줄 질문:**
+```bash
+quantcpp run llama3.2:1b "중력이란 무엇인가요?"
 ```
 
+**Python API (3줄):**
 ```python
 from quantcpp import Model
-
-m = Model.from_pretrained("Llama-3.2-1B")  # 모델 자동 다운로드 (~750 MB)
+m = Model.from_pretrained("Llama-3.2-1B")
 print(m.ask("중력이란 무엇인가요?"))
 ```
 
-API 키 없음. GPU 없음. 설정 없음. [브라우저에서 바로 체험 →](https://quantumaikr.github.io/quant.cpp/) · [**작동 원리 가이드 →**](https://quantumaikr.github.io/quant.cpp/guide/)
+API 키 없음. GPU 없음. 설정 없음. 모델은 `~/.cache/quantcpp/`에 캐시됩니다. [브라우저에서 바로 체험 →](https://quantumaikr.github.io/quant.cpp/) · [**작동 원리 가이드 →**](https://quantumaikr.github.io/quant.cpp/guide/)
README.md: 15 additions, 11 deletions
@@ -37,27 +37,31 @@
 
 ## Quick Start
 
-**Terminal (one command):**
+**Ollama-style CLI (v0.12.0+):**
 ```bash
 pip install quantcpp
-quantcpp "What is gravity?"
+
+quantcpp pull llama3.2:1b            # download from HuggingFace
+quantcpp run llama3.2:1b             # interactive chat
+quantcpp serve llama3.2:1b -p 8080   # OpenAI-compatible HTTP server
+quantcpp list                        # show cached models
+```
+
+Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-pulls on first `run`/`serve`. The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080.
+
+**One-shot question:**
+```bash
+quantcpp run llama3.2:1b "What is gravity?"
 ```
 
-**Python (3 lines):**
+**Python API (3 lines):**
 ```python
 from quantcpp import Model
 m = Model.from_pretrained("Llama-3.2-1B")
 print(m.ask("What is gravity?"))
 ```
 
-**Interactive chat:**
-```bash
-quantcpp
-# You: What is gravity?
-# AI: Gravity is a fundamental force...
-```
-
-Downloads Llama-3.2-1B (~750 MB) on first use, cached locally. No API key, no GPU. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**How it works — Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
+Downloads on first use, cached at `~/.cache/quantcpp/`. No API key, no GPU. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
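The new README text pins the cache location to `~/.cache/quantcpp/`. As a rough, hypothetical equivalent of `quantcpp list`, this sketch walks that directory; the file layout inside the cache is not specified in this PR and is assumed here.

```python
# Hypothetical stand-in for `quantcpp list`: walk the documented cache dir.
# Only the path ~/.cache/quantcpp/ is stated in the README; the layout of
# files inside it is an assumption.
from pathlib import Path

cache = Path.home() / ".cache" / "quantcpp"
if cache.is_dir():
    for f in sorted(cache.rglob("*")):
        if f.is_file():
            print(f"{f.relative_to(cache)}  {f.stat().st_size / 1e6:.1f} MB")
else:
    print("no cached models yet")
```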
site/index.html:

 <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
-<p style="color:var(--text2);margin-bottom:2rem;max-width:500px;margin-left:auto;margin-right:auto" data-i18n="cta.desc">Three lines of Python. No GPU, no API key, no setup.</p>
+<p style="color:var(--text2);margin-bottom:2rem;max-width:560px;margin-left:auto;margin-right:auto" data-i18n="cta.desc">Ollama-style CLI. No GPU, no API key, no setup.</p>
@@ -896,7 +909,9 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
 "glossary.gguf.term": "GGUF",
 "glossary.gguf.def": "The standard file format for quantized LLM model weights, created by the llama.cpp project. quant.cpp loads GGUF models directly.",
 "cta.title": "Try It Yourself",
-"cta.desc": "Three lines of Python. No GPU, no API key, no setup.",
+"cta.desc": "Ollama-style CLI. No GPU, no API key, no setup.",
+"cta.label.cli": "CLI (v0.12.0+)",
+"cta.label.python": "Python API",
 "rag.label": "Movement",
 "rag.title": "Beyond RAG",
 "rag.intro": "Traditional RAG splits documents into 512-token chunks, embeds them in a vector database, and retrieves fragments. This was a reasonable engineering compromise when LLMs had 2K context windows. <strong>Now they have 128K. The compromise should have started disappearing.</strong>",
@@ -1083,7 +1098,9 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
-"cta.desc": "Python 3\uC904. GPU\uB3C4, API \uD0A4\uB3C4, \uC124\uCE58\uB3C4 \uD544\uC694 \uC5C6\uC2B5\uB2C8\uB2E4.",
+"cta.desc": "Ollama \uC2A4\uD0C0\uC77C CLI. GPU\uB3C4, API \uD0A4\uB3C4, \uC124\uCE58\uB3C4 \uD544\uC694 \uC5C6\uC2B5\uB2C8\uB2E4.",
+"cta.label.cli": "CLI (v0.12.0+)",
+"cta.label.python": "Python API",
 "rag.label": "운동",
 "rag.title": "Beyond RAG",
 "rag.intro": "전통적인 RAG는 문서를 512토큰 청크로 나누고, 벡터 DB에 임베딩하고, 조각을 검색합니다. 이것은 LLM이 2K 컨텍스트만 가졌을 때 합리적인 엔지니어링 타협이었습니다. <strong>지금은 128K입니다. 그 타협은 사라지기 시작했어야 합니다.</strong>",