
Commit e0ae945

unamedkr and Claude authored
docs: address "why not just use llama.cpp?" with concrete scenarios (#38)
Reddit feedback (Eyelbee, 5 upvotes): feature tables don't convince users who know llama.cpp. Added scenario-based comparison showing where the single-header approach matters in practice:

- WASM: 192 KB vs GGML tensor graph too large
- Microcontroller: #include only option (no FS, no linker)
- Game engines: one .h vs 250K LOC build integration
- Teaching: readable in an afternoon

Includes side-by-side build commands (cc app.c -lm vs cmake + link). Explicitly recommends llama.cpp for GPU speed and model coverage. Applied to both EN and KO READMEs.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4cc5598 commit e0ae945

2 files changed

Lines changed: 64 additions & 18 deletions


README.ko.md

Lines changed: 30 additions & 13 deletions
@@ -156,19 +156,36 @@ The Transformer's attention mechanism naturally concentrates on recent tokens
</details>

<details>
-<summary><b>Comparison with other engines</b></summary>
-
-| | quant.cpp | llama.cpp | vLLM | MLX |
-|---|:---:|:---:|:---:|:---:|
-| KV compression | **7 schemes** | Q8_0/Q5_0 (2x) | -- | -- |
-| Code size | **72K LOC** | 250K+ | 100K+ | 50K+ |
-| Embeddable | **single header** | library | library | framework |
-| GPU speed | basic | optimized | **best** | Metal |
-| Read the code in a day | **✅** | ❌ | ❌ | ❌ |
-
-Use **llama.cpp** when you need top speed on a workstation.
-Use **vLLM** when you need batch serving.
-Use **quant.cpp** when you need to **put** AI inside an app, game, website, or device.
+<summary><b>"llama.cpp can be embedded too, so why quant.cpp?"</b></summary>
+
+True. llama.cpp is excellent and can be embedded as well. The difference is the **integration model**:
+
+**llama.cpp = a compiled library** (250K+ LOC). Linking `libllama` brings along GGML tensor graphs, the Metal/CUDA backends, samplers, and the tokenizer. Great if your build system can handle that, but it is a _library_ that requires a build step.
+
+**quant.cpp = a single file** (16K LOC). `#include "quant.h"`, then compile with `cc app.c -lm`. No CMake, and no linker flags beyond libc. One translation unit.
+
+```
+# quant.cpp: add AI to a C project in 2 lines
+cc -O2 my_app.c -lm -lpthread -o my_app   # done
+
+# llama.cpp: the library must be built first
+cmake -B build && cmake --build build
+cc my_app.c -Ibuild/include -Lbuild -lllama -lm -lstdc++ -o my_app
+```
+
+| Scenario | quant.cpp | llama.cpp |
+|:---------|:---------:|:---------:|
+| **WASM browser** | 192 KB binary | GGML tensor graph too large |
+| **Microcontroller / RTOS** | `#include` is the only option (no FS, no linker) | Needs a build system |
+| **Game engines** (Unity/Unreal/Godot) | Drop in one `.h` file | Integrate a 250K LOC build |
+| **Teaching / research** | Read the whole codebase in a day | Excellent but a large codebase |
+| **GPU speed** | Basic | **Metal/CUDA optimized** |
+| **Model coverage** | 7 architectures | **100+** |
+
+> **llama.cpp**: when you need top speed on a workstation.
+> **vLLM**: when you need batch serving.
+> **quant.cpp**: when you need to put AI _inside_ an app, game, browser, or device, and integration simplicity matters more than GPU throughput.
+
</details>

<details>

README.md

Lines changed: 34 additions & 5 deletions
@@ -310,19 +310,48 @@ Both are per-block methods. The quality gap comes from block size (128 vs 32), m
| GGUF model loading | **✅ 7 architectures** | ❌ | ❌ | research only |
| End-to-end inference | **✅** | kernel only | kernel only | kernel only |

+### "Why not just use llama.cpp?"
+
+You absolutely can. llama.cpp is excellent. The difference is **integration scope**, not capability:
+
+**llama.cpp = compiled library** (250K+ LOC). You link `libllama`, which pulls in GGML tensor graphs, Metal/CUDA backends, sampling, tokenizer. Great if your build system handles it — but it's a _library_ with a build step.
+
+**quant.cpp = one file** (16K LOC). `#include "quant.h"`, compile with `cc app.c -lm`. No CMake, no linker flags beyond libc. One translation unit.
+
+Where this difference matters in practice:
+
+```
+# quant.cpp — add LLM to any C project in 2 lines
+cc -O2 my_app.c -lm -lpthread -o my_app   # that's it
+
+# llama.cpp — requires building the library first
+cmake -B build && cmake --build build     # build libllama
+cc my_app.c -Ibuild/include -Lbuild -lllama -lm -lstdc++ -o my_app
+```
+
+| Scenario | quant.cpp | llama.cpp |
+|:---------|:---------:|:---------:|
+| **WASM browser demo** | 192 KB binary | GGML tensor graph too large |
+| **Microcontroller / RTOS** | `#include` only option (no FS, no linker) | Needs build system |
+| **Game engine plugin** (Unity/Unreal/Godot) | Drop one `.h` | Integrate 250K LOC build |
+| **Teaching / research** | Read in an afternoon | Excellent but large codebase |
+| **Quick prototype** | `pip install quantcpp` or 2-line C | More setup needed |
+| **GPU speed** | Basic | **Full Metal/CUDA** |
+| **Model coverage** | 7 architectures | **100+** |
+| **Production hardening** | Early stage | **Battle-tested** |
+
+> **Use llama.cpp** for speed on a workstation. **Use vLLM** for batch serving.
+> **Use quant.cpp** when you need to ship LLM inference _inside_ something — an app, a game, a browser tab, an embedded device — and integration simplicity matters more than GPU throughput.
+
### vs production inference engines

| | quant.cpp | llama.cpp | vLLM | MLX |
|:--|:---------:|:---------:|:----:|:---:|
-| KV quantization | **TurboQuant + 6 schemes** | Q8_0/Q5_0 (2x) | -- | -- |
+| KV quantization | **7 schemes (3-7x)** | Q8_0/Q5_0 (2x) | -- | -- |
| Code size | **72K LOC** | 250K+ | 100K+ | 50K+ |
| Embeddable | **single header** | library | library | framework |
-| Read in an afternoon | **✅** | ❌ | ❌ | ❌ |
| GPU throughput | basic | full | **best** | Metal |

-> **Use llama.cpp** for speed on a workstation. **Use vLLM** for batch serving.
-> **Use quant.cpp** when you need to ship LLM inference inside something — an app, a game, a website, a device.
---
## Supported Models
