
Commit 3d76424

unamedkr and claude committed
FAQ: anticipate community questions for single-header launch (en + ko)
New FAQ entries addressing expected r/programming, HN, r/C_Programming reactions:

- "15K lines too big?" — compared to stb_image (7.8K) and sqlite3 (240K), compile time 1.7s, binary 254KB
- "vs Karpathy's llm.c?" — same philosophy, but quant.h adds GGUF, multi-arch, quantized weights, KV compression
- "No GPU = useless?" — CPU is fine for embedding in apps/WASM/IoT, 25 tok/s on 1.7B model
- "Windows?" — yes, WIN32 guards for mmap/threading
- "Where to get GGUF?" — HuggingFace link + recommended starter model
- "AI-generated?" — AI-assisted development, human-directed and verified
- "WASM?" — pure C11, Emscripten supported, demo planned

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 6095573 commit 3d76424

2 files changed

Lines changed: 56 additions & 0 deletions


README.ko.md

Lines changed: 28 additions & 0 deletions
@@ -219,10 +219,38 @@ Works on Linux, macOS, Windows, iOS, Android, and WASM.

`quant.h` is the core inference engine (15K LOC) in a single file. The full build (33K LOC) adds GPU backends (Metal, CUDA, Vulkan), MoE routing, advanced quantization types, CLI tools, and benchmarks. Use quant.h for embedding; use the full build for research and development.

**Isn't a 15K-line header too big?**

stb_image.h is 7.8K lines. sqlite3.c (the amalgamation) is 240K lines. quant.h sits between them at 15K: large for a header, small for an inference engine. Compile time is ~1.7 seconds on an Apple M3, and the binary is 254KB. If compile time is a concern, use the CMake full build and link against `libturboquant.a`.

**How does this compare to Karpathy's llm.c?**

Similar philosophy: minimal C, educational, no dependencies. Key difference: quant.h supports quantized weights (Q4_K_M, Q8_0, IQ2) and multiple architectures (Llama, Qwen, Gemma) via its GGUF loader, while llm.c targets a single model with FP32 weights. quant.h also includes KV cache compression. If llm.c is the textbook, quant.h is the production version of the same idea.

**Is it useless without a GPU?**

It depends on the use case. If you need 100+ tok/s on large models, use llama.cpp with Metal/CUDA. If you need to embed inference in an iOS app, a WASM module, a game engine, or an IoT device with no GPU API, quant.h is a good fit. On an Apple Silicon CPU it reaches 25 tok/s on a 1.7B model, enough for assistants, autocomplete, and background processing.

**Does it work on Windows?**

Yes. `#ifdef _WIN32` guards handle mmap (`CreateFileMapping`/`MapViewOfFile`), threading, and file I/O. Compile with MSVC or MinGW: `cl app.c /O2` or `gcc app.c -o app -lm -lpthread`.

**Where do I get GGUF model files?**

Download any GGUF from [Hugging Face](https://huggingface.co/models?library=gguf). Recommended starter model: [SmolLM2-1.7B-Instruct-Q8_0](https://huggingface.co/bartowski/SmolLM2-1.7B-Instruct-GGUF) (1.8GB). No conversion needed; GGUF files are used directly.

**Is this AI-generated code?**

Developed using Claude Code as an AI development tool (the same as using Copilot). Architecture decisions, algorithm choices, bug fixes, and all PPL measurements are human-directed and verified. The code is 33K lines of C; you can read it yourself.

**What about sub-3-bit?**

Tested extensively: 2-bit delta, sub-block scaling, multi-hash, error feedback, NF2, online SVD. No approach reached acceptable quality. The fundamental barrier: a per-step cosine similarity of 0.997 compounds to 0.885 after 200 steps. 3-bit + delta is the practical minimum.

**Does it run in the browser (WASM)?**

The code is pure C11 with no platform dependencies in the core path. Emscripten compilation is supported. A browser demo with a small model is in preparation.

---

## References

README.md

Lines changed: 28 additions & 0 deletions
@@ -213,10 +213,38 @@ Works on Linux, macOS, Windows, iOS, Android, and WASM. Thread pool is global bu

`quant.h` is the core inference engine (15K LOC) in a single file. The full build (33K LOC) adds GPU backends (Metal, CUDA, Vulkan), MoE routing, advanced quantization types, CLI tools, and benchmarks. Use quant.h for embedding; use the full build for research and development.

**15K lines in a header — isn't that too big?**

stb_image.h is 7.8K lines. sqlite3.c (the amalgamation) is 240K lines. quant.h sits in between at 15K — large for a header, small for an inference engine. Compile time is ~1.7 seconds on an Apple M3. Binary size is 254KB. If compile time is a concern, use the full CMake build and link against `libturboquant.a` instead.
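As a concrete sketch of the two options above (the file name `quant_impl.c` and the exact link flags are assumptions, not documented commands):

```sh
# Header-only: confine the ~1.7s implementation compile to one translation
# unit, so editing app.c does not re-pay that cost on every build.
cc -O2 -c quant_impl.c -o quant_impl.o   # this TU pulls in quant.h's implementation
cc -O2 app.c quant_impl.o -o app -lm -lpthread

# Full build: compile the library once with CMake, then link the static archive.
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
cc -O2 app.c build/libturboquant.a -o app -lm -lpthread
```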
**How does this compare to Karpathy's llm.c?**

Similar philosophy: minimal C, educational, no dependencies. Key differences: quant.h supports quantized weight formats (Q4_K_M, Q8_0, IQ2) and multiple architectures (Llama, Qwen, Gemma) via the GGUF loader. llm.c targets a single model with FP32 weights. quant.h also includes KV cache compression. Think of llm.c as the textbook and quant.h as the production-ready version of the same idea.
**No GPU — is this useless?**

Depends on your use case. If you need 100+ tok/s on large models, use llama.cpp with Metal/CUDA. If you need to embed inference in an iOS app, a WASM module, a game engine, or an IoT device where there is no GPU API — quant.h works. CPU inference on Apple Silicon gets 25 tok/s on a 1.7B model, which is fine for assistants, autocomplete, and background processing.
**Does it work on Windows?**

Yes. The header includes `#ifdef _WIN32` guards for mmap (uses `CreateFileMapping`/`MapViewOfFile`), threading, and file I/O. Compile with MSVC or MinGW: `cl app.c /O2` or `gcc app.c -o app -lm -lpthread`.
**How do I get a GGUF model file?**

Download any GGUF from [Hugging Face](https://huggingface.co/models?library=gguf). Recommended starter model: [SmolLM2-1.7B-Instruct-Q8_0](https://huggingface.co/bartowski/SmolLM2-1.7B-Instruct-GGUF) (1.8GB). No conversion needed — GGUF files work directly.
**Is this AI-generated code?**

Developed with Claude Code as an AI-assisted development tool, the same way others use Copilot. The architecture, algorithm choices, bug fixes, and every PPL measurement are human-directed and verified. The code is 33K lines of C — feel free to read it.
**What about sub-3-bit quantization?**

Tested extensively: 2-bit delta, sub-block scaling, multi-hash, error feedback, NF2, online SVD. None reached acceptable quality. The barrier: a per-step cosine similarity of 0.997 compounds to 0.885 after 200 steps. 3-bit + delta is the practical minimum.

**Can it run in the browser (WASM)?**

The code is pure C11 with no platform-specific dependencies in the core path. Emscripten compilation is supported. A browser demo with a small model is on the roadmap.
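A plausible Emscripten invocation, assuming a `main.c` that embeds quant.h (the flags are standard emcc options; the file names are placeholders):

```sh
# Hypothetical build sketch -- file names are placeholders.
# ALLOW_MEMORY_GROWTH lets the WASM heap grow to hold model tensors;
# --preload-file bundles a small GGUF into Emscripten's virtual filesystem.
emcc main.c -O2 -sALLOW_MEMORY_GROWTH=1 --preload-file model.gguf -o demo.html
```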
---

## References
