FAQ: anticipate community questions for single-header launch (en + ko)
New FAQ entries addressing expected r/programming, HN, r/C_Programming reactions:
- "15K lines too big?" — compared to stb_image (7.8K) and sqlite3 (240K),
compile time 1.7s, binary 254KB
- "vs Karpathy's llm.c?" — same philosophy, but quant.h adds GGUF,
multi-arch, quantized weights, KV compression
- "No GPU = useless?" — CPU is fine for embedding in apps/WASM/IoT,
25 tok/s on 1.7B model
- "Windows?" — yes, WIN32 guards for mmap/threading
- "Where to get GGUF?" — HuggingFace link + recommended starter model
- "AI-generated?" — AI-assisted development, human-directed and verified
- "WASM?" — pure C11, Emscripten supported, demo planned
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
README.md (+28 lines)
@@ -213,10 +213,38 @@ Works on Linux, macOS, Windows, iOS, Android, and WASM. Thread pool is global bu

`quant.h` is the core inference engine (15K LOC) in a single file. The full build (33K LOC) adds GPU backends (Metal, CUDA, Vulkan), MoE routing, advanced quantization types, CLI tools, and benchmarks. Use quant.h for embedding; use the full build for research and development.

**15K lines in a header — isn't that too big?**

stb_image.h is 7.8K lines. sqlite3.c (the amalgamation) is 240K lines. quant.h sits in between at 15K — large for a header, small for an inference engine. Compile time is ~1.7 seconds on Apple M3. Binary size is 254KB. If compile time is a concern, use the full CMake build and link against `libturboquant.a` instead.

**How does this compare to Karpathy's llm.c?**

Similar philosophy: minimal C, educational, no dependencies. Key differences: quant.h supports quantized weight formats (Q4_K_M, Q8_0, IQ2) and multiple architectures (Llama, Qwen, Gemma) via the GGUF loader. llm.c targets a single model with FP32 weights. quant.h also includes KV cache compression. Think of llm.c as the textbook and quant.h as the production-ready version of the same idea.

**No GPU — is this useless?**

Depends on your use case. If you need 100+ tok/s on large models, use llama.cpp with Metal/CUDA. If you need to embed inference in an iOS app, a WASM module, a game engine, or an IoT device where there is no GPU API — quant.h works. CPU inference on Apple Silicon gets 25 tok/s on a 1.7B model, which is fine for assistants, autocomplete, and background processing.

**Does it work on Windows?**

Yes. The header includes `#ifdef _WIN32` guards for mmap (uses `CreateFileMapping`/`MapViewOfFile`), threading, and file I/O. Compile with MSVC or MinGW: `cl app.c /O2` or `gcc app.c -o app -lm -lpthread`.

**How do I get a GGUF model file?**

Download any GGUF from [Hugging Face](https://huggingface.co/models?library=gguf). Recommended starter model: [SmolLM2-1.7B-Instruct-Q8_0](https://huggingface.co/bartowski/SmolLM2-1.7B-Instruct-GGUF) (1.8GB). No conversion needed — GGUF files work directly.

**Is this AI-generated code?**

It was developed with Claude Code as an AI-assisted development tool, the same way others use Copilot. The architecture, algorithm choices, bug fixes, and every PPL measurement are human-directed and verified. The code is 33K lines of C — feel free to read it.

**What about sub-3-bit quantization?**

Tested extensively: 2-bit delta, sub-block scaling, multi-hash, error feedback, NF2, and online SVD. None reached acceptable quality. The barrier: a per-step cosine similarity of 0.997 compounds to 0.885 after 200 steps. 3-bit + delta is the practical minimum.

**Can it run in the browser (WASM)?**

The code is pure C11 with no platform-specific dependencies in the core path. Emscripten compilation is supported. A browser demo with a small model is on the roadmap.