README: expand single-header docs with full API table and FAQ
- Added callback example (on_token function)
- API table with all 6 functions
- Config options with code example
- FAQ: quant.h vs full build difference
- FAQ: two embedding options (single-header vs library)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -164,7 +195,15 @@ llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a min
 
 **Can I embed this in my app?**
 
-Yes. Pure C11, zero dependencies. Copy the source files, link against libc/libm, and call `tq_load_model()` / `tq_generate()`. Works on Linux, macOS, Windows, iOS, Android, and WASM. Thread pool is global but mutex-protected.
+Yes. Two options:
+
+1. **Single-header** (easiest): Copy `quant.h` into your project. `#define QUANT_IMPLEMENTATION` in one .c file. Done.
+2. **Full library**: Link against `libturboquant.a` and call `tq_load_model()` / `tq_generate()`.
+
+Works on Linux, macOS, Windows, iOS, Android, and WASM. Thread pool is global but mutex-protected.
+
+**What's the difference between quant.h and the full build?**
+
+`quant.h` is the core inference engine (15K LOC) in a single file. The full build (33K LOC) adds GPU backends (Metal, CUDA, Vulkan), MoE routing, advanced quantization types, CLI tools, and benchmarks. Use quant.h for embedding; use the full build for research and development.