
Commit 15e4700

unamedkr and claude committed
README: expand single-header docs with full API table and FAQ
- Added callback example (on_token function)
- API table with all 6 functions
- Config options with code example
- FAQ: quant.h vs full build difference
- FAQ: two embedding options (single-header vs library)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d8ec1f6 commit 15e4700

1 file changed: README.md (43 additions, 4 deletions)
@@ -2,7 +2,7 @@
 
 ![quant.cpp Hero](docs/assets/hero.png)
 
-Embeddable LLM inference in pure C.
+Embeddable LLM inference in pure C. Also ships as [**quant.h**](#single-header-mode) — a single-header library.
 
 33K LOC. No external libraries. Read it in an afternoon.
 

@@ -54,11 +54,18 @@ Copy one file. Add LLM to any C project.
 ```c
 #define QUANT_IMPLEMENTATION
 #include "quant.h"
+#include <stdio.h>
+
+static void on_token(const char* text, void* ud) {
+    (void)ud;
+    printf("%s", text);
+    fflush(stdout);
+}
 
 int main() {
     quant_model* m = quant_load("model.gguf");
     quant_ctx* c = quant_new(m, NULL);
-    quant_generate(c, "Hello!", print_token, NULL);
+    quant_generate(c, "Hello!", on_token, NULL);
     quant_free_ctx(c);
     quant_free_model(m);
 }
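
A note on the new example: the `void* ud` parameter that `quant_generate` passes through to the callback goes unused here (`(void)ud;`). Below is a minimal sketch of what it is for, collecting streamed tokens into caller-owned state; the `sink` struct and its silent-truncation policy are hypothetical, and only the quant.h calls come from the diff above:

```c
/* Sketch only: carries caller state into the callback via userdata.
 * The sink struct is illustrative, not part of the quant.h API. */
#define QUANT_IMPLEMENTATION
#include "quant.h"
#include <stdio.h>
#include <string.h>

typedef struct { char buf[4096]; size_t len; } sink;

static void collect(const char* text, void* ud) {
    sink* s = (sink*)ud;
    size_t n = strlen(text);
    if (s->len + n < sizeof s->buf) {           /* keep room for the NUL */
        memcpy(s->buf + s->len, text, n + 1);   /* copy text plus terminator */
        s->len += n;
    }
}

int main(void) {
    quant_model* m = quant_load("model.gguf");
    quant_ctx* c = quant_new(m, NULL);
    sink s = {0};
    quant_generate(c, "Hello!", collect, &s);   /* tokens accumulate in s.buf */
    printf("full reply: %s\n", s.buf);
    quant_free_ctx(c);
    quant_free_model(m);
}
```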
@@ -68,7 +75,31 @@ int main() {
 cc app.c -o app -lm -lpthread  # that's it
 ```
 
-15K lines, 628KB. No cmake, no build system, no dependencies.
+15K lines, 628KB. No cmake, no build system, no framework.
+
+**Full API** (6 functions):
+
+| Function | Description |
+|----------|-------------|
+| `quant_load(path)` | Load a GGUF model file |
+| `quant_new(model, config)` | Create inference context (config=NULL for defaults) |
+| `quant_generate(ctx, prompt, callback, userdata)` | Stream tokens via callback |
+| `quant_ask(ctx, prompt)` | Generate and return full string (caller frees) |
+| `quant_free_ctx(ctx)` | Free context |
+| `quant_free_model(model)` | Free model |
+
+**Config options:**
+
+```c
+quant_config cfg = {
+    .temperature = 0.7f,  // sampling temperature
+    .top_p = 0.9f,        // nucleus sampling
+    .max_tokens = 256,    // generation limit
+    .n_threads = 4,       // matmul threads
+    .kv_compress = 1,     // 0=off, 1=4-bit K+V, 2=delta+3-bit
+};
+quant_ctx* c = quant_new(model, &cfg);
+```
 
 ---
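
Per the new table, `quant_ask` is the non-streaming counterpart of `quant_generate`. A sketch combining it with the config block above; it assumes the returned string is heap-allocated and released with plain `free()`, which the table ("caller frees") does not actually specify:

```c
/* Sketch of the non-streaming path from the API table. The free() call
 * is an assumption; the commit only says the caller frees the string. */
#define QUANT_IMPLEMENTATION
#include "quant.h"
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    quant_model* m = quant_load("model.gguf");
    quant_config cfg = { .temperature = 0.7f, .top_p = 0.9f,
                         .max_tokens = 256, .n_threads = 4, .kv_compress = 1 };
    quant_ctx* c = quant_new(m, &cfg);
    char* reply = quant_ask(c, "Hello!");   /* blocks until generation ends */
    if (reply) {
        printf("%s\n", reply);
        free(reply);                        /* caller frees, per the table */
    }
    quant_free_ctx(c);
    quant_free_model(m);
}
```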

@@ -164,7 +195,15 @@ llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a min
 
 **Can I embed this in my app?**
 
-Yes. Pure C11, zero dependencies. Copy the source files, link against libc/libm, and call `tq_load_model()` / `tq_generate()`. Works on Linux, macOS, Windows, iOS, Android, and WASM. Thread pool is global but mutex-protected.
+Yes. Two options:
+1. **Single-header** (easiest): Copy `quant.h` into your project. `#define QUANT_IMPLEMENTATION` in one .c file. Done.
+2. **Full library**: Link against `libturboquant.a` and call `tq_load_model()` / `tq_generate()`.
+
+Works on Linux, macOS, Windows, iOS, Android, and WASM. Thread pool is global but mutex-protected.
+
+**What's the difference between quant.h and the full build?**
+
+`quant.h` is the core inference engine (15K LOC) in a single file. The full build (33K LOC) adds GPU backends (Metal, CUDA, Vulkan), MoE routing, advanced quantization types, CLI tools, and benchmarks. Use quant.h for embedding; use the full build for research and development.
 
 **What about sub-3-bit quantization?**
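
The commit doesn't show a build line for the full-library route in option 2. Presumably it extends the single-header command with the static archive; the `-L.` path and exact flags below are a guess based only on the `libturboquant.a` name in the FAQ:

```
cc app.c -o app -L. -lturboquant -lm -lpthread   # assumed link line, not from the repo
```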
