
Commit fc2640e

unamedkr (with Mohamed Chorfa and claude) committed:

Salvage non-conflicting parts of PR #13

Cherry-picked from MChorfa's #13 (rejecting the bulk reformatting and quant.h regen, accepting the structural improvements):

- examples/README.md: comprehensive embedding examples doc (205 lines)
- CMakeLists.txt: TQ_BUILD_EXAMPLES option; separate single-header examples (embed_*, single_header_example) so they link only against libm + Threads — proves quant.h truly stands alone.
- ROADMAP.md: mark Direction 1 items complete (API docs, quant.h sync, embedding examples)

Co-Authored-By: Mohamed Chorfa <mohamed.chorfa@thalesgroup.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 749b7d4

3 files changed: 238 additions, 13 deletions

File tree

CMakeLists.txt (30 additions, 8 deletions)

```diff
@@ -11,6 +11,7 @@ option(TQ_BUILD_METAL "Build Metal backend" OFF)
 option(TQ_BUILD_VULKAN "Build Vulkan backend" OFF)
 option(TQ_BUILD_ROCM "Build ROCm/HIP backend" OFF)
 option(TQ_BUILD_SERVER "Build OpenAI-compatible HTTP server" OFF)
+option(TQ_BUILD_EXAMPLES "Build examples" ON)

 # Threads (pthread)
 find_package(Threads REQUIRED)
@@ -276,14 +277,35 @@ if(NOT MSVC)
   target_link_libraries(tq_convert turboquant)
 endif()

-# Examples (always built)
-file(GLOB EXAMPLE_C_SOURCES examples/*.c)
-file(GLOB EXAMPLE_CXX_SOURCES examples/*.cpp)
-foreach(ex_src ${EXAMPLE_C_SOURCES} ${EXAMPLE_CXX_SOURCES})
-  get_filename_component(ex_name ${ex_src} NAME_WE)
-  add_executable(${ex_name} ${ex_src})
-  target_link_libraries(${ex_name} turboquant)
-endforeach()
+# Examples
+if(TQ_BUILD_EXAMPLES)
+  file(GLOB EXAMPLE_C_SOURCES examples/*.c)
+  file(GLOB EXAMPLE_CXX_SOURCES examples/*.cpp)
+  foreach(ex_src ${EXAMPLE_C_SOURCES} ${EXAMPLE_CXX_SOURCES})
+    get_filename_component(ex_name ${ex_src} NAME_WE)
+    # Skip single-header examples (built separately below — they use quant.h, not turboquant)
+    if(ex_name MATCHES "^embed_" OR ex_name STREQUAL "single_header_example")
+      continue()
+    endif()
+    add_executable(${ex_name} ${ex_src})
+    target_link_libraries(${ex_name} turboquant)
+  endforeach()
+
+  # Single-header examples (use quant.h directly — link only libm + threads)
+  if(NOT MSVC)
+    add_executable(embed_minimal examples/embed_minimal.c)
+    target_link_libraries(embed_minimal m Threads::Threads)
+
+    add_executable(embed_chat examples/embed_chat.c)
+    target_link_libraries(embed_chat m Threads::Threads)
+
+    add_executable(embed_kv_compare examples/embed_kv_compare.c)
+    target_link_libraries(embed_kv_compare m Threads::Threads)
+
+    add_executable(single_header_example examples/single_header_example.c)
+    target_link_libraries(single_header_example m Threads::Threads)
+  endif()
+endif()

 # OpenAI-compatible HTTP server (POSIX only — uses sys/socket.h)
 if(TQ_BUILD_SERVER AND NOT MSVC)
```
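With the new option, example builds can be switched off at configure time. A typical invocation (a sketch; it assumes an out-of-source build directory named `build`):

```shell
# Examples default to ON; this is the opt-out.
cmake -B build -DTQ_BUILD_EXAMPLES=OFF
cmake --build build

# The default configuration builds both the turboquant-linked examples and
# the single-header ones (embed_minimal, embed_chat, embed_kv_compare,
# single_header_example), which link only libm and Threads.
cmake -B build -DTQ_BUILD_EXAMPLES=ON
cmake --build build
```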

ROADMAP.md (3 additions, 5 deletions)

```diff
@@ -27,11 +27,9 @@ The world's simplest way to add LLM to a C/C++ project.
 - [x] WASM build (192KB binary)
 - [x] MSVC/MinGW Windows support
 - [x] Zero external dependencies
-
-### In Progress
-- [ ] API documentation (docs/api.md)
-- [ ] quant.h sync with latest source
-- [ ] Embedding examples (minimal, chat, KV compare)
+- [x] API documentation (docs/api.md)
+- [x] quant.h sync with latest source
+- [x] Embedding examples (minimal, chat, KV compare)

 ### Planned
 - [ ] pip install quantcpp (Python bindings)
```

examples/README.md (new file: 205 additions, 0 deletions)

# quant.cpp Embedding Examples

This directory contains examples demonstrating how to embed quant.cpp (the SQLite of LLM inference) into your C/C++ projects.

## Quick Start

The simplest way to use quant.cpp is with the single-header `quant.h`. No build system required:

```bash
cc -O2 -o chat embed_chat.c -lm -lpthread
./chat model.gguf
```

## Examples

### embed_minimal.c
**The smallest possible LLM integration (~60 lines)**

Demonstrates the 6-function API:
- `quant_load()` - Load GGUF model
- `quant_new()` - Create inference context
- `quant_generate()` - Stream tokens via callback
- `quant_free_ctx()` / `quant_free_model()` - Cleanup

**Compile:**
```bash
cc -O2 embed_minimal.c -o minimal -lm -lpthread
```

**Run:**
```bash
./minimal smollm2-135m.gguf "Tell me a joke"
```

**Features:**
- Zero dependencies (libc + pthreads)
- Memory-mapped model loading
- KV cache compression enabled (7x longer context on same hardware)
- Streaming token output
---

### embed_chat.c
**Interactive chat application (~60 lines)**

A complete REPL (Read-Eval-Print Loop) for conversational AI.

**Compile:**
```bash
cc -O2 embed_chat.c -o chat -lm -lpthread
```

**Run:**
```bash
./chat model.gguf
```

**Features:**
- Interactive prompt loop
- Fresh context per turn (no conversation memory)
- Ctrl+C to exit
- Streaming output

**Usage:**
```
Model loaded. Type your message (Ctrl+C to exit).

> Hello!
[AI response streaming...]
```

---

### embed_kv_compare.c
**KV compression quality comparison (~60 lines)**

Runs the same prompt with different KV compression levels to demonstrate quality vs. memory trade-offs.

**Compile:**
```bash
cc -O2 embed_kv_compare.c -o kv_compare -lm -lpthread
```

**Run:**
```bash
./kv_compare model.gguf
```

**Output:**
```
Prompt: What is the capital of France?
==========================================

[FP32 (no compression)]
Output: Paris

[4-bit K + Q4 V]
Output: Paris

[Delta 3-bit + Q4 V]
Output: Paris
```

**Compression Levels:**
- `kv_compress=0` - FP32 KV cache (no compression, highest quality)
- `kv_compress=1` - 4-bit K + Q4 V (7x compression, PPL +0.0%)
- `kv_compress=2` - Delta 3-bit + Q4 V (aggressive compression)
---

### single_header_example.c
**Minimal single-header example (~40 lines)**

The absolute minimum code needed to run inference.

**Compile:**
```bash
cc -O2 single_header_example.c -o example -lm -lpthread
```

**Run:**
```bash
./example model.gguf "Hello, world!"
```

---

## Platform Support

All examples work on:
- **macOS** (Apple Silicon, Intel)
- **Linux** (x86_64, ARM64)
- **Windows** (MSVC, MinGW)
- **WebAssembly** (via Emscripten)
- **iOS** (Xcode toolchain)
- **Android** (NDK)

No external dependencies required beyond libc and pthreads.

## quant.h API Reference

### Configuration

```c
typedef struct {
    float temperature;   // Sampling temperature (default: 0.7)
    float top_p;         // Nucleus sampling (default: 0.9)
    int   max_tokens;    // Max tokens to generate (default: 256)
    int   n_threads;     // Thread count for matmul (default: 4)
    int   kv_compress;   // 0=off, 1=4-bit, 2=delta+3-bit (default: 1)
} quant_config;
```

### Functions

| Function | Description |
|----------|-------------|
| `quant_load(path)` | Load GGUF model from disk |
| `quant_new(model, config)` | Create inference context |
| `quant_generate(ctx, prompt, cb, ud)` | Stream tokens via callback |
| `quant_ask(ctx, prompt)` | Return full response (caller frees) |
| `quant_free_ctx(ctx)` | Free context |
| `quant_free_model(model)` | Free model |
| `quant_version()` | Get version string |

## Memory Requirements

- **Model loading**: Memory-mapped, minimal RAM overhead
- **Inference context**: ~2-4GB for 7B models (depends on sequence length)
- **KV cache compression**: 7x reduction vs FP32 baseline
## Performance Tips

1. **Use KV compression** (`kv_compress=1` or `2`) for 7x longer context
2. **Adjust thread count** (`n_threads`) based on CPU cores
3. **Lower temperature** (0.0-0.3) for factual responses
4. **Higher temperature** (0.7-1.0) for creative writing
## Troubleshooting

**"Failed to load model"**
- Check that the model path is correct
- Verify the model is in GGUF format (llama.cpp compatible)
- Ensure sufficient disk space for mmap

**Slow generation**
- Increase `n_threads` (up to CPU core count)
- Enable KV compression if not already enabled
- Try smaller models for faster inference

**Poor quality output**
- Adjust `temperature` and `top_p` parameters
- Try different KV compression levels
- Ensure the model is appropriate for your use case

## Next Steps

- See `docs/api.md` for full API documentation
- Check `examples/` for advanced usage patterns
- Read `README.md` for the project overview
- Visit https://github.com/quantumaikr/quant.cpp for more information

## License

Apache 2.0 - See LICENSE file for details.