# quant.cpp Embedding Examples

This directory contains examples demonstrating how to embed quant.cpp (the SQLite of LLM inference) into your C/C++ projects.

## Quick Start

The simplest way to use quant.cpp is with the single-header `quant.h`. No build system required:

```bash
cc -O2 -o chat embed_chat.c -lm -lpthread
./chat model.gguf
```

## Examples

### embed_minimal.c
**The smallest possible LLM integration (~60 lines)**

Demonstrates the 6-function API:
- `quant_load()` - Load GGUF model
- `quant_new()` - Create inference context
- `quant_generate()` - Stream tokens via callback
- `quant_free_ctx()` / `quant_free_model()` - Cleanup
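
A minimal sketch of that call sequence, close to what `embed_minimal.c` is described as doing. The `quant_model` and `quant_ctx` type names, the token-callback signature, and passing the config by pointer are assumptions here; check `quant.h` for the exact declarations.

```c
#include <stdio.h>
#include "quant.h"

/* Streaming callback (signature assumed): print each token as it arrives. */
static void on_token(const char *token, void *userdata) {
    (void)userdata;
    fputs(token, stdout);
    fflush(stdout);
}

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model.gguf> <prompt>\n", argv[0]);
        return 1;
    }

    quant_model *model = quant_load(argv[1]);      /* memory-maps the GGUF file */
    if (!model) return 1;

    quant_config cfg = {
        .temperature = 0.7f,
        .top_p       = 0.9f,
        .max_tokens  = 256,
        .n_threads   = 4,
        .kv_compress = 1,                          /* 4-bit K + Q4 V cache */
    };

    quant_ctx *ctx = quant_new(model, &cfg);       /* create inference context */
    quant_generate(ctx, argv[2], on_token, NULL);  /* stream tokens via callback */
    printf("\n");

    quant_free_ctx(ctx);
    quant_free_model(model);
    return 0;
}
```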

**Compile:**
```bash
cc -O2 embed_minimal.c -o minimal -lm -lpthread
```

**Run:**
```bash
./minimal smollm2-135m.gguf "Tell me a joke"
```

**Features:**
- Zero dependencies (libc + pthreads)
- Memory-mapped model loading
- KV cache compression enabled (7x longer context on same hardware)
- Streaming token output

---

### embed_chat.c
**Interactive chat application (~60 lines)**

A complete REPL (Read-Eval-Print Loop) for conversational AI.

**Compile:**
```bash
cc -O2 embed_chat.c -o chat -lm -lpthread
```

**Run:**
```bash
./chat model.gguf
```

**Features:**
- Interactive prompt loop
- Fresh context per turn (no conversation memory)
- Ctrl+C to exit
- Streaming output

**Usage:**
```
Model loaded. Type your message (Ctrl+C to exit).

> Hello!
[AI response streaming...]
```
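
The "fresh context per turn" behavior listed above boils down to a loop like the one below. This sketch reuses the `model`, `cfg`, and `on_token` names from the `embed_minimal.c` sketch, so the same caveats about assumed type names and callback signature apply.

```c
/* Per-turn loop sketch: each prompt gets its own context, so no
 * conversation state (KV cache) carries over between turns. */
char line[1024];
printf("Model loaded. Type your message (Ctrl+C to exit).\n\n> ");
while (fgets(line, sizeof line, stdin)) {
    quant_ctx *ctx = quant_new(model, &cfg);    /* fresh context per turn */
    quant_generate(ctx, line, on_token, NULL);  /* stream the reply */
    quant_free_ctx(ctx);                        /* drop this turn's KV cache */
    printf("\n> ");
}
```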

---

### embed_kv_compare.c
**KV compression quality comparison (~60 lines)**

Runs the same prompt with different KV compression levels to demonstrate quality vs. memory trade-offs.

**Compile:**
```bash
cc -O2 embed_kv_compare.c -o kv_compare -lm -lpthread
```

**Run:**
```bash
./kv_compare model.gguf
```

**Output:**
```
Prompt: What is the capital of France?
==========================================

[FP32 (no compression)]
  Output: Paris

[4-bit K + Q4 V]
  Output: Paris

[Delta 3-bit + Q4 V]
  Output: Paris
```

**Compression Levels:**
- `kv_compress=0` - FP32 KV cache (no compression, highest quality)
- `kv_compress=1` - 4-bit K + Q4 V (7x compression, PPL +0.0%)
- `kv_compress=2` - Delta 3-bit + Q4 V (aggressive compression)
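
The core of the comparison is roughly the loop below: the same prompt is answered once per `kv_compress` level and the outputs are printed side by side. It reuses a loaded `model` from the earlier sketches and assumes `quant_ask()` returns a heap-allocated string that the caller frees, as the API table further down states; `free()` needs `<stdlib.h>`.

```c
/* Comparison loop sketch: same model and prompt, different KV cache format. */
const char *labels[] = { "FP32 (no compression)", "4-bit K + Q4 V", "Delta 3-bit + Q4 V" };
const char *prompt   = "What is the capital of France?";

for (int level = 0; level <= 2; level++) {
    quant_config cfg = {
        .temperature = 0.0f,      /* deterministic output for a fair comparison */
        .top_p       = 0.9f,
        .max_tokens  = 32,
        .n_threads   = 4,
        .kv_compress = level,     /* 0, 1, 2 as listed above */
    };
    quant_ctx *ctx = quant_new(model, &cfg);
    char *reply    = quant_ask(ctx, prompt);   /* blocking: returns the full response */
    printf("[%s]\n  Output: %s\n\n", labels[level], reply);
    free(reply);                               /* caller frees (see API table) */
    quant_free_ctx(ctx);
}
```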

---

### single_header_example.c
**Minimal single-header example (~40 lines)**

The absolute minimum code needed to run inference.

**Compile:**
```bash
cc -O2 single_header_example.c -o example -lm -lpthread
```

**Run:**
```bash
./example model.gguf "Hello, world!"
```

---

## Platform Support

All examples work on:
- **macOS** (Apple Silicon, Intel)
- **Linux** (x86_64, ARM64)
- **Windows** (MSVC, MinGW)
- **WebAssembly** (via Emscripten)
- **iOS** (Xcode toolchain)
- **Android** (NDK)

No external dependencies required beyond libc and pthreads.

## quant.h API Reference

### Configuration

```c
typedef struct {
    float temperature;   // Sampling temperature (default: 0.7)
    float top_p;         // Nucleus sampling (default: 0.9)
    int   max_tokens;    // Max tokens to generate (default: 256)
    int   n_threads;     // Thread count for matmul (default: 4)
    int   kv_compress;   // 0=off, 1=4-bit, 2=delta+3-bit (default: 1)
} quant_config;
```
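
As an example, a context tuned for short, factual answers on an 8-core machine might be configured as below. Whether a zero-initialized struct picks up the documented defaults is not stated here, so this sketch sets every field explicitly.

```c
/* Explicit configuration sketch; field meanings follow the struct above. */
quant_config cfg = {
    .temperature = 0.2f,   /* low temperature for factual responses */
    .top_p       = 0.9f,
    .max_tokens  = 128,
    .n_threads   = 8,      /* match your physical core count */
    .kv_compress = 1,      /* 4-bit K + Q4 V (the documented default) */
};
```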

### Functions

| Function | Description |
|----------|-------------|
| `quant_load(path)` | Load GGUF model from disk |
| `quant_new(model, config)` | Create inference context |
| `quant_generate(ctx, prompt, cb, ud)` | Stream tokens via callback |
| `quant_ask(ctx, prompt)` | Return full response (caller frees) |
| `quant_free_ctx(ctx)` | Free context |
| `quant_free_model(model)` | Free model |
| `quant_version()` | Get version string |

## Memory Requirements

- **Model loading**: Memory-mapped, minimal RAM overhead
- **Inference context**: ~2-4GB for 7B models (depends on sequence length)
- **KV cache compression**: 7x reduction vs FP32 baseline

## Performance Tips

1. **Use KV compression** (`kv_compress=1` or `2`) for 7x longer context
2. **Adjust thread count** (`n_threads`) based on CPU cores
3. **Lower temperature** (0.0-0.3) for factual responses
4. **Higher temperature** (0.7-1.0) for creative writing

## Troubleshooting

**"Failed to load model"**
- Check model path is correct
- Verify model is in GGUF format (llama.cpp compatible)
- Ensure sufficient disk space for mmap
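
A load failure is easiest to diagnose when the return value is actually checked. Assuming `quant_load()` returns NULL on failure (an assumption; confirm against `quant.h`), a sketch looks like:

```c
/* Fail fast with a useful message instead of passing a null model around. */
quant_model *model = quant_load(path);
if (!model) {
    fprintf(stderr, "Failed to load model: %s\n", path);
    return 1;
}
```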

**Slow generation**
- Increase `n_threads` (up to CPU core count)
- Enable KV compression if it is not already enabled
- Try smaller models for faster inference

**Poor quality output**
- Adjust `temperature` and `top_p` parameters
- Try different KV compression levels
- Ensure the model is appropriate for your use case

## Next Steps

- See `docs/api.md` for full API documentation
- Check `examples/` for advanced usage patterns
- Read `README.md` for the project overview
- Visit https://github.com/quantumaikr/quant.cpp for more information

## License

Apache 2.0 - See the LICENSE file for details.