# quant.cpp Embedding Examples

This directory contains examples demonstrating how to embed quant.cpp (the SQLite of LLM inference) into your C/C++ projects.

## Quick Start

The simplest way to use quant.cpp is with the single-header `quant.h`. No build system required:

```bash
cc -O2 -o chat embed_chat.c -lm -lpthread
./chat model.gguf
```

## Examples

### embed_minimal.c
**The smallest possible LLM integration (~60 lines)**

Demonstrates the 6-function API:
- `quant_load()` - Load GGUF model
- `quant_new()` - Create inference context
- `quant_generate()` - Stream tokens via callback
- `quant_free_ctx()` / `quant_free_model()` - Cleanup
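
A minimal sketch of that call sequence, close to what `embed_minimal.c` is described as doing. The `quant_model` and `quant_ctx` type names, the token-callback signature, and passing the config by pointer are assumptions here; check `quant.h` for the exact declarations.

```c
#include <stdio.h>
#include "quant.h"

/* Streaming callback (signature assumed): print each token as it arrives. */
static void on_token(const char *token, void *userdata) {
    (void)userdata;
    fputs(token, stdout);
    fflush(stdout);
}

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model.gguf> <prompt>\n", argv[0]);
        return 1;
    }

    quant_model *model = quant_load(argv[1]);      /* memory-maps the GGUF file */
    if (!model) return 1;

    quant_config cfg = {
        .temperature = 0.7f,
        .top_p       = 0.9f,
        .max_tokens  = 256,
        .n_threads   = 4,
        .kv_compress = 1,                          /* 4-bit K + Q4 V cache */
    };

    quant_ctx *ctx = quant_new(model, &cfg);       /* create inference context */
    quant_generate(ctx, argv[2], on_token, NULL);  /* stream tokens via callback */
    printf("\n");

    quant_free_ctx(ctx);
    quant_free_model(model);
    return 0;
}
```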

**Compile:**
```bash
cc -O2 embed_minimal.c -o minimal -lm -lpthread
```

**Run:**
```bash
./minimal smollm2-135m.gguf "Tell me a joke"
```

**Features:**
- Zero dependencies (libc + pthreads)
- Memory-mapped model loading
- KV cache compression enabled (7x longer context on same hardware)
- Streaming token output

---

### embed_chat.c
**Interactive chat application (~60 lines)**

A complete REPL (Read-Eval-Print Loop) for conversational AI.

**Compile:**
```bash
cc -O2 embed_chat.c -o chat -lm -lpthread
```

**Run:**
```bash
./chat model.gguf
```

**Features:**
- Interactive prompt loop
- Fresh context per turn (no conversation memory)
- Ctrl+C to exit
- Streaming output

**Usage:**
```
Model loaded. Type your message (Ctrl+C to exit).

> Hello!
[AI response streaming...]
```
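
The "fresh context per turn" behavior listed above boils down to a loop like the one below. This sketch reuses the `model`, `cfg`, and `on_token` names from the `embed_minimal.c` sketch, so the same caveats about assumed type names and callback signature apply.

```c
/* Per-turn loop sketch: each prompt gets its own context, so no
 * conversation state (KV cache) carries over between turns. */
char line[1024];
printf("Model loaded. Type your message (Ctrl+C to exit).\n\n> ");
while (fgets(line, sizeof line, stdin)) {
    quant_ctx *ctx = quant_new(model, &cfg);    /* fresh context per turn */
    quant_generate(ctx, line, on_token, NULL);  /* stream the reply */
    quant_free_ctx(ctx);                        /* drop this turn's KV cache */
    printf("\n> ");
}
```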

---

### embed_kv_compare.c
**KV compression quality comparison (~60 lines)**

Runs the same prompt with different KV compression levels to demonstrate quality vs. memory trade-offs.

**Compile:**
```bash
cc -O2 embed_kv_compare.c -o kv_compare -lm -lpthread
```

**Run:**
```bash
./kv_compare model.gguf
```

**Output:**
```
Prompt: What is the capital of France?
==========================================

[FP32 (no compression)]
  Output: Paris

[4-bit K + Q4 V]
  Output: Paris

[Delta 3-bit + Q4 V]
  Output: Paris
```

**Compression Levels:**
- `kv_compress=0` - FP32 KV cache (no compression, highest quality)
- `kv_compress=1` - 4-bit K + Q4 V (7x compression, PPL +0.0%)
- `kv_compress=2` - Delta 3-bit + Q4 V (aggressive compression)
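
The core of the comparison is roughly the loop below: the same prompt is answered once per `kv_compress` level and the outputs are printed side by side. It reuses a loaded `model` from the earlier sketches and assumes `quant_ask()` returns a heap-allocated string that the caller frees, as the API table further down states; `free()` needs `<stdlib.h>`.

```c
/* Comparison loop sketch: same model and prompt, different KV cache format. */
const char *labels[] = { "FP32 (no compression)", "4-bit K + Q4 V", "Delta 3-bit + Q4 V" };
const char *prompt   = "What is the capital of France?";

for (int level = 0; level <= 2; level++) {
    quant_config cfg = {
        .temperature = 0.0f,      /* deterministic output for a fair comparison */
        .top_p       = 0.9f,
        .max_tokens  = 32,
        .n_threads   = 4,
        .kv_compress = level,     /* 0, 1, 2 as listed above */
    };
    quant_ctx *ctx = quant_new(model, &cfg);
    char *reply    = quant_ask(ctx, prompt);   /* blocking: returns the full response */
    printf("[%s]\n  Output: %s\n\n", labels[level], reply);
    free(reply);                               /* caller frees (see API table) */
    quant_free_ctx(ctx);
}
```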

---

### single_header_example.c
**Minimal single-header example (~40 lines)**

The absolute minimum code needed to run inference.

**Compile:**
```bash
cc -O2 single_header_example.c -o example -lm -lpthread
```

**Run:**
```bash
./example model.gguf "Hello, world!"
```

---

## Platform Support

All examples work on:
- **macOS** (Apple Silicon, Intel)
- **Linux** (x86_64, ARM64)
- **Windows** (MSVC, MinGW)
- **WebAssembly** (via Emscripten)
- **iOS** (Xcode toolchain)
- **Android** (NDK)

No external dependencies required beyond libc and pthreads.

## quant.h API Reference

### Configuration

```c
typedef struct {
    float temperature;   // Sampling temperature (default: 0.7)
    float top_p;         // Nucleus sampling (default: 0.9)
    int   max_tokens;    // Max tokens to generate (default: 256)
    int   n_threads;     // Thread count for matmul (default: 4)
    int   kv_compress;   // 0=off, 1=4-bit, 2=delta+3-bit (default: 1)
} quant_config;
```
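
As an example, a context tuned for short, factual answers on an 8-core machine might be configured as below. Whether a zero-initialized struct picks up the documented defaults is not stated here, so this sketch sets every field explicitly.

```c
/* Explicit configuration sketch; field meanings follow the struct above. */
quant_config cfg = {
    .temperature = 0.2f,   /* low temperature for factual responses */
    .top_p       = 0.9f,
    .max_tokens  = 128,
    .n_threads   = 8,      /* match your physical core count */
    .kv_compress = 1,      /* 4-bit K + Q4 V (the documented default) */
};
```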

### Functions

| Function | Description |
|----------|-------------|
| `quant_load(path)` | Load GGUF model from disk |
| `quant_new(model, config)` | Create inference context |
| `quant_generate(ctx, prompt, cb, ud)` | Stream tokens via callback |
| `quant_ask(ctx, prompt)` | Return full response (caller frees) |
| `quant_free_ctx(ctx)` | Free context |
| `quant_free_model(model)` | Free model |
| `quant_version()` | Get version string |

## Memory Requirements

- **Model loading**: Memory-mapped, minimal RAM overhead
- **Inference context**: ~2-4GB for 7B models (depends on sequence length)
- **KV cache compression**: 7x reduction vs FP32 baseline

## Performance Tips

1. **Use KV compression** (`kv_compress=1` or `2`) for 7x longer context
2. **Adjust thread count** (`n_threads`) based on CPU cores
3. **Lower temperature** (0.0-0.3) for factual responses
4. **Higher temperature** (0.7-1.0) for creative writing

## Troubleshooting

**"Failed to load model"**
- Check model path is correct
- Verify model is in GGUF format (llama.cpp compatible)
- Ensure sufficient disk space for mmap
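
A load failure is easiest to diagnose when the return value is actually checked. Assuming `quant_load()` returns NULL on failure (an assumption; confirm against `quant.h`), a sketch looks like:

```c
/* Fail fast with a useful message instead of passing a null model around. */
quant_model *model = quant_load(path);
if (!model) {
    fprintf(stderr, "Failed to load model: %s\n", path);
    return 1;
}
```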

**Slow generation**
- Increase `n_threads` (up to CPU core count)
- Enable KV compression if it is not already enabled
- Try smaller models for faster inference

**Poor quality output**
- Adjust `temperature` and `top_p` parameters
- Try different KV compression levels
- Ensure the model is appropriate for your use case

## Next Steps

- See `docs/api.md` for full API documentation
- Check `examples/` for advanced usage patterns
- Read `README.md` for the project overview
- Visit https://github.com/quantumaikr/quant.cpp for more information

## License

Apache 2.0 - See the LICENSE file for details.