Commit ba2d1aa
Add quant.h — single-header LLM inference library (15K LOC)
stb-style single-header C library for LLM inference.
One file. Six functions. Zero dependencies beyond libc/pthreads.
Usage:

    #define QUANT_IMPLEMENTATION
    #include "quant.h"

    quant_model* m = quant_load("model.gguf");
    quant_ctx*   c = quant_new(m, NULL);
    quant_generate(c, "Hello!", print_token, NULL);
    quant_free_ctx(c);
    quant_free_model(m);

Build: cc app.c -o app -lm -lpthread
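The usage snippet passes a print_token callback that the caller must supply. A minimal sketch of such a callback follows; the exact signature quant.h expects is an assumption here (token text plus an opaque user-data pointer), so check the header before copying it.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical streaming callback for quant_generate(). The signature
   (const char* token, void* user_data) is an assumption, not taken
   from quant.h itself. */
static void print_token(const char* token, void* user_data) {
    fputs(token, stdout);   /* stream each token as it is generated */
    fflush(stdout);         /* flush so output appears immediately */
    if (user_data)          /* optionally accumulate into a caller-owned buffer */
        strcat((char*)user_data, token);
}
```

It is then passed as the third argument of quant_generate(), with the buffer (or NULL) as the fourth.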
Features included:
- GGUF model loading (all K-quant types, IQ2, BF16, F16)
- BPE tokenizer (from GGUF metadata)
- Transformer forward pass (Llama, Qwen3.5, Gemma 3/4)
- Uniform 4-bit/3-bit/2-bit KV cache compression
- Delta KV compression
- NEON SIMD (ARM), scalar fallback (x86)
- Multi-threaded matmul
- Top-p sampling with temperature
What's excluded (use full quant.cpp for these):
- Metal/CUDA/Vulkan GPU backends
- MoE routing
- Polar/QJL/Turbo quantization types
Tested: given the prompt "The capital of France is", SmolLM2-1.7B
produces "Paris, a city rich in history". Qwen3.5-0.8B also works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 files changed: 15,441 additions & 0 deletions