
Commit ba2d1aa

unamedkr and claude committed
Add quant.h — single-header LLM inference library (15K LOC)
stb-style single-header C library for LLM inference. One file. Six functions. Zero dependencies beyond libc/pthreads.

Usage:

    #define QUANT_IMPLEMENTATION
    #include "quant.h"

    quant_model* m = quant_load("model.gguf");
    quant_ctx* c = quant_new(m, NULL);
    quant_generate(c, "Hello!", print_token, NULL);
    quant_free_ctx(c);
    quant_free_model(m);

Build: cc app.c -o app -lm -lpthread

Features included:
- GGUF model loading (all K-quant types, IQ2, BF16, F16)
- BPE tokenizer (from GGUF metadata)
- Transformer forward pass (Llama, Qwen3.5, Gemma 3/4)
- Uniform 4-bit/3-bit/2-bit KV cache compression
- Delta KV compression
- NEON SIMD (ARM), scalar fallback (x86)
- Multi-threaded matmul
- Top-p sampling with temperature

What's excluded (use full quant.cpp for these):
- Metal/CUDA/Vulkan GPU backends
- MoE routing
- Polar/QJL/Turbo quantization types

Tested: SmolLM2-1.7B produces "Paris, a city rich in history" for the prompt "The capital of France is". Qwen3.5-0.8B also works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 71ee7a4 commit ba2d1aa

2 files changed

Lines changed: 15441 additions & 0 deletions

File tree

examples/single_header_example.c

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
#define QUANT_IMPLEMENTATION
#include "../quant.h"
#include <stdio.h>

static void print_token(const char* text, void* ud) {
    (void)ud;
    printf("%s", text);
    fflush(stdout);
}

int main(int argc, char** argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <model.gguf> [prompt]\n", argv[0]);
        return 1;
    }
    const char* prompt = argc > 2 ? argv[2] : "Hello!";

    quant_model* m = quant_load(argv[1]);
    if (!m) { fprintf(stderr, "Failed to load model\n"); return 1; }

    quant_config cfg = {
        .temperature = 0.7f,
        .top_p = 0.9f,
        .max_tokens = 100,
        .n_threads = 4,
        .kv_compress = 1  // 4-bit KV compression
    };
    quant_ctx* c = quant_new(m, &cfg);

    printf("Prompt: %s\n---\n", prompt);
    quant_generate(c, prompt, print_token, NULL);
    printf("\n---\n");

    quant_free_ctx(c);
    quant_free_model(m);
    return 0;
}
