**Where we match**: Q8_0 on 3B-class models — our NEON int8×int8 fused dot is competitive with llama.cpp's hand-tuned assembly.
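For context, the core of such a fused kernel looks roughly like the sketch below. This is an illustration, not our actual code: the `block_q8_0` layout mirrors llama.cpp's Q8_0 (32 int8 weights plus one fp16 scale per block), the function name is hypothetical, and `vdotq_s32` assumes a CPU with the ARMv8.2 dot-product extension (e.g. compile with `-march=armv8.2-a+dotprod`).

```c
#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical block layout mirroring llama.cpp's Q8_0:
 * 32 int8 weights plus one fp16 scale per block. */
typedef struct {
    __fp16 d;       /* per-block scale */
    int8_t qs[32];  /* quantized weights */
} block_q8_0;

/* Fused int8×int8 dot over n values (n a multiple of 32): accumulate
 * in int32 with SDOT, then apply the product of the two block scales
 * once per block instead of dequantizing every element to float. */
static float dot_q8_0(const block_q8_0 *x, const block_q8_0 *y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n / 32; i++) {
        int32x4_t acc = vdupq_n_s32(0);
        for (int j = 0; j < 32; j += 16) {
            int8x16_t xv = vld1q_s8(x[i].qs + j);
            int8x16_t yv = vld1q_s8(y[i].qs + j);
            /* SDOT: each int32 lane accumulates four int8×int8 products */
            acc = vdotq_s32(acc, xv, yv);
        }
        sum += (float)x[i].d * (float)y[i].d * (float)vaddvq_s32(acc);
    }
    return sum;
}
```

Keeping the accumulation in int32 and touching float only once per block is what makes the kernel "fused" — and is why Q8_0 is the format where a straightforward intrinsics implementation can get close to hand-tuned assembly.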
**Where we lag**:

- **Q4_K_M** (mixed Q2_K/Q3_K/Q4_K) — llama.cpp has years of assembly tuning on these K-quant types. We have NEON kernels for Q4_K and Q2_K but not Q3_K, and our implementations are ~2-5× slower than llama.cpp's tuned code (see the nibble-unpack sketch after this list).
- **Large vocab models** (Gemma 4, 262K vocab) — the lm_head matmul alone is 2560×262144. llama.cpp's Q8_0 matmul benefits disproportionately from CPU-specific tiling we haven't implemented (see the tiling sketch after this list).
- **Tiny models** (<1B) — llama.cpp's scheduling and batching overhead is optimized for cases where per-matmul work is small.
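On the K-quant gap: the expensive inner step of any 4-bit kernel is splitting packed nibbles before scales are applied. The sketch below shows only that step and deliberately ignores Q4_K's real super-block structure (per-sub-block 6-bit scales and mins); the helper name is illustrative.

```c
#include <arm_neon.h>
#include <stdint.h>

/* Unpack 32 packed 4-bit values (16 bytes) into two int8x16_t vectors.
 * Values are left in their biased [0,15] range; a real Q4_K kernel
 * still has to apply per-sub-block scales and subtract mins. */
static inline void unpack_q4_nibbles(const uint8_t *packed,
                                     int8x16_t *lo, int8x16_t *hi) {
    uint8x16_t v    = vld1q_u8(packed);
    uint8x16_t mask = vdupq_n_u8(0x0F);
    *lo = vreinterpretq_s8_u8(vandq_u8(v, mask)); /* low nibbles  */
    *hi = vreinterpretq_s8_u8(vshrq_n_u8(v, 4));  /* high nibbles */
}
```

Because this unpack sits inside the hot loop, and Q4_K_M interleaves three quant types with different scale layouts, there is far more room for per-CPU scheduling tricks than in Q8_0 — which is where llama.cpp's years of assembly tuning pay off.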
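The "CPU-specific tiling" in the large-vocab bullet is cache blocking of the matmul loops. A minimal scalar fp32 illustration of the idea follows, with hypothetical tile sizes; real kernels pick tiles per-CPU so each tile's working set fits in L1/L2, and this is not the kernel llama.cpp actually ships.

```c
#include <stddef.h>

/* Cache-blocked matmul sketch: C[M×N] += A[M×K] · B[K×N], row-major.
 * For a 2560×262144 lm_head, blocking over N keeps a tile of B
 * resident in cache instead of streaming the whole 262K-wide matrix
 * through cache for every output row. Tile sizes are illustrative. */
enum { TM = 64, TN = 256, TK = 512 };

static void matmul_tiled(float *C, const float *A, const float *B,
                         size_t M, size_t N, size_t K) {
    for (size_t i0 = 0; i0 < M; i0 += TM)
    for (size_t j0 = 0; j0 < N; j0 += TN)
    for (size_t k0 = 0; k0 < K; k0 += TK)
        for (size_t i = i0; i < M && i < i0 + TM; i++)
        for (size_t j = j0; j < N && j < j0 + TN; j++) {
            float acc = C[i * N + j];
            for (size_t k = k0; k < K && k < k0 + TK; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}
```

The larger the output dimension, the more an untiled loop thrashes cache, which is why a 262K-row lm_head is exactly where this gap hurts us most.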
**Known correctness issues**:
- **Qwen3.5-4B** (`architecture = qwen35`, DeltaNet hybrid) — the forward pass runs without errors but produces whitespace-only output, where llama.cpp produces `<think>...` reasoning. Root cause unidentified.
**Why we still exist**: llama.cpp is ~500K LOC (C++/CUDA/Metal/Vulkan). quant.cpp is ~17.6K LOC of C with zero dependencies. If you want the fastest inference, use llama.cpp. If you want something you can read end-to-end, embed in a single `.c` file, or use as a research platform for new KV compression methods — we're the alternative.