Commit 6095573

unamedkr and claude committed
FAQ: add "llama.cpp Q4 KV vs quant.cpp" comparison (en + ko)
New FAQ entry addresses the most common question: "llama.cpp already has Q4 KV quantization, how is yours different?" Answer: same 4-bit, different quality. llama.cpp Q4_0 = PPL +10.6%, quant.cpp = PPL +0.0%. Difference is independent K/V quantization plus delta compression (3-bit at +1.3%, no llama.cpp equivalent). Applied to both README.md and README.ko.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 2b8d09d · commit 6095573

2 files changed: 18 additions & 2 deletions

README.ko.md (9 additions & 1 deletion)

@@ -197,7 +197,15 @@ GGUF format. Load any llama.cpp-compatible model file.
 
 **How is this different from llama.cpp?**
 
-llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (33K LOC) you can read, modify, and drop into your own project. The KV-compression difference: llama.cpp Q4_0 is PPL +10.6% on SmolLM2 1.7B; quant.cpp 4-bit K is PPL +0.0%.
+llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (33K LOC) you can read, modify, and drop into your own project.
+
+**llama.cpp/ollama also have Q4 KV quantization. How is this different?**
+
+Both are 4-bit, but the quality gap is large. On SmolLM2 1.7B:
+- llama.cpp Q4_0 KV: PPL **+10.6%** (noticeable degradation)
+- quant.cpp 4-bit K: PPL **+0.0%** (lossless)
+
+The difference: llama.cpp applies the same quantization to both K and V. quant.cpp quantizes K and V independently, each with the method best suited to it. On top of that, quant.cpp has delta compression of its own: it stores only the differences between adjacent keys, going down to 3-bit at only a +1.3% PPL increase. llama.cpp has no such feature.
 
 **Can I embed this in my app?**
 
README.md (9 additions & 1 deletion)

@@ -191,7 +191,15 @@ GGUF format. Load any llama.cpp-compatible model file.
 
 **How is this different from llama.cpp?**
 
-llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (33K LOC) you can read, modify, and embed in your own C/C++ project. On KV compression specifically: llama.cpp Q4_0 gives PPL +10.6% on SmolLM2 1.7B; quant.cpp gives +0.0%.
+llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (33K LOC) you can read, modify, and embed in your own C/C++ project.
+
+**llama.cpp/ollama already have Q4 KV quantization. How is yours better?**
+
+Both use 4 bits per element, but quality differs significantly. On SmolLM2 1.7B:
+- llama.cpp Q4_0 KV: PPL **+10.6%** (noticeable degradation)
+- quant.cpp 4-bit K: PPL **+0.0%** (lossless)
+
+The difference: llama.cpp applies the same quantization scheme to both K and V. quant.cpp quantizes K and V independently with type-appropriate methods. Additionally, quant.cpp offers delta compression, encoding the difference between adjacent keys instead of absolute values, which pushes to 3-bit at only +1.3% PPL. llama.cpp has no equivalent.
 
 **Can I embed this in my app?**
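For readers wondering what "encoding the difference between adjacent keys" looks like in practice, here is a minimal sketch of one way such 3-bit delta compression could work. It is illustrative only: the names (`DeltaKey`, `encode_key_delta`) and the flat per-key scale are assumptions made for this example, not quant.cpp's actual API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// One delta-compressed key: a per-key scale plus one 3-bit code per element.
// (A real implementation would bit-pack the codes; bytes are used for clarity.)
struct DeltaKey {
    float scale;
    std::vector<uint8_t> codes;
};

// Encode a key as the element-wise difference from the previous key.
// Adjacent keys in the cache tend to be strongly correlated, so the residual
// has a much smaller dynamic range than the absolute values and tolerates
// 3-bit quantization far better.
DeltaKey encode_key_delta(const float* prev, const float* cur, int dim) {
    DeltaKey out;
    out.codes.resize(dim);
    float amax = 0.0f;
    for (int i = 0; i < dim; ++i)
        amax = std::max(amax, std::fabs(cur[i] - prev[i]));
    out.scale = amax / 4.0f;  // 3 bits -> signed levels in [-4, 3]
    const float inv = out.scale > 0.0f ? 1.0f / out.scale : 0.0f;
    for (int i = 0; i < dim; ++i) {
        int q = static_cast<int>(std::lround((cur[i] - prev[i]) * inv));
        q = std::clamp(q, -4, 3);
        out.codes[i] = static_cast<uint8_t>(q + 4);  // stored unsigned in [0, 7]
    }
    return out;
}

// Reconstruct a key by adding the dequantized residual onto the previously
// reconstructed key; decoding therefore walks the chain in order.
void decode_key_delta(const DeltaKey& k, const float* prev, float* cur, int dim) {
    for (int i = 0; i < dim; ++i)
        cur[i] = prev[i] + (static_cast<int>(k.codes[i]) - 4) * k.scale;
}
```

One detail any real implementation has to face: because each key is reconstructed relative to its predecessor, quantization error can accumulate along the chain. Passing the previously *reconstructed* key (rather than the raw one) as `prev` during encoding keeps that error bounded.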
