Commit 6095573

unamedkr and claude committed
FAQ: add "llama.cpp Q4 KV vs quant.cpp" comparison (en + ko)
New FAQ entry addresses the most common question: "llama.cpp already has Q4 KV quantization, how is yours different?" Answer: same 4-bit, different quality. llama.cpp Q4_0 = PPL +10.6%, quant.cpp = PPL +0.0%. Difference is independent K/V quantization plus delta compression (3-bit at +1.3%, no llama.cpp equivalent). Applied to both README.md and README.ko.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 2b8d09d · commit 6095573

2 files changed: 18 additions & 2 deletions

README.ko.md (9 additions & 1 deletion)

@@ -197,7 +197,15 @@ GGUF format. Load any llama.cpp-compatible model file.
 
 **How is this different from llama.cpp?**
 
-llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (33K LOC) you can read, modify, and drop into your own project. The KV-compression difference: llama.cpp Q4_0 is PPL +10.6% on SmolLM2 1.7B; quant.cpp 4-bit K is PPL +0.0%.
+llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (33K LOC) you can read, modify, and drop into your own project.
+
+**llama.cpp/ollama also have Q4 KV quantization. How is this different?**
+
+Both are 4-bit, but the quality gap is large. On SmolLM2 1.7B:
+- llama.cpp Q4_0 KV: PPL **+10.6%** (noticeable degradation)
+- quant.cpp 4-bit K: PPL **+0.0%** (lossless)
+
+The difference: llama.cpp applies the same quantization to both K and V. quant.cpp quantizes K and V independently, each with the method best suited to it. On top of that, quant.cpp has delta compression of its own: it stores only the differences between adjacent keys, going down to 3-bit at only a +1.3% PPL increase. llama.cpp has no such feature.
 
 **Can I embed this in my app?**
 
README.md (9 additions & 1 deletion)

@@ -191,7 +191,15 @@ GGUF format. Load any llama.cpp-compatible model file.
 
 **How is this different from llama.cpp?**
 
-llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (33K LOC) you can read, modify, and embed in your own C/C++ project. On KV compression specifically: llama.cpp Q4_0 gives PPL +10.6% on SmolLM2 1.7B; quant.cpp gives +0.0%.
+llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (33K LOC) you can read, modify, and embed in your own C/C++ project.
+
+**llama.cpp/ollama already have Q4 KV quantization. How is yours better?**
+
+Both use 4 bits per element, but quality differs significantly. On SmolLM2 1.7B:
+- llama.cpp Q4_0 KV: PPL **+10.6%** (noticeable degradation)
+- quant.cpp 4-bit K: PPL **+0.0%** (lossless)
+
+The difference: llama.cpp applies the same quantization scheme to both K and V. quant.cpp quantizes K and V independently with type-appropriate methods. Additionally, quant.cpp offers delta compression, encoding the difference between adjacent keys instead of absolute values, which pushes to 3-bit at only +1.3% PPL. llama.cpp has no equivalent.
 
 **Can I embed this in my app?**
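For readers wondering what "encoding the difference between adjacent keys" looks like in practice, here is a minimal sketch of one way such 3-bit delta compression could work. It is illustrative only: the names (`DeltaKey`, `encode_key_delta`) and the flat per-key scale are assumptions made for this example, not quant.cpp's actual API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// One delta-compressed key: a per-key scale plus one 3-bit code per element.
// (A real implementation would bit-pack the codes; bytes are used for clarity.)
struct DeltaKey {
    float scale;
    std::vector<uint8_t> codes;
};

// Encode a key as the element-wise difference from the previous key.
// Adjacent keys in the cache tend to be strongly correlated, so the residual
// has a much smaller dynamic range than the absolute values and tolerates
// 3-bit quantization far better.
DeltaKey encode_key_delta(const float* prev, const float* cur, int dim) {
    DeltaKey out;
    out.codes.resize(dim);
    float amax = 0.0f;
    for (int i = 0; i < dim; ++i)
        amax = std::max(amax, std::fabs(cur[i] - prev[i]));
    out.scale = amax / 4.0f;  // 3 bits -> signed levels in [-4, 3]
    const float inv = out.scale > 0.0f ? 1.0f / out.scale : 0.0f;
    for (int i = 0; i < dim; ++i) {
        int q = static_cast<int>(std::lround((cur[i] - prev[i]) * inv));
        q = std::clamp(q, -4, 3);
        out.codes[i] = static_cast<uint8_t>(q + 4);  // stored unsigned in [0, 7]
    }
    return out;
}

// Reconstruct a key by adding the dequantized residual onto the previously
// reconstructed key; decoding therefore walks the chain in order.
void decode_key_delta(const DeltaKey& k, const float* prev, float* cur, int dim) {
    for (int i = 0; i < dim; ++i)
        cur[i] = prev[i] + (static_cast<int>(k.codes[i]) - 4) * k.scale;
}
```

One detail any real implementation has to face: because each key is reconstructed relative to its predecessor, quantization error can accumulate along the chain. Passing the previously *reconstructed* key (rather than the raw one) as `prev` during encoding keeps that error bounded.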
