
Commit 49222c5

unamedkr and claude committed

Plan v1.2: From Research to Adoption (target 1000+ stars)

PRD v1.2: 4 phases — llama.cpp fork, standard benchmarks, killer demo, paper
WBS v1.2: 20 concrete tasks with verification checkpoints
Phase 1: llama.cpp fork with --cache-type-k tq_kv_1b (Days 1-3)
Phase 2: WikiText-2 PPL comparison vs Q4/Q8/FP16 (Days 3-5)
Phase 3: 128K context demo on 16GB Mac (Days 5-7)
Phase 4: arXiv paper + GitHub Release (Days 7-14)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 1992bf1 · 2 files changed: 290 additions & 0 deletions

docs/plan/prd/prd_v1.2.md

# TurboQuant.cpp v1.2 PRD — From Research to Adoption

**Version**: 1.2
**Date**: 2026-04-03
**Status**: Active
**Goal**: 1000+ GitHub Stars

---
## Current State

- 50+ stars, 4 architectures verified (Llama, Gemma, Qwen, Qwen-MoE)
- PPL +0.00% at 800 tokens (1-bit K), +0.03% (1-bit K + Q4 V)
- 7.1x KV compression, 33 tests, CI green
- llama.cpp integration patch ready but not submitted
- No standard benchmarks (WikiText-2, MMLU)
- Not usable from llama.cpp/ollama/vLLM
## Problem

TurboQuant is a proven algorithm trapped in a standalone engine. The community can't use it without building from source and abandoning their existing tools. To reach 1000+ stars, the algorithm must live inside tools people already use.
## Objectives

| # | Objective | Success Metric | Priority |
|---|-----------|----------------|----------|
| O1 | llama.cpp integration works end-to-end | `--cache-type-k tq_kv_1b` produces correct output in llama.cpp fork | P0 |
| O2 | Standard benchmark proves the claim | WikiText-2 PPL: TurboQuant 1b ≤ llama.cpp Q4 KV | P0 |
| O3 | Killer demo that goes viral | 128K context on 16GB Mac, GIF/video posted | P1 |
| O4 | One-command experience | `docker run` or `pip install` → working demo | P1 |
| O5 | Academic credibility | arXiv paper with standard benchmarks | P2 |
## Non-Goals

- Building a production serving engine (use vLLM/llama.cpp instead)
- Supporting 50+ architectures (focus on the Llama family first)
- Competing on inference speed (our value is memory, not speed)
- CUDA GPU testing (requires hardware we don't have)

---
## Phase 1: llama.cpp Fork (Days 1-3)

**Goal**: TurboQuant KV working inside llama.cpp, not our engine.

1. Fork llama.cpp, apply our integration patch
2. Add `GGML_TYPE_TQ_KV_1B` to ggml.h
3. Register quantize/dequantize in the ggml.c type_traits table
4. Add to CLI: `--cache-type-k tq_kv_1b`
5. Build and run: `./llama-cli -m llama3-8b.gguf --cache-type-k tq_kv_1b -p "Hello"`
6. Verify: WikiText-2 PPL within 0.1% of the FP16 baseline

**Deliverable**: GitHub repo `quantumaikr/llama.cpp` fork with TurboQuant KV
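Steps 2-4 touch two files in the fork. A minimal sketch of both edits, assuming enum value 41 is still free and that `common/arg.cpp` still exposes a `kv_cache_types` list (true of recent llama.cpp trees; verify against the fork's revision):

```cpp
// ggml/include/ggml.h — extend the ggml_type enum.
// The numeric value 41 is an assumption; it must not collide upstream.
enum ggml_type {
    // ... existing types ...
    GGML_TYPE_TQ_KV_1B = 41,
    GGML_TYPE_COUNT    = 42,
};

// common/arg.cpp — make the type selectable via --cache-type-k.
const std::vector<ggml_type> kv_cache_types = {
    GGML_TYPE_F32,
    GGML_TYPE_F16,
    // ... existing entries ...
    GGML_TYPE_TQ_KV_1B,
};
```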
## Phase 2: Standard Benchmarks (Days 3-5)

**Goal**: Undeniable comparison table.

1. WikiText-2 PPL (full test set, ~245K word-level tokens):
   - llama.cpp FP16 KV (baseline)
   - llama.cpp Q4_0 KV
   - llama.cpp Q8_0 KV
   - TurboQuant 1-bit KV
   - TurboQuant 1-bit K + Q4 V
2. Memory measurement at 32K context:
   - RSS for each configuration
   - KV-only memory breakdown
3. Models: SmolLM2-1.7B (available), Qwen3.5-0.8B, Gemma-4B

**Deliverable**: `bench/results/wikitext2_comparison.md` with reproducible results
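All five rows can come from one sweep. A sketch, assuming the fork's default build layout and a downloaded `wiki.test.raw`; note that upstream llama.cpp has required flash attention for a quantized V cache, so `-fa` is included for the mixed run (check the current flag spelling):

```bash
# PPL sweep over KV-cache types; MODEL/DATA paths are placeholders.
MODEL=models/SmolLM2-1.7B.gguf
DATA=wikitext-2-raw/wiki.test.raw

for ctk in f16 q8_0 q4_0 tq_kv_1b; do
    ./build/bin/llama-perplexity -m "$MODEL" -f "$DATA" \
        --cache-type-k "$ctk" 2>&1 | tee "ppl_k_${ctk}.log"
done

# Mixed config: 1-bit K with Q4_0 V.
./build/bin/llama-perplexity -m "$MODEL" -f "$DATA" -fa \
    --cache-type-k tq_kv_1b --cache-type-v q4_0 2>&1 | tee ppl_k1b_vq4.log
```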
## Phase 3: Killer Demo (Days 5-7)

**Goal**: "128K context on 16GB Mac" video.

1. Prepare a 50-page public domain text (e.g., Project Gutenberg)
2. Run with `--ctx 131072 -k turbo_kv_1b -v q4`
3. Show: model reads the book, answers questions about it
4. Record: terminal screencast with memory stats overlay
5. Post to r/LocalLLM, r/MachineLearning, HN

**Deliverable**: Demo video/GIF + Reddit post
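For the memory-stats overlay in step 4, sampling the resident set of the running process once per second is enough. A sketch for macOS (`ps -o rss=` reports KiB; the binary name `tq_run` follows the WBS demo script):

```bash
# Print RSS once per second until the demo process exits.
PID=$(pgrep -n tq_run)
while kill -0 "$PID" 2>/dev/null; do
    ps -o rss= -p "$PID" | awk '{printf "RSS: %.2f GiB\n", $1 / 1048576}'
    sleep 1
done
```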
## Phase 4: Paper & Release (Days 7-14)

**Goal**: Academic credibility + easy installation.

1. Update `docs/technical_report.md` with WikiText-2 results
2. Add comparison with KIVI, GEAR, KVQuant (from their papers)
3. Submit to arXiv
4. GitHub Release v1.2.0 with pre-built binaries (macOS arm64, Linux x86_64)
5. Docker image on GitHub Container Registry

**Deliverable**: arXiv paper + GitHub Release + Docker image

---
## Success Criteria

| Metric | Target | How to Measure |
|--------|--------|----------------|
| GitHub Stars | 200+ (path to 1000) | GitHub |
| llama.cpp fork users | 50+ clones | GitHub traffic |
| Reddit upvotes | 50+ on demo post | Reddit |
| WikiText-2 PPL delta | ≤ +0.1% vs FP16 | Benchmark script |
| Standard comparison | TQ 1b KV memory < llama.cpp Q4 KV memory | RSS measurement |

---
## Risks

| Risk | Mitigation |
|------|------------|
| llama.cpp type system incompatible | Ship a self-contained fork, not an upstream PR |
| WikiText-2 PPL worse than expected | Test on our engine first, then port |
| 128K context OOMs on 16GB | Use a smaller model (0.8B) or fall back to 64K |
| arXiv rejection | Post as a technical report on GitHub |

docs/plan/wbs/wbs_v1.2.md

# TurboQuant.cpp — Work Breakdown Structure v1.2

**Version**: 1.2
**Date**: 2026-04-03
**Focus**: llama.cpp Integration, Standard Benchmarks, Killer Demo

---
## Phase 1: llama.cpp Fork Integration (Days 1-3)

### 1.1 Fork Setup (~2h)
- [ ] Fork ggerganov/llama.cpp to quantumaikr/llama.cpp
- [ ] Clone fork locally alongside TurboQuant.cpp
- [ ] Build baseline llama.cpp: `cmake -B build && cmake --build build`
- [ ] Verify baseline works: `./build/bin/llama-cli -m model.gguf -p "Hello"`
### 1.2 Add GGML Type (~4h)
- [ ] Copy `integrations/llamacpp/patch/ggml-turbo-quant.h` to `ggml/include/`
- [ ] Copy `integrations/llamacpp/patch/ggml-turbo-quant.c` to `ggml/src/`
- [ ] Add `GGML_TYPE_TQ_KV_1B = 41` to the `ggml/include/ggml.h` enum
- [ ] Increment `GGML_TYPE_COUNT` to 42
- [ ] Add a type_traits entry in `ggml/src/ggml.c`:

```c
[GGML_TYPE_TQ_KV_1B] = {
    .type_name      = "tq_kv_1b",
    .blck_size      = 128,
    .type_size      = 24,
    .is_quantized   = true,
    .to_float       = dequantize_row_tq_kv_1b,
    .from_float_ref = quantize_row_tq_kv_1b_ref,
},
```

- [ ] Add the source to CMakeLists.txt: `ggml/src/ggml-turbo-quant.c`
- [ ] Build: verify no compile errors
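For orientation, a sketch of what the registered `to_float` hook has to do. The 24-byte/128-value geometry is consistent with 16 bytes of sign bits plus two fp32 parameters; that split, the struct, and the field names here are assumptions, and the authoritative layout lives in `ggml-turbo-quant.h`:

```c
#include <stdint.h>

// Hypothetical 24-byte block: 128 one-bit codes + per-block scale/mean.
typedef struct {
    uint8_t signs[16]; // one sign bit per value
    float   scale;     // per-block magnitude
    float   mean;      // per-block offset
} block_tq_kv_1b;

// Shape of ggml's to_float contract: decode k values (k % 128 == 0) into y.
static void dequantize_row_tq_kv_1b(const void * vx, float * y, int64_t k) {
    const block_tq_kv_1b * x = (const block_tq_kv_1b *) vx;
    for (int64_t i = 0; i < k / 128; i++) {
        for (int j = 0; j < 128; j++) {
            const int bit = (x[i].signs[j / 8] >> (j % 8)) & 1;
            y[i*128 + j] = x[i].mean + (bit ? x[i].scale : -x[i].scale);
        }
    }
}
```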
### 1.3 Enable CLI (~2h)
- [ ] Add `GGML_TYPE_TQ_KV_1B` to `kv_cache_types` in `common/arg.cpp`
- [ ] Build and test: `./build/bin/llama-cli -m model.gguf --cache-type-k tq_kv_1b -p "Hello"`
- [ ] Verify output is coherent (not garbage)
### 1.4 PPL Verification (~2h)
- [ ] Run PPL: `./build/bin/llama-perplexity -m model.gguf --cache-type-k tq_kv_1b`
- [ ] Compare vs `--cache-type-k f16` (baseline)
- [ ] Compare vs `--cache-type-k q4_0` (llama.cpp native Q4)
- [ ] Record results in `bench/results/llamacpp_ppl.md`
### 1.5 Fix Issues (~4h)
- [ ] If PPL is wrong: debug the quantize/dequantize path (a standalone round-trip check is sketched below)
- [ ] If it crashes: check block size alignment and type registration
- [ ] If it is slow: profile the dequantize overhead
- [ ] Iterate until the PPL delta is < 0.1%
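Before debugging inside llama.cpp, round-tripping a random row through the two registered functions localizes most quantize/dequantize bugs. A sketch, assuming `ggml-turbo-quant.h` declares the pair with the type_traits signatures:

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "ggml-turbo-quant.h" // assumed to declare the two row functions

int main(void) {
    enum { K = 1024 };                   // any multiple of blck_size (128)
    float src[K], dst[K];
    void * buf = malloc((K / 128) * 24); // type_size = 24 bytes per block

    srand48(42);
    for (int i = 0; i < K; i++) src[i] = (float) drand48() - 0.5f;

    quantize_row_tq_kv_1b_ref(src, buf, K);
    dequantize_row_tq_kv_1b(buf, dst, K);

    // 1-bit codes won't reproduce values exactly; look for NaNs, shifted
    // blocks, or errors far above the per-block scale.
    float max_err = 0.0f;
    for (int i = 0; i < K; i++) max_err = fmaxf(max_err, fabsf(src[i] - dst[i]));
    printf("max abs round-trip error: %f\n", max_err);

    free(buf);
    return 0;
}
```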
---
## Phase 2: Standard Benchmarks (Days 3-5)

### 2.1 WikiText-2 Setup (~2h)
- [ ] Download the WikiText-2 test set
- [ ] Convert to plain text format for llama-perplexity
- [ ] Verify baseline PPL matches published numbers (±10%)
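llama.cpp has shipped a small fetch helper for this dataset. A sketch of setup plus the baseline run, assuming the script is still present at this path in the fork (otherwise download the standard `wikitext-2-raw-v1.zip` archive and unpack it manually):

```bash
# Fetch WikiText-2, then establish the FP16 baseline number.
./scripts/get-wikitext-2.sh
./build/bin/llama-perplexity -m models/SmolLM2-1.7B.gguf \
    -f wikitext-2-raw/wiki.test.raw --cache-type-k f16
```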
### 2.2 Comprehensive PPL Table (~4h)
- [ ] Measure on SmolLM2-1.7B (or an available Llama-family model):

| Config | KV bytes per 128 values (nominal) | WikiText-2 PPL |
|--------|-----------------------------------|----------------|
| FP16 KV (baseline) | 256 bytes | ? |
| llama.cpp Q8_0 KV | 128 bytes | ? |
| llama.cpp Q4_0 KV | 64 bytes | ? |
| **TurboQuant 1-bit K** | **24 bytes** | **?** |

- [ ] Measure memory: `vmstat` or `/proc/self/status` during inference (see the size estimate below)
- [ ] Create comparison chart (ASCII or markdown)
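To sanity-check measured RSS against theory, a back-of-envelope estimate of full-model KV size helps. A sketch with placeholder shape values (read the real layer/KV-head/head-dim counts from the model's metadata; `q4_0` here includes its per-block scale, and the `tq_kv_1b` cost assumes both K and V use the 24-byte block):

```bash
# Nominal KV-cache size at 32K context for each cache type.
awk 'BEGIN {
    layers = 24; kv_heads = 8; head_dim = 128; ctx = 32768;  # placeholders
    b["f16"] = 2.0; b["q4_0"] = 18 / 32.0; b["tq_kv_1b"] = 24 / 128.0;
    for (t in b) {
        gib = 2 * layers * kv_heads * head_dim * ctx * b[t] / 2^30;
        printf "%-9s %.2f GiB\n", t, gib;
    }
}'
```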
### 2.3 Memory Crossover Chart (~2h)
- [ ] Measure RSS at context lengths: 1K, 4K, 8K, 16K, 32K, 64K
- [ ] For each KV type: FP16, Q4, TQ_1b
- [ ] Find the crossover point where FP16 OOMs but TQ_1b survives
- [ ] Create chart for README
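On macOS, `/usr/bin/time -l` prints a `maximum resident set size` line, which makes the sweep scriptable. A sketch (the byte-based prompt filler is crude and the paths are placeholders; on Linux use `/usr/bin/time -v` instead):

```bash
# Peak RSS for each (context length, KV type) pair.
MODEL=models/SmolLM2-1.7B.gguf
for ctx in 1024 4096 8192 16384 32768 65536; do
    for kt in f16 q4_0 tq_kv_1b; do
        echo "ctx=$ctx type=$kt" >> rss_sweep.log
        /usr/bin/time -l ./build/bin/llama-cli -m "$MODEL" -c "$ctx" \
            --cache-type-k "$kt" -n 16 -p "$(head -c $((ctx * 3)) book.txt)" \
            2>&1 | grep 'maximum resident set size' >> rss_sweep.log
    done
done
```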
### 2.4 Publish Results (~1h)
- [ ] Write `bench/results/wikitext2_comparison.md`
- [ ] Update README with benchmark table
- [ ] Commit + push

---

## Phase 3: Killer Demo (Days 5-7)
### 3.1 Long Context Setup (~2h)
- [ ] Prepare a 50-page text (Project Gutenberg, public domain)
- [ ] Tokenize and verify: ~30K-50K tokens (one-liner sketched below)
- [ ] Test: model loads + generates with `--ctx 65536`
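For the token count, the fork's own tokenizer avoids any tokenizer mismatch. A sketch using the `llama-tokenize` tool that llama.cpp builds (flag spellings vary across revisions; check `llama-tokenize --help`):

```bash
# One token per output line, so wc -l approximates the token count.
./build/bin/llama-tokenize -m models/SmolLM2-1.7B.gguf -f book.txt | wc -l
```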
### 3.2 Demo Script (~2h)
- [ ] Create `scripts/demo_long_context.sh`:

```bash
# Shows: load book → ask question → get answer → show memory.
# Note: bash does not expand \n inside double quotes, so splice real
# newlines in with $'\n\n'.
./build/tq_run model.gguf \
    --ctx 65536 -k turbo_kv_1b -v q4 \
    -p "$(cat book.txt)"$'\n\n'"Summarize the key themes:" \
    -n 200 -M
```

- [ ] Test on SmolLM2-1.7B (fits in 16GB with a 64K context)
### 3.3 Record Demo (~2h)
- [ ] Install asciinema or a screen recorder
- [ ] Record: build → load model → long context generation → memory stats
- [ ] Convert to GIF (< 5MB for Reddit)
- [ ] Upload to GitHub repo
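One plausible recording pipeline, assuming asciinema plus its GIF renderer `agg`, with a gifsicle pass to get under the 5MB target; all three tools are external dependencies to install first:

```bash
# Record the scripted demo, render it to GIF, then shrink the GIF.
asciinema rec demo.cast -c "scripts/demo_long_context.sh"
agg demo.cast demo.gif
gifsicle -O3 --lossy=80 demo.gif -o demo_small.gif
```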
### 3.4 Community Post (~2h)
- [ ] Write Reddit post: "128K context on 16GB Mac — 7x KV compression with zero PPL loss"
- [ ] Include: benchmark table, GIF, GitHub link
- [ ] Post to r/LocalLLM, r/MachineLearning
- [ ] Prepare answers for expected questions

---
## Phase 4: Paper & Release (Days 7-14)

### 4.1 Paper Update (~8h)
- [ ] Add WikiText-2 results to `docs/technical_report.md`
- [ ] Add llama.cpp comparison section
- [ ] Add comparison vs KIVI, GEAR (from their published numbers)
- [ ] Format for arXiv submission
- [ ] Internal review
### 4.2 GitHub Release (~2h)
- [ ] Tag: `git tag -a v1.2.0 -m "Release v1.2.0"`
- [ ] Build binaries: macOS arm64, Linux x86_64
- [ ] Create GitHub Release with:
  - Pre-built binaries
  - Benchmark results
  - llama.cpp fork link
  - Quick start guide
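The tag-and-publish step can be scripted with the GitHub CLI; a sketch with placeholder asset and notes-file names:

```bash
# Tag, push, and publish the release with both binaries attached.
git tag -a v1.2.0 -m "Release v1.2.0"
git push origin v1.2.0
gh release create v1.2.0 \
    turboquant-v1.2.0-macos-arm64.tar.gz \
    turboquant-v1.2.0-linux-x86_64.tar.gz \
    --title "TurboQuant.cpp v1.2.0" \
    --notes-file docs/release_notes.md
```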
### 4.3 Docker Image (~1h)
- [ ] Build: `docker build -t ghcr.io/quantumaikr/turboquant .`
- [ ] Push to GHCR
- [ ] Test: `docker run ghcr.io/quantumaikr/turboquant model.gguf -p "Hello" -k turbo_kv_1b`
### 4.4 Announcement (~2h)
- [ ] Update README with release badge
- [ ] Post to HN: "Show HN: 1-bit KV Cache — 7x memory reduction, zero PPL loss"
- [ ] Tweet/post with benchmark chart
- [ ] Submit paper to arXiv

---
## Verification Checkpoints

| Checkpoint | Criteria | When |
|------------|----------|------|
| **V1** | llama.cpp fork builds with TQ type | Day 1 |
| **V2** | `--cache-type-k tq_kv_1b` produces coherent output | Day 2 |
| **V3** | WikiText-2 PPL delta < 0.1% vs FP16 | Day 4 |
| **V4** | Memory table shows TQ < Q4 < Q8 < FP16 | Day 4 |
| **V5** | 64K+ context demo works on 16GB | Day 6 |
| **V6** | GitHub Release published | Day 8 |
| **V7** | arXiv paper submitted | Day 14 |

---
## Resource Requirements

- llama.cpp fork: ~1 day setup
- WikiText-2 dataset: free download
- Models: SmolLM2-1.7B (already downloaded), Qwen 0.8B (available)
- Hardware: M3 MacBook Air, 16GB (available)
- No CUDA GPU needed (CPU benchmarks are sufficient for PPL comparison)
