
Commit 49222c5

unamedkr and claude committed

Plan v1.2: From Research to Adoption (target 1000+ stars)

PRD v1.2: 4 phases — llama.cpp fork, standard benchmarks, killer demo, paper
WBS v1.2: 20 concrete tasks with verification checkpoints
Phase 1: llama.cpp fork with --cache-type-k tq_kv_1b (Days 1-3)
Phase 2: WikiText-2 PPL comparison vs Q4/Q8/FP16 (Days 3-5)
Phase 3: 128K context demo on 16GB Mac (Days 5-7)
Phase 4: arXiv paper + GitHub Release (Days 7-14)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 1992bf1 · 2 files changed: 290 additions & 0 deletions

docs/plan/prd/prd_v1.2.md

# TurboQuant.cpp v1.2 PRD — From Research to Adoption

**Version**: 1.2
**Date**: 2026-04-03
**Status**: Active
**Goal**: 1000+ GitHub Stars

---
## Current State

- 50+ stars, 4 architectures verified (Llama, Gemma, Qwen, Qwen-MoE)
- PPL +0.00% at 800 tokens (1-bit K), +0.03% (1-bit K + Q4 V)
- 7.1x KV compression, 33 tests, CI green
- llama.cpp integration patch ready but not submitted
- No standard benchmarks (WikiText-2, MMLU)
- Not usable from llama.cpp/ollama/vLLM
## Problem

TurboQuant is a proven algorithm trapped in a standalone engine. The community can't use it without building from source and abandoning their existing tools. To reach 1000+ stars, the algorithm must live inside tools people already use.
## Objectives

| # | Objective | Success Metric | Priority |
|---|-----------|----------------|----------|
| O1 | llama.cpp integration works end-to-end | `--cache-type-k tq_kv_1b` produces correct output in llama.cpp fork | P0 |
| O2 | Standard benchmark proves the claim | WikiText-2 PPL: TurboQuant 1b ≤ llama.cpp Q4 KV | P0 |
| O3 | Killer demo that goes viral | 128K context on 16GB Mac, GIF/video posted | P1 |
| O4 | One-command experience | `docker run` or `pip install` → working demo | P1 |
| O5 | Academic credibility | arXiv paper with standard benchmarks | P2 |
## Non-Goals

- Building a production serving engine (use vLLM/llama.cpp instead)
- Supporting 50+ architectures (focus on the Llama family first)
- Competing on inference speed (our value is memory, not speed)
- CUDA GPU testing (requires hardware we don't have)

---
## Phase 1: llama.cpp Fork (Days 1-3)

**Goal**: TurboQuant KV working inside llama.cpp, not our engine.

1. Fork llama.cpp, apply our integration patch
2. Add `GGML_TYPE_TQ_KV_1B` to ggml.h
3. Register quantize/dequantize in the ggml.c type_traits table
4. Add to CLI: `--cache-type-k tq_kv_1b`
5. Build and run: `./llama-cli -m llama3-8b.gguf --cache-type-k tq_kv_1b -p "Hello"`
6. Verify: WikiText-2 PPL within 0.1% of the FP16 baseline

**Deliverable**: GitHub repo `quantumaikr/llama.cpp` fork with TurboQuant KV
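Steps 2-4 touch two files in the fork. A minimal sketch of both edits, assuming enum value 41 is still free and that `common/arg.cpp` still exposes a `kv_cache_types` list (true of recent llama.cpp trees; verify against the fork's revision):

```cpp
// ggml/include/ggml.h — extend the ggml_type enum.
// The numeric value 41 is an assumption; it must not collide upstream.
enum ggml_type {
    // ... existing types ...
    GGML_TYPE_TQ_KV_1B = 41,
    GGML_TYPE_COUNT    = 42,
};

// common/arg.cpp — make the type selectable via --cache-type-k.
const std::vector<ggml_type> kv_cache_types = {
    GGML_TYPE_F32,
    GGML_TYPE_F16,
    // ... existing entries ...
    GGML_TYPE_TQ_KV_1B,
};
```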
## Phase 2: Standard Benchmarks (Days 3-5)

**Goal**: Undeniable comparison table.

1. WikiText-2 PPL (full test set, ~245K word-level tokens):
   - llama.cpp FP16 KV (baseline)
   - llama.cpp Q4_0 KV
   - llama.cpp Q8_0 KV
   - TurboQuant 1-bit KV
   - TurboQuant 1-bit K + Q4 V
2. Memory measurement at 32K context:
   - RSS for each configuration
   - KV-only memory breakdown
3. Models: SmolLM2-1.7B (available), Qwen3.5-0.8B, Gemma-4B

**Deliverable**: `bench/results/wikitext2_comparison.md` with reproducible results
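All five rows can come from one sweep. A sketch, assuming the fork's default build layout and a downloaded `wiki.test.raw`; note that upstream llama.cpp has required flash attention for a quantized V cache, so `-fa` is included for the mixed run (check the current flag spelling):

```bash
# PPL sweep over KV-cache types; MODEL/DATA paths are placeholders.
MODEL=models/SmolLM2-1.7B.gguf
DATA=wikitext-2-raw/wiki.test.raw

for ctk in f16 q8_0 q4_0 tq_kv_1b; do
    ./build/bin/llama-perplexity -m "$MODEL" -f "$DATA" \
        --cache-type-k "$ctk" 2>&1 | tee "ppl_k_${ctk}.log"
done

# Mixed config: 1-bit K with Q4_0 V.
./build/bin/llama-perplexity -m "$MODEL" -f "$DATA" -fa \
    --cache-type-k tq_kv_1b --cache-type-v q4_0 2>&1 | tee ppl_k1b_vq4.log
```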
## Phase 3: Killer Demo (Days 5-7)

**Goal**: "128K context on 16GB Mac" video.

1. Prepare a 50-page public domain text (e.g., Project Gutenberg)
2. Run with `--ctx 131072 -k turbo_kv_1b -v q4`
3. Show: model reads the book, answers questions about it
4. Record: terminal screencast with memory stats overlay
5. Post to r/LocalLLM, r/MachineLearning, HN

**Deliverable**: Demo video/GIF + Reddit post
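For the memory-stats overlay in step 4, sampling the resident set of the running process once per second is enough. A sketch for macOS (`ps -o rss=` reports KiB; the binary name `tq_run` follows the WBS demo script):

```bash
# Print RSS once per second until the demo process exits.
PID=$(pgrep -n tq_run)
while kill -0 "$PID" 2>/dev/null; do
    ps -o rss= -p "$PID" | awk '{printf "RSS: %.2f GiB\n", $1 / 1048576}'
    sleep 1
done
```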
## Phase 4: Paper & Release (Days 7-14)

**Goal**: Academic credibility + easy installation.

1. Update `docs/technical_report.md` with WikiText-2 results
2. Add comparison with KIVI, GEAR, KVQuant (from their papers)
3. Submit to arXiv
4. GitHub Release v1.2.0 with pre-built binaries (macOS arm64, Linux x86_64)
5. Docker image on GitHub Container Registry

**Deliverable**: arXiv paper + GitHub Release + Docker image

---
## Success Criteria

| Metric | Target | How to Measure |
|--------|--------|----------------|
| GitHub Stars | 200+ (path to 1000) | GitHub |
| llama.cpp fork users | 50+ clones | GitHub traffic |
| Reddit upvotes | 50+ on demo post | Reddit |
| WikiText-2 PPL delta | ≤ +0.1% vs FP16 | Benchmark script |
| Standard comparison | TQ 1b KV memory < llama.cpp Q4 KV memory | RSS measurement |

---
## Risks

| Risk | Mitigation |
|------|------------|
| llama.cpp type system incompatible | Ship a self-contained fork, not an upstream PR |
| WikiText-2 PPL worse than expected | Test on our engine first, then port |
| 128K context OOMs on 16GB | Use a smaller model (0.8B) or fall back to 64K |
| arXiv rejection | Post as a technical report on GitHub |

docs/plan/wbs/wbs_v1.2.md

# TurboQuant.cpp — Work Breakdown Structure v1.2

**Version**: 1.2
**Date**: 2026-04-03
**Focus**: llama.cpp Integration, Standard Benchmarks, Killer Demo

---
## Phase 1: llama.cpp Fork Integration (Days 1-3)

### 1.1 Fork Setup (~2h)
- [ ] Fork ggerganov/llama.cpp to quantumaikr/llama.cpp
- [ ] Clone fork locally alongside TurboQuant.cpp
- [ ] Build baseline llama.cpp: `cmake -B build && cmake --build build`
- [ ] Verify baseline works: `./build/bin/llama-cli -m model.gguf -p "Hello"`
### 1.2 Add GGML Type (~4h)
- [ ] Copy `integrations/llamacpp/patch/ggml-turbo-quant.h` to `ggml/include/`
- [ ] Copy `integrations/llamacpp/patch/ggml-turbo-quant.c` to `ggml/src/`
- [ ] Add `GGML_TYPE_TQ_KV_1B = 41` to the `ggml/include/ggml.h` enum
- [ ] Increment `GGML_TYPE_COUNT` to 42
- [ ] Add a type_traits entry in `ggml/src/ggml.c`:

```c
[GGML_TYPE_TQ_KV_1B] = {
    .type_name      = "tq_kv_1b",
    .blck_size      = 128,
    .type_size      = 24,
    .is_quantized   = true,
    .to_float       = dequantize_row_tq_kv_1b,
    .from_float_ref = quantize_row_tq_kv_1b_ref,
},
```

- [ ] Add the source to CMakeLists.txt: `ggml/src/ggml-turbo-quant.c`
- [ ] Build: verify no compile errors
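For orientation, a sketch of what the registered `to_float` hook has to do. The 24-byte/128-value geometry is consistent with 16 bytes of sign bits plus two fp32 parameters; that split, the struct, and the field names here are assumptions, and the authoritative layout lives in `ggml-turbo-quant.h`:

```c
#include <stdint.h>

// Hypothetical 24-byte block: 128 one-bit codes + per-block scale/mean.
typedef struct {
    uint8_t signs[16]; // one sign bit per value
    float   scale;     // per-block magnitude
    float   mean;      // per-block offset
} block_tq_kv_1b;

// Shape of ggml's to_float contract: decode k values (k % 128 == 0) into y.
static void dequantize_row_tq_kv_1b(const void * vx, float * y, int64_t k) {
    const block_tq_kv_1b * x = (const block_tq_kv_1b *) vx;
    for (int64_t i = 0; i < k / 128; i++) {
        for (int j = 0; j < 128; j++) {
            const int bit = (x[i].signs[j / 8] >> (j % 8)) & 1;
            y[i*128 + j] = x[i].mean + (bit ? x[i].scale : -x[i].scale);
        }
    }
}
```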
### 1.3 Enable CLI (~2h)
- [ ] Add `GGML_TYPE_TQ_KV_1B` to `kv_cache_types` in `common/arg.cpp`
- [ ] Build and test: `./build/bin/llama-cli -m model.gguf --cache-type-k tq_kv_1b -p "Hello"`
- [ ] Verify output is coherent (not garbage)
### 1.4 PPL Verification (~2h)
- [ ] Run PPL: `./build/bin/llama-perplexity -m model.gguf --cache-type-k tq_kv_1b`
- [ ] Compare vs `--cache-type-k f16` (baseline)
- [ ] Compare vs `--cache-type-k q4_0` (llama.cpp native Q4)
- [ ] Record results in `bench/results/llamacpp_ppl.md`
### 1.5 Fix Issues (~4h)
- [ ] If PPL is wrong: debug the quantize/dequantize path (a standalone round-trip check is sketched below)
- [ ] If it crashes: check block size alignment and type registration
- [ ] If it is slow: profile the dequantize overhead
- [ ] Iterate until the PPL delta is < 0.1%
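Before debugging inside llama.cpp, round-tripping a random row through the two registered functions localizes most quantize/dequantize bugs. A sketch, assuming `ggml-turbo-quant.h` declares the pair with the type_traits signatures:

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "ggml-turbo-quant.h" // assumed to declare the two row functions

int main(void) {
    enum { K = 1024 };                   // any multiple of blck_size (128)
    float src[K], dst[K];
    void * buf = malloc((K / 128) * 24); // type_size = 24 bytes per block

    srand48(42);
    for (int i = 0; i < K; i++) src[i] = (float) drand48() - 0.5f;

    quantize_row_tq_kv_1b_ref(src, buf, K);
    dequantize_row_tq_kv_1b(buf, dst, K);

    // 1-bit codes won't reproduce values exactly; look for NaNs, shifted
    // blocks, or errors far above the per-block scale.
    float max_err = 0.0f;
    for (int i = 0; i < K; i++) max_err = fmaxf(max_err, fabsf(src[i] - dst[i]));
    printf("max abs round-trip error: %f\n", max_err);

    free(buf);
    return 0;
}
```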
---
## Phase 2: Standard Benchmarks (Days 3-5)

### 2.1 WikiText-2 Setup (~2h)
- [ ] Download the WikiText-2 test set
- [ ] Convert to plain text format for llama-perplexity
- [ ] Verify baseline PPL matches published numbers (±10%)
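llama.cpp has shipped a small fetch helper for this dataset. A sketch of setup plus the baseline run, assuming the script is still present at this path in the fork (otherwise download the standard `wikitext-2-raw-v1.zip` archive and unpack it manually):

```bash
# Fetch WikiText-2, then establish the FP16 baseline number.
./scripts/get-wikitext-2.sh
./build/bin/llama-perplexity -m models/SmolLM2-1.7B.gguf \
    -f wikitext-2-raw/wiki.test.raw --cache-type-k f16
```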
### 2.2 Comprehensive PPL Table (~4h)
- [ ] Measure on SmolLM2-1.7B (or an available Llama-family model):

| Config | KV bytes per 128 values (nominal) | WikiText-2 PPL |
|--------|-----------------------------------|----------------|
| FP16 KV (baseline) | 256 bytes | ? |
| llama.cpp Q8_0 KV | 128 bytes | ? |
| llama.cpp Q4_0 KV | 64 bytes | ? |
| **TurboQuant 1-bit K** | **24 bytes** | **?** |

- [ ] Measure memory: `vmstat` or `/proc/self/status` during inference (see the size estimate below)
- [ ] Create comparison chart (ASCII or markdown)
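To sanity-check measured RSS against theory, a back-of-envelope estimate of full-model KV size helps. A sketch with placeholder shape values (read the real layer/KV-head/head-dim counts from the model's metadata; `q4_0` here includes its per-block scale, and the `tq_kv_1b` cost assumes both K and V use the 24-byte block):

```bash
# Nominal KV-cache size at 32K context for each cache type.
awk 'BEGIN {
    layers = 24; kv_heads = 8; head_dim = 128; ctx = 32768;  # placeholders
    b["f16"] = 2.0; b["q4_0"] = 18 / 32.0; b["tq_kv_1b"] = 24 / 128.0;
    for (t in b) {
        gib = 2 * layers * kv_heads * head_dim * ctx * b[t] / 2^30;
        printf "%-9s %.2f GiB\n", t, gib;
    }
}'
```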
### 2.3 Memory Crossover Chart (~2h)
- [ ] Measure RSS at context lengths: 1K, 4K, 8K, 16K, 32K, 64K
- [ ] For each KV type: FP16, Q4, TQ_1b
- [ ] Find the crossover point where FP16 OOMs but TQ_1b survives
- [ ] Create chart for README
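On macOS, `/usr/bin/time -l` prints a `maximum resident set size` line, which makes the sweep scriptable. A sketch (the byte-based prompt filler is crude and the paths are placeholders; on Linux use `/usr/bin/time -v` instead):

```bash
# Peak RSS for each (context length, KV type) pair.
MODEL=models/SmolLM2-1.7B.gguf
for ctx in 1024 4096 8192 16384 32768 65536; do
    for kt in f16 q4_0 tq_kv_1b; do
        echo "ctx=$ctx type=$kt" >> rss_sweep.log
        /usr/bin/time -l ./build/bin/llama-cli -m "$MODEL" -c "$ctx" \
            --cache-type-k "$kt" -n 16 -p "$(head -c $((ctx * 3)) book.txt)" \
            2>&1 | grep 'maximum resident set size' >> rss_sweep.log
    done
done
```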
### 2.4 Publish Results (~1h)
- [ ] Write `bench/results/wikitext2_comparison.md`
- [ ] Update README with benchmark table
- [ ] Commit + push

---

## Phase 3: Killer Demo (Days 5-7)
### 3.1 Long Context Setup (~2h)
- [ ] Prepare a 50-page text (Project Gutenberg, public domain)
- [ ] Tokenize and verify: ~30K-50K tokens (one-liner sketched below)
- [ ] Test: model loads + generates with `--ctx 65536`
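For the token count, the fork's own tokenizer avoids any tokenizer mismatch. A sketch using the `llama-tokenize` tool that llama.cpp builds (flag spellings vary across revisions; check `llama-tokenize --help`):

```bash
# One token per output line, so wc -l approximates the token count.
./build/bin/llama-tokenize -m models/SmolLM2-1.7B.gguf -f book.txt | wc -l
```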
### 3.2 Demo Script (~2h)
- [ ] Create `scripts/demo_long_context.sh`:

```bash
# Shows: load book → ask question → get answer → show memory.
# Note: bash does not expand \n inside double quotes, so splice real
# newlines in with $'\n\n'.
./build/tq_run model.gguf \
    --ctx 65536 -k turbo_kv_1b -v q4 \
    -p "$(cat book.txt)"$'\n\n'"Summarize the key themes:" \
    -n 200 -M
```

- [ ] Test on SmolLM2-1.7B (fits in 16GB with a 64K context)
### 3.3 Record Demo (~2h)
- [ ] Install asciinema or a screen recorder
- [ ] Record: build → load model → long context generation → memory stats
- [ ] Convert to GIF (< 5MB for Reddit)
- [ ] Upload to GitHub repo
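One plausible recording pipeline, assuming asciinema plus its GIF renderer `agg`, with a gifsicle pass to get under the 5MB target; all three tools are external dependencies to install first:

```bash
# Record the scripted demo, render it to GIF, then shrink the GIF.
asciinema rec demo.cast -c "scripts/demo_long_context.sh"
agg demo.cast demo.gif
gifsicle -O3 --lossy=80 demo.gif -o demo_small.gif
```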
### 3.4 Community Post (~2h)
- [ ] Write Reddit post: "128K context on 16GB Mac — 7x KV compression with zero PPL loss"
- [ ] Include: benchmark table, GIF, GitHub link
- [ ] Post to r/LocalLLM, r/MachineLearning
- [ ] Prepare answers for expected questions

---
## Phase 4: Paper & Release (Days 7-14)

### 4.1 Paper Update (~8h)
- [ ] Add WikiText-2 results to `docs/technical_report.md`
- [ ] Add llama.cpp comparison section
- [ ] Add comparison vs KIVI, GEAR (from their published numbers)
- [ ] Format for arXiv submission
- [ ] Internal review
### 4.2 GitHub Release (~2h)
- [ ] Tag: `git tag -a v1.2.0 -m "Release v1.2.0"`
- [ ] Build binaries: macOS arm64, Linux x86_64
- [ ] Create GitHub Release with:
  - Pre-built binaries
  - Benchmark results
  - llama.cpp fork link
  - Quick start guide
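The tag-and-publish step can be scripted with the GitHub CLI; a sketch with placeholder asset and notes-file names:

```bash
# Tag, push, and publish the release with both binaries attached.
git tag -a v1.2.0 -m "Release v1.2.0"
git push origin v1.2.0
gh release create v1.2.0 \
    turboquant-v1.2.0-macos-arm64.tar.gz \
    turboquant-v1.2.0-linux-x86_64.tar.gz \
    --title "TurboQuant.cpp v1.2.0" \
    --notes-file docs/release_notes.md
```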
### 4.3 Docker Image (~1h)
- [ ] Build: `docker build -t ghcr.io/quantumaikr/turboquant .`
- [ ] Push to GHCR
- [ ] Test: `docker run ghcr.io/quantumaikr/turboquant model.gguf -p "Hello" -k turbo_kv_1b`
### 4.4 Announcement (~2h)
- [ ] Update README with release badge
- [ ] Post to HN: "Show HN: 1-bit KV Cache — 7x memory reduction, zero PPL loss"
- [ ] Tweet/post with benchmark chart
- [ ] Submit paper to arXiv

---
## Verification Checkpoints

| Checkpoint | Criteria | When |
|------------|----------|------|
| **V1** | llama.cpp fork builds with TQ type | Day 1 |
| **V2** | `--cache-type-k tq_kv_1b` produces coherent output | Day 2 |
| **V3** | WikiText-2 PPL delta < 0.1% vs FP16 | Day 4 |
| **V4** | Memory table shows TQ < Q4 < Q8 < FP16 | Day 4 |
| **V5** | 64K+ context demo works on 16GB | Day 6 |
| **V6** | GitHub Release published | Day 8 |
| **V7** | arXiv paper submitted | Day 14 |

---
## Resource Requirements

- llama.cpp fork: ~1 day setup
- WikiText-2 dataset: free download
- Models: SmolLM2-1.7B (already downloaded), Qwen 0.8B (available)
- Hardware: M3 MacBook Air, 16GB (available)
- No CUDA GPU needed (CPU benchmarks are sufficient for PPL comparison)
