# TurboQuant.cpp — Work Breakdown Structure v1.2

**Version**: 1.2
**Date**: 2026-04-03
**Focus**: llama.cpp Integration, Standard Benchmarks, Killer Demo

---

## Phase 1: llama.cpp Fork Integration (Days 1-3)

### 1.1 Fork Setup (~2h)
- [ ] Fork ggerganov/llama.cpp to quantumaikr/llama.cpp
- [ ] Clone the fork locally alongside TurboQuant.cpp
- [ ] Build baseline llama.cpp: `cmake -B build && cmake --build build`
- [ ] Verify the baseline works: `./build/bin/llama-cli -m model.gguf -p "Hello"`

### 1.2 Add GGML Type (~4h)
- [ ] Copy `integrations/llamacpp/patch/ggml-turbo-quant.h` to `ggml/include/`
- [ ] Copy `integrations/llamacpp/patch/ggml-turbo-quant.c` to `ggml/src/`
- [ ] Add `GGML_TYPE_TQ_KV_1B = 41` to the `ggml_type` enum in `ggml/include/ggml.h`
- [ ] Increment `GGML_TYPE_COUNT` to 42
- [ ] Add a type_traits entry in `ggml/src/ggml.c` (the row kernels it references are sketched after this list):
  ```c
  [GGML_TYPE_TQ_KV_1B] = {
      .type_name      = "tq_kv_1b",
      .blck_size      = 128,  // values per block
      .type_size      = 24,   // bytes per block (~1.5 bits/value)
      .is_quantized   = true,
      .to_float       = dequantize_row_tq_kv_1b,
      .from_float_ref = quantize_row_tq_kv_1b_ref,
  },
  ```
- [ ] Add the source file to CMakeLists.txt: `ggml/src/ggml-turbo-quant.c`
- [ ] Build: verify no compile errors
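
The two row kernels named in the `type_traits` entry must match ggml's `to_float` / `from_float_ref` conventions. A minimal sketch of what `ggml-turbo-quant.h` declares — the 24-byte block split shown here (16 bytes of packed sign bits plus 8 bytes of scale metadata) is an assumption inferred from `blck_size`/`type_size`, not the actual TurboQuant layout:

```cpp
// ggml-turbo-quant.h — expected interface (sketch; block layout is assumed).
#include <stdint.h>

// One block encodes 128 values in 24 bytes (~1.5 bits/value).
typedef struct {
    uint8_t signs[16]; // 128 x 1-bit quantized values (assumed)
    uint8_t meta[8];   // per-block scale/shift metadata (assumed)
} block_tq_kv_1b;      // sizeof == 24, matching .type_size above

// Signatures follow ggml's from_float_ref / to_float function-pointer types;
// k is the number of float values and must be a multiple of blck_size (128).
void quantize_row_tq_kv_1b_ref(const float * x, void * y, int64_t k);
void dequantize_row_tq_kv_1b(const void * x, float * y, int64_t k);
```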

### 1.3 Enable CLI (~2h)
- [ ] Add `GGML_TYPE_TQ_KV_1B` to `kv_cache_types` in `common/arg.cpp`, as sketched below
- [ ] Build and test: `./build/bin/llama-cli -m model.gguf --cache-type-k tq_kv_1b -p "Hello"`
- [ ] Verify the output is coherent (not garbage)
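
Registering the type with the CLI is one line: append the new enum value to the list that backs `--cache-type-k` / `--cache-type-v`. A sketch against the upstream `common/arg.cpp` layout (the exact set of existing entries may differ by llama.cpp revision):

```cpp
// common/arg.cpp — types accepted by --cache-type-k / --cache-type-v.
const std::vector<ggml_type> kv_cache_types = {
    GGML_TYPE_F32,
    GGML_TYPE_F16,
    GGML_TYPE_BF16,
    GGML_TYPE_Q8_0,
    GGML_TYPE_Q4_0,
    GGML_TYPE_Q4_1,
    GGML_TYPE_IQ4_NL,
    GGML_TYPE_Q5_0,
    GGML_TYPE_Q5_1,
    GGML_TYPE_TQ_KV_1B, // new: TurboQuant 1-bit K-cache type
};
```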

### 1.4 PPL Verification (~2h)
- [ ] Run PPL: `./build/bin/llama-perplexity -m model.gguf -f wiki.test.raw --cache-type-k tq_kv_1b` (the evaluation text goes in via `-f`)
- [ ] Compare vs `--cache-type-k f16` (baseline)
- [ ] Compare vs `--cache-type-k q4_0` (llama.cpp native Q4)
- [ ] Record results in `bench/results/llamacpp_ppl.md`

### 1.5 Fix Issues (~4h)
- [ ] If PPL is wrong: debug the quantize/dequantize path with a standalone round-trip test (sketch below)
- [ ] If it crashes: check block-size alignment and type registration
- [ ] If it is slow: profile dequantize overhead
- [ ] Iterate until the PPL delta is < 0.1%, i.e. (PPL_tq − PPL_fp16) / PPL_fp16 < 0.001
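
When PPL looks wrong, the fastest isolation step is a stand-alone round trip of the two row kernels, outside llama.cpp entirely. A minimal sketch, assuming the declarations above; a 1-bit format cannot reconstruct values exactly, so check for bounded error rather than equality:

```cpp
// roundtrip_check.cpp — isolate the quantize/dequantize path (sketch).
// Build: c++ roundtrip_check.cpp ggml-turbo-quant.c -o roundtrip_check
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

extern "C" {
void quantize_row_tq_kv_1b_ref(const float * x, void * y, int64_t k);
void dequantize_row_tq_kv_1b(const void * x, float * y, int64_t k);
}

int main() {
    const int64_t k = 128;           // one block (blck_size)
    std::vector<float>   src(k), dst(k);
    std::vector<uint8_t> packed(24); // one block (type_size)

    std::srand(42);
    for (auto & v : src) v = 2.0f * std::rand() / RAND_MAX - 1.0f;

    quantize_row_tq_kv_1b_ref(src.data(), packed.data(), k);
    dequantize_row_tq_kv_1b(packed.data(), dst.data(), k);

    double mae = 0.0;
    for (int64_t i = 0; i < k; i++) mae += std::fabs(src[i] - dst[i]);
    std::printf("mean |err| over one block: %.4f\n", mae / k);
    return 0;
}
```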

---

## Phase 2: Standard Benchmarks (Days 3-5)

### 2.1 WikiText-2 Setup (~2h)
- [ ] Download the WikiText-2 test set
- [ ] Convert to plain text format for llama-perplexity
- [ ] Verify baseline PPL matches published numbers (±10%)

### 2.2 Comprehensive PPL Table (~4h)
- [ ] Measure on SmolLM2-1.7B (or an available Llama-family model):

| Config | KV memory / token | WikiText-2 PPL |
|--------|-------------------|----------------|
| FP16 KV (baseline) | 256 bytes | ? |
| llama.cpp Q8_0 KV | 128 bytes | ? |
| llama.cpp Q4_0 KV | 64 bytes | ? |
| **TurboQuant 1-bit K** | **24 bytes** | **?** |

- [ ] Measure memory during inference: read `VmRSS` from `/proc/self/status` on Linux, or use `ps -o rss=` on macOS (portable probe sketched below)
- [ ] Create comparison chart (ASCII or markdown)
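
For the memory column, an in-process probe is less error-prone than eyeballing `ps` output. A minimal portable sketch using `getrusage`, which works on both Linux and macOS — note the units differ (`ru_maxrss` is kilobytes on Linux, bytes on macOS):

```cpp
// rss_probe.cpp — portable peak-RSS readout for benchmark runs (sketch).
#include <cstdio>
#include <sys/resource.h>

// Returns peak resident set size in bytes, or -1 on failure.
long peak_rss_bytes() {
    rusage ru{};
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
#ifdef __APPLE__
    return ru.ru_maxrss;          // macOS reports bytes
#else
    return ru.ru_maxrss * 1024L;  // Linux reports kilobytes
#endif
}

int main() {
    std::printf("peak RSS: %.1f MiB\n", peak_rss_bytes() / (1024.0 * 1024.0));
    return 0;
}
```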

### 2.3 Memory Crossover Chart (~2h)
- [ ] Measure RSS at context lengths: 1K, 4K, 8K, 16K, 32K, 64K
- [ ] For each KV type: FP16, Q4, TQ_1b
- [ ] Find the crossover point where FP16 OOMs but TQ_1b survives (the sizing model below predicts where to look)
- [ ] Create chart for README
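
Before measuring, a back-of-envelope model predicts where the crossover should land: total KV bytes ≈ ctx × n_layer × (K bytes/token + V bytes/token). A sketch using illustrative figures in the spirit of the 2.2 table — `n_layer = 24` and the per-token byte counts are assumed example values, so plug in the real per-layer K/V widths of the model under test:

```cpp
// kv_sizing.cpp — expected KV cache size vs context length (sketch).
// n_layer and the per-token byte counts are assumed example values.
#include <cstdio>

int main() {
    const int n_layer = 24; // assumed example value
    const struct { const char * name; int k_bytes, v_bytes; } cfgs[] = {
        { "FP16 K+V",          256, 256 },
        { "Q4_0 K+V",           64,  64 },
        { "TQ 1-bit K + Q4 V",  24,  64 },
    };
    const int ctxs[] = { 1024, 4096, 8192, 16384, 32768, 65536 };

    for (const auto & c : cfgs)
        for (int ctx : ctxs)
            std::printf("%-20s ctx=%6d  ~%8.1f MiB\n", c.name, ctx,
                        (double)ctx * n_layer * (c.k_bytes + c.v_bytes)
                            / (1024.0 * 1024.0));
    return 0;
}
```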

### 2.4 Publish Results (~1h)
- [ ] Write `bench/results/wikitext2_comparison.md`
- [ ] Update README with benchmark table
- [ ] Commit + push

---

## Phase 3: Killer Demo (Days 5-7)

### 3.1 Long Context Setup (~2h)
- [ ] Prepare a ~50-page text (Project Gutenberg, public domain)
- [ ] Tokenize and verify: ~30K-50K tokens (e.g. `./build/bin/llama-tokenize -m model.gguf -f book.txt | wc -l`)
- [ ] Test: model loads + generates with `--ctx 65536`

### 3.2 Demo Script (~2h)
- [ ] Create `scripts/demo_long_context.sh`:
  ```bash
  #!/usr/bin/env bash
  # Shows: load book → ask question → get answer → show memory.
  # $'\n\n' inserts real newlines; a "\n" inside double quotes would reach
  # the model as a literal backslash-n.
  PROMPT="$(cat book.txt)"$'\n\n'"Summarize the key themes:"
  ./build/tq_run model.gguf \
    --ctx 65536 -k turbo_kv_1b -v q4 \
    -p "$PROMPT" \
    -n 200 -M
  ```
- [ ] Test on SmolLM2-1.7B (fits in 16GB with 64K context)

### 3.3 Record Demo (~2h)
- [ ] Install asciinema or a screen recorder
- [ ] Record: build → load model → long-context generation → memory stats
- [ ] Convert to GIF (< 5MB for Reddit)
- [ ] Upload to the GitHub repo

### 3.4 Community Post (~2h)
- [ ] Write Reddit post: "64K context on a 16GB Mac — 7x KV compression with near-zero PPL loss"
- [ ] Include: benchmark table, GIF, GitHub link
- [ ] Post to r/LocalLLM and r/MachineLearning
- [ ] Prepare answers for expected questions

---

## Phase 4: Paper & Release (Days 7-14)

### 4.1 Paper Update (~8h)
- [ ] Add WikiText-2 results to `docs/technical_report.md`
- [ ] Add llama.cpp comparison section
- [ ] Add comparison vs KIVI and GEAR (from their published numbers)
- [ ] Format for arXiv submission
- [ ] Internal review

### 4.2 GitHub Release (~2h)
- [ ] Tag: `git tag -a v1.2.0 -m "Release v1.2.0"`
- [ ] Build binaries: macOS arm64, Linux x86_64
- [ ] Create GitHub Release with:
  - Pre-built binaries
  - Benchmark results
  - llama.cpp fork link
  - Quick-start guide

### 4.3 Docker Image (~1h)
- [ ] Build: `docker build -t ghcr.io/quantumaikr/turboquant .`
- [ ] Push to GHCR
- [ ] Test: `docker run ghcr.io/quantumaikr/turboquant model.gguf -p "Hello" -k turbo_kv_1b`

### 4.4 Announcement (~2h)
- [ ] Update README with release badge
- [ ] Post to HN: "Show HN: 1-bit KV Cache — 7x memory reduction, near-zero PPL loss"
- [ ] Tweet/post with benchmark chart
- [ ] Submit paper to arXiv

---

## Verification Checkpoints

| Checkpoint | Criteria | When |
|------------|----------|------|
| **V1** | llama.cpp fork builds with the TQ type | Day 1 |
| **V2** | `--cache-type-k tq_kv_1b` produces coherent output | Day 2 |
| **V3** | WikiText-2 PPL delta < 0.1% vs FP16 | Day 4 |
| **V4** | Memory table shows TQ < Q4 < Q8 < FP16 | Day 4 |
| **V5** | 64K+ context demo works on 16GB | Day 6 |
| **V6** | GitHub Release published | Day 8 |
| **V7** | arXiv paper submitted | Day 14 |

---

## Resource Requirements

- llama.cpp fork: ~1 day setup
- WikiText-2 dataset: free download
- Models: SmolLM2-1.7B (already downloaded), Qwen 0.8B (available)
- Hardware: M3 MacBook Air, 16GB (available)
- No CUDA GPU needed (CPU benchmarks are sufficient for PPL comparison)