# RLV Roadmap v2 — Post-Breakthrough

> **Date**: 2026-04-12
> **Model**: Phi-3.5-mini (Q8_0, fixed)
> **Language**: English (fixed)
> **Baseline**: 19/20 (95%) on 20-question Wikitext stress test
> **Key constraint**: speed (currently ~4 min/question, target ~30s)
---

## Current State

| Metric | Value |
|--------|-------|
| Accuracy | 19/20 (95%) |
| Speed | ~4 min/question |
| Model | Phi-3.5-mini Q8_0, 3.8B, CPU |
| Server | quant-server-unified (quant.h) |
| Inference | ~6.5 tok/s |
| Bottleneck | Every question = locator LLM + lookup LLM + verifier LLM = 3+ inference calls |

## Speed Budget Analysis

```
Current per-question breakdown (~240s):
  Locator LLM call: ~15s (classify chunk)
  Lookup LLM call: ~20s (extract answer, 64 tokens)
  Verifier: ~5s (literal check, fast)
  Research retries: ~200s (0-3 extra locate+lookup cycles)

Without retries (~40s):
  Locator + Lookup + Verify = ~40s per question

Target: 30s without retries, ~60s with 1 retry
```

---

## Phase 1: Speed (Week 1) — "4min → 30s"

### 1.1 Eliminate locator LLM call
**Impact: -15s/question**

The LLM classification was unreliable (Loop 5 finding: RRF-first beats LLM). Remove the LLM call entirely from the locator — use pure BM25+keyword RRF.

```python
# Before: BM25 + keyword + LLM call (15s)
# After: BM25 + keyword only (0.01s)
```
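
An LLM-free locator can be sketched as plain Reciprocal Rank Fusion over the two retrieval signals. This is a minimal sketch, not the actual locator code; the function name and the `k=60` damping constant are illustrative choices:

```python
from collections import defaultdict

def rrf_locate(bm25_ranking, keyword_ranking, k=60):
    """Fuse two chunk rankings with Reciprocal Rank Fusion.

    Each ranking is a list of chunk ids, best first. k=60 is the
    damping constant from the original RRF paper; a chunk ranked
    highly by both signals wins without any LLM call.
    """
    scores = defaultdict(float)
    for ranking in (bm25_ranking, keyword_ranking):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    # Highest fused score wins.
    return max(scores, key=scores.get)
```

Because RRF only needs the two ranked lists, the locator's cost drops to the retrieval scoring itself, which is why the before/after gap above is four orders of magnitude.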

### 1.2 KV cache pre-build with save_context
**Impact: -10s on lookup (eliminates prefill)**

Pre-compute the KV cache for each chunk during the gist stage. At lookup time, `load_context` instead of re-prefilling.

```python
# During gist build (one-time):
for chunk in chunks:
    ctx = quant_new(model, config)
    quant_generate(ctx, chunk.text, None, None)  # prefill only, no token callback
    quant_save_context(ctx, f"cache/{chunk.id}.kv")

# During lookup (per-question):
ctx = quant_new(model, config)
quant_load_context(ctx, f"cache/{chunk_id}.kv")  # near-instant vs. re-prefill
quant_generate(ctx, question, on_token, data)  # generate only
```

### 1.3 Reduce max_tokens
**Impact: -5s per LLM call**

Most answers are <20 tokens. Reduce from 64 to 24 for lookup, 8 for locator.
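
As a sketch, the budgets could live in one per-stage table; the dictionary keys and helper below are illustrative, since the actual config surface depends on the server:

```python
# Per-stage generation budgets. Answers are rarely >20 tokens, so a
# 64-token budget mostly pays for tokens that get discarded.
MAX_TOKENS = {
    "lookup": 24,   # short extractive answer
    "locator": 8,   # chunk id / label only
}

def budget_for(stage, default=64):
    # Unknown stages keep the old conservative 64-token budget.
    return MAX_TOKENS.get(stage, default)
```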

### 1.4 Parallel-safe server (connection reuse)
**Impact: -2s (eliminate server restart overhead)**

Keep the server running across questions. Current overhead: ~0.5s startup × 3 calls per question ≈ 1.5s wasted per question.
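
Assuming the server ends up speaking HTTP (as in the Phase 4 sketch), connection reuse is just holding one client connection open across questions. The `/ask` endpoint and JSON shape below are assumptions, not the server's confirmed API:

```python
import http.client
import json

class RLVClient:
    """Hold one HTTP connection open across questions instead of
    paying connect/startup overhead on every call.

    The /ask route and {"question": ...} payload are hypothetical.
    """

    def __init__(self, host="localhost", port=8080):
        self.conn = http.client.HTTPConnection(host, port)

    def ask(self, question):
        body = json.dumps({"question": question})
        self.conn.request("POST", "/ask", body,
                          headers={"Content-Type": "application/json"})
        resp = self.conn.getresponse()
        # Read the body fully so the connection can be reused.
        return json.loads(resp.read())
```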

### Phase 1 Target

| Metric | Before | After |
|--------|--------|-------|
| Locator | 15s (LLM) | **0.01s** (BM25 only) |
| Lookup | 20s | **10s** (KV cache + fewer tokens) |
| Verify | 5s | **2s** (literal only, no LLM fallback) |
| Retries | 200s | **30s** (faster per-retry) |
| **Total (no retry)** | **40s** | **~12s** |
| **Total (1 retry)** | **~80s** | **~25s** |

---

## Phase 2: Robustness (Week 2) — "Always correct or honestly uncertain"

### 2.1 Unanswerable question detection
Test with questions that have NO answer in the document. RLV should return "I don't know" with high confidence.

### 2.2 Long document scaling (100K+ tokens)
Test with 100K-token documents (50+ chunks). Verify that locator BM25 accuracy doesn't degrade.

### 2.3 Adversarial questions
- Misleading questions ("What year did Du Fu visit Paris?" — never happened)
- Ambiguous questions ("Who is the poet?" — multiple poets in the document)
- Questions spanning multiple chunks

### 2.4 Accuracy regression suite
Freeze the 20-question Wikitext set and the 7-question Acme set as a CI regression test. Any code change must pass 26/27 or better.
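
The gate can be a plain script that replays the frozen set and fails the build below threshold. A minimal sketch; the JSONL path, file format, and the `ask` callable are placeholders for whatever wraps the RLV pipeline:

```python
import json

def run_regression(ask, frozen_path="regression/frozen_27.jsonl",
                   min_correct=26):
    """Replay the frozen question set; return (passed, correct, total).

    Each JSONL line holds {"question": ..., "answer": ...}. A case
    counts as correct if the expected answer appears (case-insensitive)
    in the pipeline's output.
    """
    with open(frozen_path) as f:
        cases = [json.loads(line) for line in f]
    correct = sum(
        1 for c in cases
        if c["answer"].strip().lower() in ask(c["question"]).strip().lower()
    )
    return correct >= min_correct, correct, len(cases)
```

Substring matching keeps the gate aligned with the literal verifier; an exact-match gate would flag harmless phrasing changes as regressions.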

---

## Phase 3: Depth (Week 3) — "Harder questions"

### 3.1 Comparison questions
"Compare Du Fu's early career with his later years" → requires reading 2+ chunks and synthesizing.

### 3.2 Query decomposition
"What was happening in Du Fu's personal life when the An Lushan Rebellion began?" → decompose into:
1. "When did the An Lushan Rebellion begin?" → 755
2. "What was Du Fu's situation in 755?" → family, famine
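
The decomposition loop above can be sketched as a thin layer over the existing single-hop pipeline. The `decompose`, `answer_single`, and `synthesize` callables are placeholders for LLM-backed steps, and the `{N}` back-reference convention is one possible design, not a fixed decision:

```python
def answer_multi_hop(question, decompose, answer_single, synthesize):
    """Decompose, answer each sub-question, then synthesize.

    decompose(question)      -> list of sub-questions; later ones may
                                reference earlier answers via {1}, {2}, ...
    answer_single(q)         -> answer via the normal locate/lookup/verify path
    synthesize(question, qa) -> final answer from (sub-question, answer) pairs
    """
    qa = []
    for sub_q in decompose(question):
        # Substitute earlier answers into later sub-questions,
        # e.g. "What was Du Fu's situation in {1}?" -> "... in 755?"
        for i, (_, prev_answer) in enumerate(qa, start=1):
            sub_q = sub_q.replace("{" + str(i) + "}", prev_answer)
        qa.append((sub_q, answer_single(sub_q)))
    return synthesize(question, qa)
```

Keeping decomposition outside the pipeline means each sub-question still gets the full locate/lookup/verify treatment, so the Phase 2 robustness guarantees carry over per hop.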

### 3.3 Multi-document RLV
Feed 3 separate documents. "Compare the careers of Robert Boulter and Du Fu" → cross-document synthesis.

---

## Phase 4: Product (Week 4) — "People can use it"

### 4.1 CLI integration
```bash
quantcpp rlv --doc document.txt "What is the main argument?"
```

### 4.2 HTTP API
```bash
quantcpp rlv serve --doc document.txt --port 8080
curl localhost:8080/ask -d '{"question": "..."}'
```

### 4.3 Technical report
"RLV: Cliff-Aware Document QA for Small Language Models"
- Cliff measurement methodology
- RLV architecture + Karpathy loop evolution
- 19/20 result + ablation study
- Comparison: RLV vs RAG vs long-context

---

## Success Criteria

| Phase | Gate | Metric |
|-------|------|--------|
| 1 | Speed | <30s/question (no retry) |
| 2 | Robustness | 90%+ on 20 new questions + 100% unanswerable detection |
| 3 | Depth | Comparison questions working |
| 4 | Product | End user can run `quantcpp rlv` |

---

## Fixed Decisions

| Decision | Rationale |
|----------|-----------|
| English only | Focus on methodology, not multilingual tokenization |
| Phi-3.5-mini Q8_0 | Best speed/quality ratio (32K vocab, 6.5 tok/s) |
| CPU only | Democratization — runs on any laptop |
| quant.h unified server | Proven correct (no libturboquant sync bugs) |