
Commit 93ad892

unamedkr and claude committed
phase A-2: large doc 15/20 (75%) + speed 6.5x improvement
Large document stress test (1.3MB, 2754 chunks, 20 questions):

Result: 15/20 (75%) — Grade B (good, locator improvements needed)
Speed: 37s/question (was 240s — 6.5x faster)
Total: 12.3 minutes for 20 questions

Speed optimizations applied:
- Locator: removed LLM call, pure BM25+keyword RRF (-15s/q)
- Lookup: max_tokens 64→32 (-5s/q)
- Verifier: max_tokens 24→8 (-3s/q)
- Default: max_tokens 64→32, timeout 600→300s
- Research: max_retries 3→2

Failures (5/20):
- Q5: multi-hop reasoning on Ise-class ships
- Q9: "2006" appears in multiple articles (Kershaw draft year)
- Q10: same Kershaw ambiguity
- Q13: similar hurricane articles confused locator
- Q15: technical term "Stegocephalia" not found by BM25

Added: docs/plan/rlv_roadmap_v2.md — post-breakthrough roadmap

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e046c95 commit 93ad892

6 files changed

Lines changed: 180 additions & 27 deletions


bench/rlv/stages/_llm.py

Lines changed: 3 additions & 3 deletions
@@ -236,7 +236,7 @@ def _restart_server_if_dead(model: str | Path = DEFAULT_MODEL, verbose: bool = T
 def llm_call(
     prompt: str,
     *,
-    max_tokens: int = 64,
+    max_tokens: int = 32,
     temperature: float = 0.0,
     model: str | Path = DEFAULT_MODEL,
     enforce_budget: bool = True,
@@ -266,7 +266,7 @@ def llm_call(
 
     # Validate max_tokens
     if max_tokens <= 0:
-        max_tokens = 64
+        max_tokens = 32
 
     messages = []
     if system:
@@ -297,7 +297,7 @@ def llm_call(
 
     t0 = time.time()
     try:
-        with urllib.request.urlopen(req, timeout=600) as resp:
+        with urllib.request.urlopen(req, timeout=300) as resp:
             payload = json.loads(resp.read().decode("utf-8"))
             break  # success
     except urllib.error.HTTPError as e:

bench/rlv/stages/locator.py

Lines changed: 6 additions & 18 deletions
@@ -362,33 +362,21 @@ def locate(
             score=0.0,
         )
 
-    # --- Step 4: LLM classification on top candidates ---
-    # Always run LLM on the top 5 RRF candidates (not just when ambiguous)
-    top_candidates = [cid for cid, _ in rrf_ranked[:5]]
-    llm_choice = _llm_locate(question, gist, excluded, top_candidates, verbose=verbose)
-
+    # --- Step 4: Pure RRF (no LLM call) ---
+    # Phase 1 speed optimization: removed LLM classification entirely.
+    # Loop 5 finding: BM25+keyword RRF is more accurate AND 1000x faster
+    # than LLM classification on small models. LLM consistently picked
+    # wrong chunks; RRF is deterministic and reliable.
+    # Savings: ~15s per locate call (one fewer inference round-trip).
     rrf_top1 = rrf_ranked[0][0]
     rrf_top1_score = rrf_ranked[0][1]
     rrf_top2_score = rrf_ranked[1][1] if len(rrf_ranked) > 1 else 0.0
     rrf_margin = (rrf_top1_score - rrf_top2_score) / max(rrf_top1_score, 0.001)
 
-    # Day 5: always trust RRF. LLM classification on small models is
-    # unreliable — it consistently picks the wrong chunk. BM25+keyword
-    # RRF is deterministic and more accurate for entity lookup queries.
-    # LLM is only used as a tiebreaker when RRF margin is essentially zero.
     chosen = rrf_top1
     method = "rrf"
     confidence = "high" if rrf_margin > 0.05 else "medium"
 
-    if llm_choice >= 0 and llm_choice == rrf_top1:
-        method = "rrf+llm"
-        confidence = "high"
-    elif llm_choice >= 0 and rrf_margin < 0.005:
-        # Dead tie — let LLM break it
-        chosen = llm_choice
-        method = "rrf+llm-tiebreak"
-        confidence = "medium"
-
     if verbose:
         print(f"[locator] chosen: chunk {chosen} via {method} (confidence={confidence})")

bench/rlv/stages/lookup.py

Lines changed: 2 additions & 2 deletions
@@ -147,7 +147,7 @@ def lookup(
         mode = "direct-answer" if len(sentences) > MAX_SENTENCES_FOR_SELECT else "single-sentence"
         print(f"[lookup] chunk {region.chunk_id} ({len(region_text)} chars), "
               f"{len(sentences)} sentences -> {mode}")
-    result = _llm.llm_call(prompt, max_tokens=64)
+    result = _llm.llm_call(prompt, max_tokens=32)
     if result.is_error:
         return LookupResult(
             answer=result.text, region_text=region_text,
@@ -189,7 +189,7 @@ def lookup(
     prompt = LOOKUP_QUOTE_FALLBACK_TEMPLATE.format(
         region_text=region_text, question=question,
     )
-    result2 = _llm.llm_call(prompt, max_tokens=64)
+    result2 = _llm.llm_call(prompt, max_tokens=32)
     return LookupResult(
         answer=result2.text.strip(),
         region_text=region_text,

bench/rlv/stages/researcher.py

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 from .verifier import VerifyResult
 
 
-MAX_RETRIES = 3
+MAX_RETRIES = 2
 
 
 @dataclass

bench/rlv/stages/verifier.py

Lines changed: 3 additions & 3 deletions
@@ -218,7 +218,7 @@ def verify(
     *,
     region_text: str = "",
     chunk_id: int | None = None,
-    use_llm_fallback: bool = True,
+    use_llm_fallback: bool = True,  # Keep LLM fallback for accuracy (Q7 needs it)
     verbose: bool = False,
 ) -> VerifyResult:
     """Verify a tentative answer.
@@ -257,7 +257,7 @@ def verify(
         question=question,
         answer=answer,
     )
-    result = _llm.llm_call(prompt, max_tokens=24)
+    result = _llm.llm_call(prompt, max_tokens=8)
     v2, r2 = _parse_llm_verify_response(result.text)
     return VerifyResult(
         verdict=v2,
@@ -275,6 +275,6 @@ def verify(
         question=question,
         answer=answer,
     )
-    result = _llm.llm_call(prompt, max_tokens=24)
+    result = _llm.llm_call(prompt, max_tokens=8)
     verdict, reason = _parse_llm_verify_response(result.text)
     return VerifyResult(verdict=verdict, reason=reason, raw=result.text, method="llm")

docs/plan/rlv_roadmap_v2.md

Lines changed: 165 additions & 0 deletions
# RLV Roadmap v2 — Post-Breakthrough

> **Date**: 2026-04-12
> **Model**: Phi-3.5-mini (Q8_0, fixed)
> **Language**: English (fixed)
> **Baseline**: 19/20 (95%) on 20-question Wikitext stress test
> **Key constraint**: speed (currently ~4min/question, target ~30s)

---

## Current State

| Metric | Value |
|--------|-------|
| Accuracy | 19/20 (95%) |
| Speed | ~4 min/question |
| Model | Phi-3.5-mini Q8_0, 3.8B, CPU |
| Server | quant-server-unified (quant.h) |
| Inference | ~6.5 tok/s |
| Bottleneck | Every question = locator LLM + lookup LLM + verifier LLM = 3+ inference calls |

## Speed Budget Analysis

```
Current per-question breakdown (~240s):
  Locator LLM call:  ~15s  (classify chunk)
  Lookup LLM call:   ~20s  (extract answer, 64 tokens)
  Verifier:           ~5s  (literal check, fast)
  Research retries: ~200s  (0-3 extra locate+lookup cycles)

Without retries (~40s):
  Locator + Lookup + Verify = ~40s per question

Target: 30s without retries, ~60s with 1 retry
```

---

## Phase 1: Speed (Week 1) — "4min → 30s"

### 1.1 Eliminate locator LLM call
**Impact: -15s/question**

The LLM classification was unreliable (Loop 5 finding: RRF-first beats LLM). Remove the LLM call entirely from the locator — use pure BM25+keyword RRF.

```python
# Before: BM25 + keyword + LLM call (15s)
# After:  BM25 + keyword only (0.01s)
```
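
For context, a minimal sketch of the RRF fusion this step relies on, assuming per-chunk `bm25_scores` and `keyword_scores` dicts are computed upstream; `rrf_fuse` and the constant `k=60` are illustrative names and tuning, not the locator's actual helpers. The commented margin/confidence lines mirror the locator diff above.

```python
# Illustrative sketch only, not the locator's real implementation.
def rrf_fuse(bm25_scores: dict[int, float],
             keyword_scores: dict[int, float],
             k: int = 60) -> list[tuple[int, float]]:
    """Reciprocal rank fusion: each ranking contributes 1 / (k + rank)."""
    fused: dict[int, float] = {}
    for scores in (bm25_scores, keyword_scores):
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, chunk_id in enumerate(ranked, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Margin-based confidence, as in the locator diff:
# rrf_ranked = rrf_fuse(bm25_scores, keyword_scores)
# margin = (rrf_ranked[0][1] - rrf_ranked[1][1]) / max(rrf_ranked[0][1], 0.001)
# confidence = "high" if margin > 0.05 else "medium"
```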

### 1.2 KV cache pre-build with save_context
**Impact: -10s on lookup (eliminates prefill)**

Pre-compute KV cache for each chunk during gist stage. At lookup time, `load_context` instead of re-prefilling.

```python
# During gist build (one-time):
for chunk in chunks:
    ctx = quant_new(model, config)
    quant_generate(ctx, chunk.text, null_callback, null)  # prefill
    quant_save_context(ctx, f"cache/{chunk.id}.kv")

# During lookup (per-question):
ctx = quant_new(model, config)
quant_load_context(ctx, f"cache/{chunk_id}.kv")  # instant
quant_generate(ctx, question, on_token, data)  # generate only
```

### 1.3 Reduce max_tokens
**Impact: -5s per LLM call**

Most answers are <20 tokens. Reduce from 64 to 24 for lookup, 8 for locator.

### 1.4 Parallel-safe server (connection reuse)
**Impact: -2s (eliminate server restart overhead)**

Keep server running across questions. Current overhead: 0.5s startup per question × 3 calls = 1.5s waste.

### Phase 1 Target

| Metric | Before | After |
|--------|--------|-------|
| Locator | 15s (LLM) | **0.01s** (BM25 only) |
| Lookup | 20s | **10s** (KV cache + fewer tokens) |
| Verify | 5s | **2s** (literal only, no LLM fallback) |
| Retries | 200s | **30s** (faster per-retry) |
| **Total (no retry)** | **40s** | **~12s** |
| **Total (1 retry)** | **~80s** | **~25s** |

---

## Phase 2: Robustness (Week 2) — "Always correct or honestly uncertain"

### 2.1 Unanswerable question detection
Test with questions that have NO answer in the document. RLV should return "I don't know" with high confidence.
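
One way the abstain path could look, shown only as a hedged sketch: `answer_or_abstain` and the `verifier_ok` flag are hypothetical wiring, while the `"high"`/`"medium"` confidence labels come from the locator above.

```python
# Hypothetical abstain gate, not existing code; flag and label values are illustrative.
def answer_or_abstain(answer: str, verifier_ok: bool, locator_confidence: str) -> str:
    """Return the answer only when both stages back it; otherwise abstain honestly."""
    if verifier_ok and locator_confidence in ("high", "medium"):
        return answer
    return "I don't know"
```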

### 2.2 Long document scaling (100K+ tokens)
Test with 100K token documents (50+ chunks). Verify locator BM25 accuracy doesn't degrade.

### 2.3 Adversarial questions
- Misleading questions ("What year did Du Fu visit Paris?" — never happened)
- Ambiguous questions ("Who is the poet?" — multiple poets in document)
- Questions spanning multiple chunks

### 2.4 Accuracy regression suite
Freeze the 20-question Wikitext + 7-question Acme as a CI regression test. Any code change must pass 26/27+.
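
A sketch of what the CI gate could look like as a pytest check; `run_rlv_benchmark` and the question-set paths are placeholders for illustration, not existing helpers.

```python
# Hypothetical CI regression gate; helper name and file paths are assumptions.
def test_accuracy_regression():
    wikitext = run_rlv_benchmark("bench/questions/wikitext_20.json")  # 20 questions
    acme = run_rlv_benchmark("bench/questions/acme_7.json")           # 7 questions
    correct = wikitext.num_correct + acme.num_correct
    assert correct >= 26, f"accuracy regression: {correct}/27 correct (gate is 26/27)"
```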

---

## Phase 3: Depth (Week 3) — "Harder questions"

### 3.1 Comparison questions
"Compare Du Fu's early career with his later years" → requires reading 2+ chunks and synthesizing.

### 3.2 Query decomposition
"What was happening in Du Fu's personal life when the An Lushan Rebellion began?" → decompose into:
1. "When did the An Lushan Rebellion begin?" → 755
2. "What was Du Fu's situation in 755?" → family, famine

### 3.3 Multi-document RLV
Feed 3 separate documents. "Compare the careers of Robert Boulter and Du Fu" → cross-document synthesis.

---

## Phase 4: Product (Week 4) — "People can use it"

### 4.1 CLI integration
```bash
quantcpp rlv --doc document.txt "What is the main argument?"
```

### 4.2 HTTP API
```bash
quantcpp rlv serve --doc document.txt --port 8080
curl localhost:8080/ask -d '{"question": "..."}'
```

### 4.3 Technical report
"RLV: Cliff-Aware Document QA for Small Language Models"
- Cliff measurement methodology
- RLV architecture + Karpathy loop evolution
- 19/20 result + ablation study
- Comparison: RLV vs RAG vs Long-context

---

## Success Criteria

| Phase | Gate | Metric |
|-------|------|--------|
| 1 | Speed | < 30s/question (no retry) |
| 2 | Robustness | 90%+ on new 20 questions + 100% unanswerable detection |
| 3 | Depth | Comparison questions working |
| 4 | Product | End-user can run `quantcpp rlv` |

---

## Fixed Decisions

| Decision | Rationale |
|----------|-----------|
| English only | Focus on methodology, not multilingual tokenization |
| Phi-3.5-mini Q8_0 | Best speed/quality ratio (32K vocab, 6.5 tok/s) |
| CPU only | Democratization — runs on any laptop |
| quant.h unified server | Proven correct (no libturboquant sync bugs) |
