
Commit 93ad892

unamedkr and claude committed
phase A-2: large doc 15/20 (75%) + speed 6.5x improvement
Large document stress test (1.3MB, 2754 chunks, 20 questions):

Result: 15/20 (75%) — Grade B (good, locator improvements needed)
Speed: 37s/question (was 240s — 6.5x faster)
Total: 12.3 minutes for 20 questions

Speed optimizations applied:
- Locator: removed LLM call, pure BM25+keyword RRF (-15s/q)
- Lookup: max_tokens 64→32 (-5s/q)
- Verifier: max_tokens 24→8 (-3s/q)
- Default: max_tokens 64→32, timeout 600→300s
- Research: max_retries 3→2

Failures (5/20):
- Q5: multi-hop reasoning on Ise-class ships
- Q9: "2006" appears in multiple articles (Kershaw draft year)
- Q10: same Kershaw ambiguity
- Q13: similar hurricane articles confused locator
- Q15: technical term "Stegocephalia" not found by BM25

Added: docs/plan/rlv_roadmap_v2.md — post-breakthrough roadmap

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e046c95 commit 93ad892

6 files changed

Lines changed: 180 additions & 27 deletions


bench/rlv/stages/_llm.py

Lines changed: 3 additions & 3 deletions
@@ -236,7 +236,7 @@ def _restart_server_if_dead(model: str | Path = DEFAULT_MODEL, verbose: bool = T
 def llm_call(
     prompt: str,
     *,
-    max_tokens: int = 64,
+    max_tokens: int = 32,
     temperature: float = 0.0,
     model: str | Path = DEFAULT_MODEL,
     enforce_budget: bool = True,
@@ -266,7 +266,7 @@ def llm_call(
 
     # Validate max_tokens
     if max_tokens <= 0:
-        max_tokens = 64
+        max_tokens = 32
 
     messages = []
     if system:
@@ -297,7 +297,7 @@ def llm_call(
 
     t0 = time.time()
     try:
-        with urllib.request.urlopen(req, timeout=600) as resp:
+        with urllib.request.urlopen(req, timeout=300) as resp:
             payload = json.loads(resp.read().decode("utf-8"))
             break  # success
     except urllib.error.HTTPError as e:

bench/rlv/stages/locator.py

Lines changed: 6 additions & 18 deletions
@@ -362,33 +362,21 @@ def locate(
             score=0.0,
         )
 
-    # --- Step 4: LLM classification on top candidates ---
-    # Always run LLM on the top 5 RRF candidates (not just when ambiguous)
-    top_candidates = [cid for cid, _ in rrf_ranked[:5]]
-    llm_choice = _llm_locate(question, gist, excluded, top_candidates, verbose=verbose)
-
+    # --- Step 4: Pure RRF (no LLM call) ---
+    # Phase 1 speed optimization: removed LLM classification entirely.
+    # Loop 5 finding: BM25+keyword RRF is more accurate AND 1000x faster
+    # than LLM classification on small models. LLM consistently picked
+    # wrong chunks; RRF is deterministic and reliable.
+    # Savings: ~15s per locate call (one fewer inference round-trip).
     rrf_top1 = rrf_ranked[0][0]
     rrf_top1_score = rrf_ranked[0][1]
     rrf_top2_score = rrf_ranked[1][1] if len(rrf_ranked) > 1 else 0.0
     rrf_margin = (rrf_top1_score - rrf_top2_score) / max(rrf_top1_score, 0.001)
 
-    # Day 5: always trust RRF. LLM classification on small models is
-    # unreliable — it consistently picks the wrong chunk. BM25+keyword
-    # RRF is deterministic and more accurate for entity lookup queries.
-    # LLM is only used as a tiebreaker when RRF margin is essentially zero.
     chosen = rrf_top1
     method = "rrf"
     confidence = "high" if rrf_margin > 0.05 else "medium"
 
-    if llm_choice >= 0 and llm_choice == rrf_top1:
-        method = "rrf+llm"
-        confidence = "high"
-    elif llm_choice >= 0 and rrf_margin < 0.005:
-        # Dead tie — let LLM break it
-        chosen = llm_choice
-        method = "rrf+llm-tiebreak"
-        confidence = "medium"
-
     if verbose:
         print(f"[locator] chosen: chunk {chosen} via {method} (confidence={confidence})")

bench/rlv/stages/lookup.py

Lines changed: 2 additions & 2 deletions
@@ -147,7 +147,7 @@ def lookup(
         mode = "direct-answer" if len(sentences) > MAX_SENTENCES_FOR_SELECT else "single-sentence"
         print(f"[lookup] chunk {region.chunk_id} ({len(region_text)} chars), "
               f"{len(sentences)} sentences -> {mode}")
-    result = _llm.llm_call(prompt, max_tokens=64)
+    result = _llm.llm_call(prompt, max_tokens=32)
     if result.is_error:
         return LookupResult(
             answer=result.text, region_text=region_text,
@@ -189,7 +189,7 @@ def lookup(
     prompt = LOOKUP_QUOTE_FALLBACK_TEMPLATE.format(
         region_text=region_text, question=question,
     )
-    result2 = _llm.llm_call(prompt, max_tokens=64)
+    result2 = _llm.llm_call(prompt, max_tokens=32)
     return LookupResult(
         answer=result2.text.strip(),
         region_text=region_text,

bench/rlv/stages/researcher.py

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 from .verifier import VerifyResult
 
 
-MAX_RETRIES = 3
+MAX_RETRIES = 2
 
 
 @dataclass

bench/rlv/stages/verifier.py

Lines changed: 3 additions & 3 deletions
@@ -218,7 +218,7 @@ def verify(
     *,
     region_text: str = "",
     chunk_id: int | None = None,
-    use_llm_fallback: bool = True,
+    use_llm_fallback: bool = True,  # Keep LLM fallback for accuracy (Q7 needs it)
     verbose: bool = False,
 ) -> VerifyResult:
     """Verify a tentative answer.
@@ -257,7 +257,7 @@ def verify(
         question=question,
         answer=answer,
     )
-    result = _llm.llm_call(prompt, max_tokens=24)
+    result = _llm.llm_call(prompt, max_tokens=8)
     v2, r2 = _parse_llm_verify_response(result.text)
     return VerifyResult(
         verdict=v2,
@@ -275,6 +275,6 @@ def verify(
         question=question,
         answer=answer,
     )
-    result = _llm.llm_call(prompt, max_tokens=24)
+    result = _llm.llm_call(prompt, max_tokens=8)
     verdict, reason = _parse_llm_verify_response(result.text)
     return VerifyResult(verdict=verdict, reason=reason, raw=result.text, method="llm")

docs/plan/rlv_roadmap_v2.md

Lines changed: 165 additions & 0 deletions
# RLV Roadmap v2 — Post-Breakthrough

> **Date**: 2026-04-12
> **Model**: Phi-3.5-mini (Q8_0, fixed)
> **Language**: English (fixed)
> **Baseline**: 19/20 (95%) on 20-question Wikitext stress test
> **Key constraint**: speed (currently ~4min/question, target ~30s)

---

## Current State

| Metric | Value |
|--------|-------|
| Accuracy | 19/20 (95%) |
| Speed | ~4 min/question |
| Model | Phi-3.5-mini Q8_0, 3.8B, CPU |
| Server | quant-server-unified (quant.h) |
| Inference | ~6.5 tok/s |
| Bottleneck | Every question = locator LLM + lookup LLM + verifier LLM = 3+ inference calls |

## Speed Budget Analysis

```
Current per-question breakdown (~240s):
  Locator LLM call:  ~15s  (classify chunk)
  Lookup LLM call:   ~20s  (extract answer, 64 tokens)
  Verifier:           ~5s  (literal check, fast)
  Research retries: ~200s  (0-3 extra locate+lookup cycles)

Without retries (~40s):
  Locator + Lookup + Verify = ~40s per question

Target: 30s without retries, ~60s with 1 retry
```

---

## Phase 1: Speed (Week 1) — "4min → 30s"

### 1.1 Eliminate locator LLM call
**Impact: -15s/question**

The LLM classification was unreliable (Loop 5 finding: RRF-first beats LLM). Remove the LLM call entirely from the locator — use pure BM25+keyword RRF.

```python
# Before: BM25 + keyword + LLM call (15s)
# After:  BM25 + keyword only (0.01s)
```
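
For context, a minimal sketch of the RRF fusion this step relies on, assuming per-chunk `bm25_scores` and `keyword_scores` dicts are computed upstream; `rrf_fuse` and the constant `k=60` are illustrative names and tuning, not the locator's actual helpers. The commented margin/confidence lines mirror the locator diff above.

```python
# Illustrative sketch only, not the locator's real implementation.
def rrf_fuse(bm25_scores: dict[int, float],
             keyword_scores: dict[int, float],
             k: int = 60) -> list[tuple[int, float]]:
    """Reciprocal rank fusion: each ranking contributes 1 / (k + rank)."""
    fused: dict[int, float] = {}
    for scores in (bm25_scores, keyword_scores):
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, chunk_id in enumerate(ranked, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Margin-based confidence, as in the locator diff:
# rrf_ranked = rrf_fuse(bm25_scores, keyword_scores)
# margin = (rrf_ranked[0][1] - rrf_ranked[1][1]) / max(rrf_ranked[0][1], 0.001)
# confidence = "high" if margin > 0.05 else "medium"
```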

### 1.2 KV cache pre-build with save_context
**Impact: -10s on lookup (eliminates prefill)**

Pre-compute KV cache for each chunk during gist stage. At lookup time, `load_context` instead of re-prefilling.

```python
# During gist build (one-time):
for chunk in chunks:
    ctx = quant_new(model, config)
    quant_generate(ctx, chunk.text, null_callback, null)  # prefill
    quant_save_context(ctx, f"cache/{chunk.id}.kv")

# During lookup (per-question):
ctx = quant_new(model, config)
quant_load_context(ctx, f"cache/{chunk_id}.kv")  # instant
quant_generate(ctx, question, on_token, data)  # generate only
```

### 1.3 Reduce max_tokens
**Impact: -5s per LLM call**

Most answers are <20 tokens. Reduce from 64 to 24 for lookup, 8 for locator.

### 1.4 Parallel-safe server (connection reuse)
**Impact: -2s (eliminate server restart overhead)**

Keep server running across questions. Current overhead: 0.5s startup per question × 3 calls = 1.5s waste.

### Phase 1 Target

| Metric | Before | After |
|--------|--------|-------|
| Locator | 15s (LLM) | **0.01s** (BM25 only) |
| Lookup | 20s | **10s** (KV cache + fewer tokens) |
| Verify | 5s | **2s** (literal only, no LLM fallback) |
| Retries | 200s | **30s** (faster per-retry) |
| **Total (no retry)** | **40s** | **~12s** |
| **Total (1 retry)** | **~80s** | **~25s** |

---

## Phase 2: Robustness (Week 2) — "Always correct or honestly uncertain"

### 2.1 Unanswerable question detection
Test with questions that have NO answer in the document. RLV should return "I don't know" with high confidence.
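
One way the abstain path could look, shown only as a hedged sketch: `answer_or_abstain` and the `verifier_ok` flag are hypothetical wiring, while the `"high"`/`"medium"` confidence labels come from the locator above.

```python
# Hypothetical abstain gate, not existing code; flag and label values are illustrative.
def answer_or_abstain(answer: str, verifier_ok: bool, locator_confidence: str) -> str:
    """Return the answer only when both stages back it; otherwise abstain honestly."""
    if verifier_ok and locator_confidence in ("high", "medium"):
        return answer
    return "I don't know"
```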

### 2.2 Long document scaling (100K+ tokens)
Test with 100K token documents (50+ chunks). Verify locator BM25 accuracy doesn't degrade.

### 2.3 Adversarial questions
- Misleading questions ("What year did Du Fu visit Paris?" — never happened)
- Ambiguous questions ("Who is the poet?" — multiple poets in document)
- Questions spanning multiple chunks

### 2.4 Accuracy regression suite
Freeze the 20-question Wikitext + 7-question Acme as a CI regression test. Any code change must pass 26/27+.
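
A sketch of what the CI gate could look like as a pytest check; `run_rlv_benchmark` and the question-set paths are placeholders for illustration, not existing helpers.

```python
# Hypothetical CI regression gate; helper name and file paths are assumptions.
def test_accuracy_regression():
    wikitext = run_rlv_benchmark("bench/questions/wikitext_20.json")  # 20 questions
    acme = run_rlv_benchmark("bench/questions/acme_7.json")           # 7 questions
    correct = wikitext.num_correct + acme.num_correct
    assert correct >= 26, f"accuracy regression: {correct}/27 correct (gate is 26/27)"
```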

---

## Phase 3: Depth (Week 3) — "Harder questions"

### 3.1 Comparison questions
"Compare Du Fu's early career with his later years" → requires reading 2+ chunks and synthesizing.

### 3.2 Query decomposition
"What was happening in Du Fu's personal life when the An Lushan Rebellion began?" → decompose into:
1. "When did the An Lushan Rebellion begin?" → 755
2. "What was Du Fu's situation in 755?" → family, famine

### 3.3 Multi-document RLV
Feed 3 separate documents. "Compare the careers of Robert Boulter and Du Fu" → cross-document synthesis.

---

## Phase 4: Product (Week 4) — "People can use it"

### 4.1 CLI integration
```bash
quantcpp rlv --doc document.txt "What is the main argument?"
```

### 4.2 HTTP API
```bash
quantcpp rlv serve --doc document.txt --port 8080
curl localhost:8080/ask -d '{"question": "..."}'
```

### 4.3 Technical report
"RLV: Cliff-Aware Document QA for Small Language Models"
- Cliff measurement methodology
- RLV architecture + Karpathy loop evolution
- 19/20 result + ablation study
- Comparison: RLV vs RAG vs Long-context

---

## Success Criteria

| Phase | Gate | Metric |
|-------|------|--------|
| 1 | Speed | < 30s/question (no retry) |
| 2 | Robustness | 90%+ on new 20 questions + 100% unanswerable detection |
| 3 | Depth | Comparison questions working |
| 4 | Product | End-user can run `quantcpp rlv` |

---

## Fixed Decisions

| Decision | Rationale |
|----------|-----------|
| English only | Focus on methodology, not multilingual tokenization |
| Phi-3.5-mini Q8_0 | Best speed/quality ratio (32K vocab, 6.5 tok/s) |
| CPU only | Democratization — runs on any laptop |
| quant.h unified server | Proven correct (no libturboquant sync bugs) |
