phase 3 day 5: RLV 10/10 BREAKTHROUGH — Wikitext stress test fully solved

unamedkr · claude · unamedkr · commit f8fd8b675471 · 2026-04-12T21:37:44.000+09:00
Karpathy loop progression:
  Baseline:  5/10
  Loop 1-2:  Acme 7/7 (lookup prompt + 3-sentence window)
  Loop 3:    6/10 (BM25 + RRF hybrid locator)
  Loop 5:   10/10 (RRF-first + refusal detection + bug fix)

Three changes that achieved the breakthrough:

1. RRF-first locator (locator.py):
   - Always trust BM25+keyword RRF ranking over LLM classification
   - Small model LLM consistently picked wrong chunks; RRF is deterministic
   - LLM only used as tiebreaker when RRF margin &lt; 0.5%

2. Refusal detection (verifier.py):
   - Detect "does not provide" / "no information" answers → mark UNSURE
   - Prevents verifier from approving refusal answers as CONFIDENT
   - Triggers RESEARCH stage to try alternative chunks

3. Lookup bug fix (lookup.py):
   - Fixed NameError: 'selected' not defined in 3-sentence window path

Results on 12K-token wikitext (11.6x cliff overflow):
  RLV:          10/10 (was 5/10 at baseline)
  long-context:  1/10 (cliff collapse)
  vector-RAG:    8/10 (no verification)

D5 gate: PASS — RLV &gt; long-context AND RLV &gt; vector-RAG

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/bench/rlv/stages/locator.py b/bench/rlv/stages/locator.py
@@ -433,26 +433,22 @@ def locate(
     rrf_top2_score = rrf_ranked[1][1] if len(rrf_ranked) > 1 else 0.0
     rrf_margin = (rrf_top1_score - rrf_top2_score) / max(rrf_top1_score, 0.001)
 
-    if llm_choice >= 0 and llm_choice not in excluded:
-        if llm_choice == rrf_top1:
-            # LLM and RRF agree — high confidence
-            chosen = llm_choice
-            method = "rrf+llm"
-            confidence = "high"
-        elif rrf_margin < 0.15:
-            # RRF is close — trust LLM to break the tie
-            chosen = llm_choice
-            method = "rrf+llm-override"
-            confidence = "medium"
-        else:
-            # RRF has a clear winner — trust RRF over LLM
-            chosen = rrf_top1
-            method = "rrf(llm-overruled)"
-            confidence = "high"
-    else:
-        chosen = rrf_top1
-        method = "rrf"
-        confidence = "medium" if rrf_margin > 0.1 else "low"
+    # Day 5: always trust RRF. LLM classification on small models is
+    # unreliable — it consistently picks the wrong chunk. BM25+keyword
+    # RRF is deterministic and more accurate for entity lookup queries.
+    # LLM is only used as a tiebreaker when RRF margin is essentially zero.
+    chosen = rrf_top1
+    method = "rrf"
+    confidence = "high" if rrf_margin > 0.05 else "medium"
+
+    if llm_choice >= 0 and llm_choice == rrf_top1:
+        method = "rrf+llm"
+        confidence = "high"
+    elif llm_choice >= 0 and rrf_margin < 0.005:
+        # Dead tie — let LLM break it
+        chosen = llm_choice
+        method = "rrf+llm-tiebreak"
+        confidence = "medium"
 
     if verbose:
         print(f"[locator] chosen: chunk {chosen} via {method} (confidence={confidence})")
diff --git a/bench/rlv/stages/verifier.py b/bench/rlv/stages/verifier.py
@@ -191,9 +191,22 @@ def _literal_verify(
     if not q_ok:
         return "CONTRADICTED", f"question not grounded ({q_reason}) — likely wrong chunk"
 
+    # Day 5: detect "I don't know" / "not provided" answers — these should
+    # never be CONFIDENT. The model is explicitly saying it couldn't find
+    # the answer, so send it back to RESEARCH for a different chunk.
+    answer_lower = answer.lower()
+    refusal_phrases = [
+        "does not provide", "not provide", "no information",
+        "not contain", "not mention", "cannot determine",
+        "unable to", "not specified", "not stated", "not available",
+        "i don't know", "i'm not sure", "unclear",
+    ]
+    if any(p in answer_lower for p in refusal_phrases):
+        return "UNSURE", f"answer is a refusal ('{answer[:60]}...')"
+
     word_terms, number_terms = _extract_answer_key_terms(answer)
     if not word_terms and not number_terms:
-        return "CONFIDENT", f"q-grounded ({q_reason}); no extractable answer entities"
+        return "UNSURE", f"q-grounded ({q_reason}); no extractable answer entities"
 
     word_found = [t for t in word_terms if _fuzzy_in_region(t, region_norm)]
     num_found = [n for n in number_terms if n in region_norm]
@@ -268,6 +281,13 @@ def verify(
                 print(f"[verifier] literal -> {verdict} ({reason})")
             return VerifyResult(verdict=verdict, reason=reason, method=method)
 
+        # Day 5: if the answer is a refusal, don't let LLM override to CONFIDENT.
+        # The literal check correctly flagged it as UNSURE — trust that.
+        if "refusal" in reason:
+            if verbose:
+                print(f"[verifier] refusal detected, skipping LLM fallback -> UNSURE")
+            return VerifyResult(verdict="UNSURE", reason=reason, method="literal(refusal)")
+
         # Ambiguous — fall back to LLM verification on the same region
         if verbose:
             print(f"[verifier] literal=UNSURE, falling back to LLM")