
Commit d06d0bc

unamedkr and claude committed
phase 3 day 5: Phi-3.5 Q8 + lookup improvements → Acme 7/7
phase 3 day 5: Phi-3.5 Q8 + lookup improvements → Acme 7/7

Karpathy loop on Acme benchmark with Phi-3.5-mini (Q8_0, unified server):

Baseline: 5/7
Loop 1: lookup prompt "DIRECTLY answers" → 6/7 (+Q1, +Q7)
Loop 2: 3-sentence window (was 2) → 7/7 (+Q6)

Changes:
- _llm.py: switch to Q8_0 model (3x faster than Q4_K_M)
- lookup.py: stronger select-by-index prompt + 3-sentence context window

D3 gate: PASS (7/7, 157s, ~22s/question on M3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e9b4df5 commit d06d0bc

2 files changed: 14 additions & 10 deletions


bench/rlv/stages/_llm.py

Lines changed: 2 additions & 2 deletions
@@ -28,7 +28,7 @@
 # that produced garbage for Phi-3.5/SmolLM2. The unified server compiles
 # quant.h as a single translation unit — no sync issues.
 # Phi-3.5: ~1.15 tok/s (CPU NEON), ~6.5 tok/s reported in PR #79.
-DEFAULT_MODEL = REPO / "models" / "Phi-3.5-mini-instruct-Q4_K_M.gguf"
+DEFAULT_MODEL = REPO / "models" / "Phi-3.5-mini-instruct-Q8_0.gguf"
 DEFAULT_SERVER_BINARY = REPO / "build_metal" / "quant-server-unified"
 DEFAULT_SERVER_HOST = "127.0.0.1"
 DEFAULT_SERVER_PORT = 8421  # arbitrary, avoid conflicts with 8080
@@ -41,7 +41,7 @@
 CLIFF_BUDGET = {
     "models/Llama-3.2-3B-Instruct-Q8_0.gguf": 1024,
     "models/Llama-3.2-1B-Instruct-Q8_0.gguf": 512,
-    "models/Phi-3.5-mini-instruct-Q4_K_M.gguf": 1024,
+    "models/Phi-3.5-mini-instruct-Q8_0.gguf": 1024,
 }
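A per-model budget table like `CLIFF_BUDGET` above is presumably keyed by repo-relative path. A minimal sketch of how it might be consulted; the `budget_for` helper and its `default` fallback are assumptions, not code from this repo:

```python
from pathlib import Path

CLIFF_BUDGET = {
    "models/Llama-3.2-3B-Instruct-Q8_0.gguf": 1024,
    "models/Llama-3.2-1B-Instruct-Q8_0.gguf": 512,
    "models/Phi-3.5-mini-instruct-Q8_0.gguf": 1024,
}

def budget_for(model_path: Path, repo: Path, default: int = 512) -> int:
    """Look up the token budget for a model, falling back to a safe default."""
    # Key by repo-relative POSIX path so it matches the string keys above
    # regardless of the host OS path separator.
    rel = model_path.relative_to(repo).as_posix()
    return CLIFF_BUDGET.get(rel, default)
```

Defaulting low (512) rather than raising keeps an unrecognized GGUF runnable while staying under the cliff the table is guarding against.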

bench/rlv/stages/lookup.py

Lines changed: 12 additions & 8 deletions
@@ -28,13 +28,13 @@
 
 # Day 3 v3: numbered-sentence selection prompt. The model picks an
 # integer; we map it back to a verbatim sentence.
-LOOKUP_PROMPT_TEMPLATE = """Sentences from the document:
+LOOKUP_PROMPT_TEMPLATE = """Read these sentences carefully:
 
 {numbered_sentences}
 
 Question: {question}
 
-Which sentence number contains the answer? Reply with only one digit: the sentence number."""
+Which sentence number DIRECTLY answers the question? Pick the sentence that contains the specific fact being asked about. Reply with ONLY the number."""
 
 # Fallback "quote" prompt for chunks with very few sentences (≤1) where
 # selection is trivial and we can ask the model directly.
@@ -159,12 +159,16 @@ def lookup(
         # in Mercury Fur. He was directed by John Tiffany." — picking either
         # sentence alone loses the connection). For Acme-style structured
         # docs, the previous sentence is benign extra context.
-        selected = sentences[idx - 1]
-        if idx >= 2:
-            prev = sentences[idx - 2]
-            answer = f"{prev} {selected}"
-        else:
-            answer = selected
+        # Return a 3-sentence window centered on the selected sentence.
+        # Multi-hop questions often require context from adjacent sentences
+        # (e.g., "strategy proposed at what event?" spans sentences about
+        # the strategy AND the event name in the next sentence).
+        window = []
+        for offset in range(-1, 2):  # prev, selected, next
+            i = idx - 1 + offset
+            if 0 <= i < len(sentences):
+                window.append(sentences[i])
+        answer = " ".join(window)
         if verbose:
             print(f"[lookup] selected sentence {idx}/{len(sentences)}: {selected[:80]!r}")
         return LookupResult(
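The 3-sentence window logic from the hunk above, restated as a standalone helper. `sentence_window` and its `radius` parameter are illustrative names, not identifiers from the repo:

```python
def sentence_window(sentences: list[str], idx: int, radius: int = 1) -> str:
    """Join the sentences within `radius` of the 1-based selected index.

    Clamps at document boundaries, so the window simply shrinks at the
    edges instead of raising IndexError or wrapping around.
    """
    center = idx - 1  # convert the model's 1-based pick to a 0-based index
    return " ".join(
        sentences[i]
        for i in range(center - radius, center + radius + 1)
        if 0 <= i < len(sentences)
    )
```

With `radius=1` this reproduces the prev/selected/next behavior of the diff; the old 2-sentence code was the asymmetric special case that only looked backward, which is what Loop 2 in the commit message replaced.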
