Skip to content

Commit ae1365e

Browse files
unamedkrclaude
andcommitted
phase 3 day 5: expand wikitext to 20 questions → RLV 19/20
Phase A robustness verification: doubled the test set from 10 to 20. New questions (Q11-Q20): Q11: Boulter's Casualty character → PASS (Kieron Fletcher) Q12: How to Curse theatre → PASS (Bush Theatre) Q13: Boulter's character in Blackburn's film → PASS (Sean) [multi-hop] Q14: Du Fu birth year → PASS (712) Q15: Du Fu's children count → PASS (five) Q16: Du Fu's Sichuan destination → PASS (Chengdu) Q17: Haiku poet influenced by Du Fu → PASS (Matsuo Basho) Q18: Du Fu "poet historian" epithet → FAIL [multi-hop] Q19: Kiss You Billboard Hot 100 peak → PASS (46) Q20: The Independent on Simon Cowell → PASS [multi-hop] Results: Original 10: 10/10 (maintained) New 10: 9/10 (Q18 failed — "poet historian" epithet not found) Total: 19/20 (95%) The single failure (Q18) is a difficult multi-hop question requiring the model to understand that a Chinese literary criticism term maps to the English translation "poet historian". This is a translation/ cultural knowledge gap, not a retrieval failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 61d8eea commit ae1365e

1 file changed

Lines changed: 58 additions & 0 deletions

File tree

bench/rlv/eval/eval_wikitext.py

Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -106,6 +106,64 @@
106106
"question": "Who directed the Kiss You music video?",
107107
"fragments": ["vaughan arnell", "arnell"],
108108
},
109+
110+
# === Phase A expanded set: 10 more questions (Q11-Q20) ===
111+
112+
# Boulter — deeper facts
113+
{
114+
"id": 11, "topic": "boulter", "type": "single-hop",
115+
"question": "What character did Boulter play in the TV series Casualty?",
116+
"fragments": ["kieron fletcher", "kieron", "fletcher"],
117+
},
118+
{
119+
"id": 12, "topic": "boulter", "type": "single-hop",
120+
"question": "At which theatre was the play How to Curse performed?",
121+
"fragments": ["bush theatre", "bush"],
122+
},
123+
{
124+
"id": 13, "topic": "boulter", "type": "multi-hop",
125+
"question": "What character did Boulter play in the film directed by Olly Blackburn?",
126+
"fragments": ["sean"],
127+
},
128+
129+
# Du Fu — deeper facts
130+
{
131+
"id": 14, "topic": "dufu", "type": "single-hop",
132+
"question": "In what year was Du Fu born?",
133+
"fragments": ["712"],
134+
},
135+
{
136+
"id": 15, "topic": "dufu", "type": "single-hop",
137+
"question": "How many children did Du Fu have by 757?",
138+
"fragments": ["five", "5"],
139+
},
140+
{
141+
"id": 16, "topic": "dufu", "type": "single-hop",
142+
"question": "Which city did Du Fu move to in December, fleeing to Sichuan province?",
143+
"fragments": ["chengdu"],
144+
},
145+
{
146+
"id": 17, "topic": "dufu", "type": "single-hop",
147+
"question": "Which famous haiku poet was strongly influenced by Du Fu?",
148+
"fragments": ["matsuo basho", "basho", "bash"],
149+
},
150+
{
151+
"id": 18, "topic": "dufu", "type": "multi-hop",
152+
"question": "What two-word epithet meaning 'poet historian' did Chinese critics give Du Fu?",
153+
"fragments": ["poet historian", "historian"],
154+
},
155+
156+
# Kiss You — deeper facts
157+
{
158+
"id": 19, "topic": "kiss_you", "type": "single-hop",
159+
"question": "At what number did Kiss You peak on the US Billboard Hot 100?",
160+
"fragments": ["46"],
161+
},
162+
{
163+
"id": 20, "topic": "kiss_you", "type": "multi-hop",
164+
"question": "According to The Independent, who reported that Simon Cowell wanted Kiss You as the lead single?",
165+
"fragments": ["independent"],
166+
},
109167
]
110168

111169

0 commit comments

Comments
 (0)