Commit ae1365e
phase 3 day 5: expand wikitext to 20 questions → RLV 19/20
Phase A robustness verification: doubled the test set from 10 to 20.
New questions (Q11-Q20):
Q11: Boulter's Casualty character → PASS (Kieron Fletcher)
Q12: How to Curse theatre → PASS (Bush Theatre)
Q13: Boulter's character in Blackburn's film → PASS (Sean) [multi-hop]
Q14: Du Fu birth year → PASS (712)
Q15: Du Fu's children count → PASS (five)
Q16: Du Fu's Sichuan destination → PASS (Chengdu)
Q17: Haiku poet influenced by Du Fu → PASS (Matsuo Basho)
Q18: Du Fu "poet historian" epithet → FAIL [multi-hop]
Q19: Kiss You Billboard Hot 100 peak → PASS (46)
Q20: The Independent on Simon Cowell → PASS [multi-hop]
Results:
Original 10: 10/10 (maintained)
New 10: 9/10 (Q18 failed — "poet historian" epithet not found)
Total: 19/20 (95%)
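The tally above can be re-derived from the per-question outcomes listed in this message. This is a minimal sketch for checking the arithmetic only: the `new_results` mapping and the `summarize` helper are hypothetical names introduced here, not part of the committed test harness.

```python
# Outcomes for the new questions, copied from the Q11-Q20 list in this message.
new_results = {
    "Q11": True, "Q12": True, "Q13": True, "Q14": True, "Q15": True,
    "Q16": True, "Q17": True, "Q18": False, "Q19": True, "Q20": True,
}

# "Original 10: 10/10 (maintained)" as reported above.
ORIGINAL_PASSED, ORIGINAL_TOTAL = 10, 10


def summarize(results, base_passed=0, base_total=0):
    """Return (passed, total, percent) for results plus an optional baseline."""
    passed = base_passed + sum(results.values())
    total = base_total + len(results)
    return passed, total, 100.0 * passed / total


new_passed, new_total, _ = summarize(new_results)
passed, total, pct = summarize(new_results, ORIGINAL_PASSED, ORIGINAL_TOTAL)
failed = [q for q, ok in new_results.items() if not ok]

print(f"New 10: {new_passed}/{new_total}")
print(f"Total: {passed}/{total} ({pct:.0f}%)")
print("Failed:", failed)
```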
The single failure (Q18) is a difficult multi-hop question requiring
the model to understand that a Chinese literary criticism term maps
to the English translation "poet historian". This is a translation/
cultural knowledge gap, not a retrieval failure.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 61d8eea commit ae1365e
1 file changed
Lines changed: 58 additions & 0 deletions
Diff hunk: original lines 106-108 and 109-111 unchanged (new lines 106-108 and 167-169); new lines 109-166 added between them. The added content is not preserved here.