Commit a64e8de
analysis: Qwen3.5-4B scored 2/7 on Acme — Phi-3.5 remains best for RLV
Tested Qwen3.5-4B (DeltaNet hybrid) as RLV backbone:
- Acme: 2/7 (vs Phi-3.5 7/7)
- Cause: DeltaNet linear attention layers are weak at exact fact extraction
- Example: Q2 "Who was CTO?" → returns general HR info, not "Maria Santos" (see the scoring sketch below)
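For reference, a minimal sketch of the kind of exact-match scoring behind the 2/7 figure. The question set and the `ask()` callable are assumptions for illustration, not the project's actual harness:

```python
# Illustrative exact-match scorer for the 7-question Acme set.
# ACME_QA contents and the ask() callable are hypothetical, not the
# project's actual harness.
ACME_QA = [
    ("Who was CTO?", "Maria Santos"),
    # ... six more question/expected-fact pairs
]

def score(ask) -> int:
    """Count answers that contain the expected fact verbatim (case-insensitive)."""
    return sum(expected.lower() in ask(question).lower()
               for question, expected in ACME_QA)
```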
Phi-3.5-mini restored as RLV default. Its standard attention + small vocab
(32K) gives the best combination for document QA:
- Precise fact extraction (7/7 Acme, 19/20 large-doc)
- Reasonable speed (~6.5 tok/s Q8)
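For context, the restore amounts to switching the server's default model. A minimal sketch, assuming a dict-based registry; every name and path below is illustrative, not the actual configuration:

```python
# Hypothetical model registry for the unified server; names and
# paths are assumptions, not the real config.
MODELS = {
    "phi-3.5-mini": "models/Phi-3.5-mini-instruct-Q8_0.gguf",
    "qwen3.5-4b": "models/Qwen3.5-4B-Q8_0.gguf",
}
DEFAULT_RLV_MODEL = "phi-3.5-mini"  # restored: precise fact extraction wins
```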
Also:
- Added <think>/</think> token filtering in unified server (sketch after this list)
- System prompt: removed /no_think (Qwen3.5 doesn't support it)
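A minimal sketch of the think-token filtering, assuming post-hoc regex stripping on the completed response; `strip_think` and `THINK_RE` are illustrative names, not the server's actual code:

```python
import re

# Drop closed <think>...</think> reasoning blocks from model output.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text: str) -> str:
    """Remove reasoning blocks; truncate at an unclosed trailing <think>."""
    text = THINK_RE.sub("", text)
    open_idx = text.find("<think>")  # generation may be cut off mid-block
    return text[:open_idx].rstrip() if open_idx != -1 else text

# strip_think("<think>scan the HR section...</think>Maria Santos") -> "Maria Santos"
```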
Key insight: "best benchmark model" ≠ "best for specific task".
Qwen3.5-4B posts higher benchmark scores (MMLU, GSM8K), but Phi-3.5's
dense attention is better at the token-level precision that RLV's
lookup stage requires.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 files changed: 5 additions, 3 deletions