
Commit a64e8de

unamedkr and claude committed
analysis: Qwen3.5-4B scored 2/7 on Acme — Phi-3.5 remains best for RLV
Tested Qwen3.5-4B (DeltaNet hybrid) as the RLV backbone:

- Acme: 2/7 (vs Phi-3.5 7/7)
- Cause: DeltaNet linear attention layers are weak at exact fact extraction
- Example: Q2 "Who was CTO?" → returns general HR info, not "Maria Santos"

Phi-3.5-mini restored as the RLV default. Its standard attention + small vocab (32K) gives the best combination for document QA:

- Precise fact extraction (7/7 Acme, 19/20 large-doc)
- Reasonable speed (~6.5 tok/s Q8)

Also:

- Added <think>/</think> token filtering in the unified server
- System prompt: removed /no_think (Qwen3.5 doesn't support it)

Key insight: "best benchmark model" ≠ "best for specific task". Qwen3.5-4B has superior benchmarks (MMLU, GSM8K), but Phi-3.5's dense attention is better at the token-level precision that RLV's lookup stage requires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e12fcbd commit a64e8de

2 files changed

Lines changed: 5 additions & 3 deletions


bench/rlv/stages/_llm.py

Lines changed: 1 addition & 1 deletion
@@ -209,7 +209,7 @@ def stop_server():
 # reasoning chains in chat mode. Verified with the Acme test doc:
 # without this, the model picks the first entity (primacy bias);
 # with this, it correctly identifies the requested role.
-DEFAULT_SYSTEM_PROMPT = "/no_think\nAnswer in one short sentence. No reasoning steps."
+DEFAULT_SYSTEM_PROMPT = "Answer in one short sentence. No reasoning steps."
 
 
 MAX_LLM_RETRIES = 2  # retry once on transient server errors

tools/quant_server_unified.c

Lines changed: 4 additions & 2 deletions
@@ -260,7 +260,8 @@ static void stream_on_token(const char* text, void* user_data) {
         strstr(text, "<|endoftext|>") ||
         strstr(text, "<start_of_turn>") || strstr(text, "<end_of_turn>") ||
         strstr(text, "<|turn>") || strstr(text, "<turn|>") ||
-        strstr(text, "<|think|>") || strstr(text, "<|channel>") ||
+        strstr(text, "<|think|>") || strstr(text, "<think>") ||
+        strstr(text, "</think>") || strstr(text, "<|channel>") ||
         strstr(text, "<eos>")) return;
 
     /* JSON-escape the token */
@@ -298,7 +299,8 @@ static void collect_on_token(const char* text, void* user_data) {
         strstr(text, "<|endoftext|>") ||
         strstr(text, "<start_of_turn>") || strstr(text, "<end_of_turn>") ||
         strstr(text, "<|turn>") || strstr(text, "<turn|>") ||
-        strstr(text, "<|think|>") || strstr(text, "<|channel>") ||
+        strstr(text, "<|think|>") || strstr(text, "<think>") ||
+        strstr(text, "</think>") || strstr(text, "<|channel>") ||
         strstr(text, "<eos>")) return;
 
     size_t tlen = strlen(text);
