
Commit a64e8de

unamedkr and claude committed
analysis: Qwen3.5-4B scored 2/7 on Acme — Phi-3.5 remains best for RLV
Tested Qwen3.5-4B (DeltaNet hybrid) as the RLV backbone:

- Acme: 2/7 (vs Phi-3.5 7/7)
- Cause: DeltaNet linear attention layers are weak at exact fact extraction
- Example: Q2 "Who was CTO?" → returns general HR info, not "Maria Santos"

Phi-3.5-mini restored as the RLV default. Its standard attention + small vocab (32K) gives the best combination for document QA:

- Precise fact extraction (7/7 Acme, 19/20 large-doc)
- Reasonable speed (~6.5 tok/s Q8)

Also:

- Added <think>/</think> token filtering in the unified server
- System prompt: removed /no_think (Qwen3.5 doesn't support it)

Key insight: "best benchmark model" ≠ "best for specific task". Qwen3.5-4B has superior benchmarks (MMLU, GSM8K), but Phi-3.5's dense attention is better at the token-level precision that RLV's lookup stage requires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e12fcbd commit a64e8de

2 files changed

Lines changed: 5 additions & 3 deletions


bench/rlv/stages/_llm.py

Lines changed: 1 addition & 1 deletion
@@ -209,7 +209,7 @@ def stop_server():
 # reasoning chains in chat mode. Verified with the Acme test doc:
 # without this, the model picks the first entity (primacy bias);
 # with this, it correctly identifies the requested role.
-DEFAULT_SYSTEM_PROMPT = "/no_think\nAnswer in one short sentence. No reasoning steps."
+DEFAULT_SYSTEM_PROMPT = "Answer in one short sentence. No reasoning steps."
 
 
 MAX_LLM_RETRIES = 2  # retry once on transient server errors

tools/quant_server_unified.c

Lines changed: 4 additions & 2 deletions
@@ -260,7 +260,8 @@ static void stream_on_token(const char* text, void* user_data) {
         strstr(text, "<|endoftext|>") ||
         strstr(text, "<start_of_turn>") || strstr(text, "<end_of_turn>") ||
         strstr(text, "<|turn>") || strstr(text, "<turn|>") ||
-        strstr(text, "<|think|>") || strstr(text, "<|channel>") ||
+        strstr(text, "<|think|>") || strstr(text, "<think>") ||
+        strstr(text, "</think>") || strstr(text, "<|channel>") ||
         strstr(text, "<eos>")) return;
 
     /* JSON-escape the token */
@@ -298,7 +299,8 @@ static void collect_on_token(const char* text, void* user_data) {
         strstr(text, "<|endoftext|>") ||
         strstr(text, "<start_of_turn>") || strstr(text, "<end_of_turn>") ||
         strstr(text, "<|turn>") || strstr(text, "<turn|>") ||
-        strstr(text, "<|think|>") || strstr(text, "<|channel>") ||
+        strstr(text, "<|think|>") || strstr(text, "<think>") ||
+        strstr(text, "</think>") || strstr(text, "<|channel>") ||
         strstr(text, "<eos>")) return;
 
     size_t tlen = strlen(text);
