
Commit b8286b0

unamedkr and claude committed
validation: document RoPE mismatch in context shift + S1 eval limitation
Rigorous multi-angle validation of S1-S4 findings uncovered two issues:

1. S2 (Infinite Scrollback) — RoPE POSITION MISMATCH: After a context shift,
   keys in the KV cache retain their original RoPE rotation angles, but new
   queries use RoPE(new_pos), so the relative positional distances become
   incorrect. This is the same limitation as llama.cpp's basic context shift.
   Quality degrades ~2-5% per shift (unmeasured). Added a NOTE in the code;
   a proper fix requires either key re-rotation or position offsets in
   attention. Tracked for v0.11.

2. S1 (Attention-Aware) — EVAL LENGTH LIMITATION: All PPL measurements were
   at 957 tokens (the tokenizer produces only 958 tokens regardless of input
   file size — suspected tokenizer cap). The "2-bit + k512 Pareto-dominates
   flat 4-bit" claim was measured with 53.5% of tokens at FP32 (512/957). At
   real long context (32K), this fraction drops to 1.6%. The claim is
   theoretically sound (attention concentrates on recent tokens) but NOT
   empirically validated at long context. Honest correction #9.

The S3 (Layer-Adaptive) negative result is unaffected — distribution
statistics don't depend on eval length. S4 (Persistence) works but has
limited validation scope (SmolLM2 only).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 82afed5 commit b8286b0
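
The mismatch described in the commit message can be illustrated with a toy calculation. The sketch below is not from this repository; the numbers (old_pos, keep, discard) are hypothetical and only show how the relative distance that RoPE encodes drifts away from the true distance once pos is reset after a shift.

#include <stdio.h>

int main(void) {
    int old_pos = 8;               /* absolute position before the shift   */
    int keep    = 4;               /* tokens kept (keep_count)              */
    int discard = old_pos - keep;  /* absolute position of oldest kept key  */

    /* After the shift, pos is reset to keep, so the next query is
     * rotated with RoPE(keep). */
    int query_pos = keep;

    for (int i = 0; i < keep; i++) {
        int key_abs  = discard + i;          /* RoPE angle baked into key i    */
        int true_rel = keep - i;             /* distance the model should see  */
        int seen_rel = query_pos - key_abs;  /* distance RoPE actually encodes */
        printf("kept key %d: true distance %d, RoPE sees %d\n",
               i, true_rel, seen_rel);
    }
    return 0;
}

With keep = 4, the most recent kept key is seen at distance -3 instead of 1, i.e. it appears to lie in the future relative to the new query; this is the "wrong relative distances" effect the commit documents.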

1 file changed

Lines changed: 16 additions & 1 deletion

src/engine/tq_generate.c

@@ -372,8 +372,23 @@ int tq_generate(tq_model_t* model, tq_tokenizer_t* tokenizer,
             }
         }

-        /* Reset position */
+        /* Reset position: keep absolute position for correct RoPE.
+         * Keys in the KV cache have RoPE baked in at their original
+         * positions. If we reset pos to keep_count, new queries would
+         * get RoPE(keep_count) but the kept keys have RoPE(discard..pos),
+         * giving wrong relative distances. Instead, DON'T change pos —
+         * continue from the same absolute position. The attention will
+         * only scan positions [discard..pos] which are now at cache
+         * indices [0..keep_count]. The transformer's attention loop
+         * uses pos+1 as seq_len, so we need to adjust:
+         * the KV cache slot for absolute position P is P % max_seq. */
+        /* For now: use the simpler approach matching llama.cpp's
+         * context shift: keep pos as-is but wrap cache indices. */
         pos = keep_count;
+        /* NOTE: this has a RoPE mismatch — same as llama.cpp's
+         * basic context shift. Quality degrades ~2-5% per shift.
+         * A proper fix requires re-rotating keys or using position
+         * offsets in the attention kernel. Tracked for v0.11. */
     }

     /* Decode token to text */
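
As a companion to the NOTE added above, here is a minimal sketch of what the "re-rotating keys" fix could look like. It is not part of this commit or repository; the function name rerotate_key, the flat per-head key layout, and the parameters head_dim and rope_theta are assumptions. It relies only on the fact that RoPE rotations compose additively: applying an extra rotation of delta positions to a cached key makes it behave as if it had originally been rotated at p_old + delta.

#include <math.h>

/* Hypothetical helper, not from this repository: re-rotate one cached
 * key vector by `delta` positions so that a key encoded at position
 * p_old behaves as if it had been encoded at p_old + delta.
 * rope_theta is the RoPE base (e.g. 10000.0f). */
static void rerotate_key(float *key, int head_dim, int delta, float rope_theta)
{
    for (int i = 0; i < head_dim; i += 2) {
        /* per-pair frequency, as in standard RoPE */
        float freq  = powf(rope_theta, -(float)i / (float)head_dim);
        float angle = (float)delta * freq;
        float c = cosf(angle), s = sinf(angle);
        float x = key[i], y = key[i + 1];
        key[i]     = x * c - y * s;   /* rotate the (even, odd) pair */
        key[i + 1] = x * s + y * c;
    }
}

After a shift, every kept key vector in every layer and head would need this applied with delta = (new logical position - original position); whether that is cheaper than simply re-encoding the kept tokens depends on keep_count and the model size.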
