feat: quant_generate continues from loaded KV cache (#83)
quant_generate now detects loaded KV state (n_ctx_tokens > 0) and
prefills new prompt tokens starting after the loaded context instead
of resetting to position 0.
Implementation:

    if (ctx->n_ctx_tokens > 0 && ctx->state != NULL) {
        // Continue: prefill new tokens at positions [n_ctx_tokens, ...)
        // and generate from there
    } else {
        // Standard path: fresh state via tq_generate
    }
Speed verified: 4.2-4.5s per query (vs 15s without cache).
KV round-trip: 57 tokens saved and restored correctly.
Known issue: answers are related but imprecise (e.g., "847M" → "$10M").
Hypothesis: KV cache precision loss during the save/load round trip (FP32 → file → FP32),
or a LongRoPE position discontinuity at the save/load boundary.
save_context's numerical fidelity needs investigation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>