
Commit 296886b

unamedkr and claude committed
fix(rlv): KV cache pre-build now saves full prefill (#83)
Fixed: use quant_chat() instead of quant_generate() for prefill.
quant_generate() creates and discards internal state (1 token saved);
quant_chat() preserves KV state in ctx (57+ tokens saved).

Results:
- Before: "saved 1 tokens" → load_context has empty context
- After: "saved 57 tokens" → load_context restores full chunk
- Speed: 4.5s per query (vs 15s regular lookup) ✅

Remaining issue: loaded context + quant_chat question generates inaccurate
answers. Root cause: quant_chat's text-prefix matching doesn't recognize
loaded context (cached_text is empty after load_context). Needs a quant.h
internal fix to sync cached_text with the loaded KV state.

Refs #83

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
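For orientation, here is a minimal sketch of the load-and-query path the message describes, not the repository's exact code. quant_load_context is an assumed name mirroring the quant_save_context call in the diff, a zero return code for success is an assumption, and _lib, _model, _QuantConfig, _ON_TOKEN are taken to be set up as in kv_cache.py.

```python
# Minimal sketch (not the repo's exact code) of the cached-lookup path the
# commit message describes. Assumptions: quant_load_context mirrors the
# quant_save_context call shown in the diff, 0 means success, and _lib,
# _model, _QuantConfig, _ON_TOKEN are defined as in kv_cache.py.
import ctypes

def lookup_from_cache(cache_file: str, question: str):
    cfg = _QuantConfig()
    cfg.max_tokens = 128  # room for a one-sentence answer
    ctx = _lib.quant_new(_model, ctypes.byref(cfg))
    if not ctx:
        return None
    if _lib.quant_load_context(ctx, cache_file.encode()) != 0:
        _lib.quant_free_ctx(ctx)
        return None
    # Only the question is sent; the chunk tokens are already in the restored
    # KV state. Known issue per this commit: quant_chat's text-prefix matching
    # reads cached_text, which is empty after a load, so answers can be
    # inaccurate until quant.h syncs cached_text with the loaded KV state.
    prompt = f"\n\nBased on the text above, answer this question in one sentence.\nQuestion: {question}\nAnswer:"
    tokens = []
    cb = _ON_TOKEN(lambda text_ptr, ud: tokens.append(text_ptr) if text_ptr else None)
    _lib.quant_chat(ctx, prompt.encode("utf-8"), cb, None)
    _lib.quant_free_ctx(ctx)
    return b"".join(tokens).decode("utf-8", errors="replace").strip()
```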
1 parent d9da684 commit 296886b

1 file changed: bench/rlv/stages/kv_cache.py (12 additions & 7 deletions)
@@ -116,17 +116,21 @@ def build_kv_cache(
         cfg = _QuantConfig()
         cfg.temperature = 0.0
         cfg.top_p = 1.0
-        cfg.max_tokens = 1  # prefill only, generate 1 token
+        cfg.max_tokens = 1  # generate 1 token after prefill
         cfg.n_threads = n_threads

         ctx = _lib.quant_new(_model, ctypes.byref(cfg))
         if not ctx:
             continue

-        # Prefill: process chunk text through the model
-        _lib.quant_generate(ctx, text.encode("utf-8"), null_cb, None)
+        # Use quant_chat (NOT quant_generate) for prefill.
+        # quant_generate creates internal state and discards it.
+        # quant_chat preserves KV state in ctx → save_context works.
+        _lib.quant_chat.argtypes = [ctypes.c_void_p, ctypes.c_char_p, _ON_TOKEN, ctypes.c_void_p]
+        _lib.quant_chat.restype = ctypes.c_int
+        _lib.quant_chat(ctx, text.encode("utf-8"), null_cb, None)

-        # Save KV state
+        # Save KV state — now contains ALL prefill tokens
         rc = _lib.quant_save_context(ctx, cache_file.encode())
         _lib.quant_free_ctx(ctx)

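One note on the new binding: the diff assigns quant_chat.argtypes inside build_kv_cache, so the declaration re-runs on every chunk. A common ctypes pattern is to declare signatures once at module import; a sketch using the same signature as the diff (the None return type for the token callback is an assumption):

```python
import ctypes

# Callback type as used in the diff's argtypes; the None return is an assumption.
_ON_TOKEN = ctypes.CFUNCTYPE(None, ctypes.c_char_p, ctypes.c_void_p)

def _bind_quant_chat(lib: ctypes.CDLL) -> None:
    # Declare once at module import; repeating it per call is harmless but redundant.
    lib.quant_chat.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                               _ON_TOKEN, ctypes.c_void_p]
    lib.quant_chat.restype = ctypes.c_int
```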
@@ -182,9 +186,10 @@ def lookup_with_cache(
         _lib.quant_free_ctx(ctx)
         return None

-    # Use quant_chat to APPEND question to existing KV cache
-    # (quant_generate would RESET the cache)
-    prompt = f"\nQuestion: {question}\nIf the text above answers this question, reply ANSWER: <answer>. If not, reply NONE."
+    # Append question to existing KV context via quant_chat.
+    # The chunk text is already in the KV cache from prefill.
+    # We only send the question — no need to repeat the document.
+    prompt = f"\n\nBased on the text above, answer this question in one sentence.\nQuestion: {question}\nAnswer:"
     tokens = []

     def on_token(text_ptr, ud):
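The hunk is truncated at on_token. For readers following along, a plausible shape of the callback and the decode step, assuming _ON_TOKEN is a CFUNCTYPE over (c_char_p, c_void_p) as the argtypes in the first hunk imply:

```python
# Plausible completion (assumption, not the repo's code): collect token
# text from the C callback, then join into the final answer string.
def on_token(text_ptr, ud):
    if text_ptr:  # ctypes delivers c_char_p arguments as bytes
        tokens.append(text_ptr)

cb = _ON_TOKEN(on_token)  # keep a reference so the callback isn't GC'd
_lib.quant_chat(ctx, prompt.encode("utf-8"), cb, None)
answer = b"".join(tokens).decode("utf-8", errors="replace").strip()
_lib.quant_free_ctx(ctx)
```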
