Commit 296886b
fix(rlv): KV cache pre-build now saves full prefill (#83)
Fixed: use quant_chat() instead of quant_generate() for prefill.
quant_generate() creates and discards internal state (1 token saved).
quant_chat() preserves KV state in ctx (57+ tokens saved).
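
The difference between the two calls can be pictured with a small toy model. This is only a sketch under stated assumptions: `ToyCtx`, `toy_generate`, `toy_chat`, and `toy_save` are hypothetical names invented for illustration and do not reflect the real quant.h API or its internals. The toy reports 0 saved tokens for the discarding path, not the 1 seen in the commit's log, since the exact count depends on internals not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCtx:
    """Hypothetical caller-owned context; not quant.h's actual struct."""
    kv: list = field(default_factory=list)  # one entry per cached token


def toy_generate(ctx: ToyCtx, tokens: list) -> None:
    """Models a call that prefills into private state and drops it on return."""
    scratch = list(tokens)  # internal cache, never copied into ctx
    del scratch


def toy_chat(ctx: ToyCtx, tokens: list) -> None:
    """Models a call that prefills directly into the caller's context."""
    ctx.kv.extend(tokens)


def toy_save(ctx: ToyCtx) -> int:
    """Returns how many cached tokens a save of this context would write."""
    return len(ctx.kv)


chunk = list(range(57))  # 57 tokens, matching the count in the commit log

discarded = ToyCtx()
toy_generate(discarded, chunk)

preserved = ToyCtx()
toy_chat(preserved, chunk)

print(toy_save(discarded), toy_save(preserved))  # prints "0 57"
```

Only the path that writes into the caller's context leaves anything for a later save to persist, which is the behavior the fix relies on.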
Results:
Before: "saved 1 tokens" → load_context has empty context
After: "saved 57 tokens" → load_context restores full chunk
Speed: 4.5s per query (vs 15s regular lookup) ✅
Remaining issue: asking a question through quant_chat after load_context
produces inaccurate answers. Root cause: quant_chat's text-prefix matching
does not recognize the loaded context, because cached_text is empty after
load_context. This needs an internal fix in quant.h to sync cached_text
with the loaded KV state.
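
The mismatch described above can be illustrated with a toy model of prefix matching. This is a sketch under assumptions, not quant.h's code: `ToyCtx`, `toy_load_context`, `toy_load_context_synced`, and `reusable_prefix_len` are hypothetical names, and characters stand in for tokens to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class ToyCtx:
    """Hypothetical context: the text the cache claims to cover, plus KV size."""
    cached_text: str = ""
    kv_len: int = 0  # number of entries present in the (toy) KV cache


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest shared prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def toy_load_context(ctx: ToyCtx, saved_kv_len: int) -> None:
    """Restores KV state only; cached_text stays empty (the reported issue)."""
    ctx.kv_len = saved_kv_len


def toy_load_context_synced(ctx: ToyCtx, saved_kv_len: int, saved_text: str) -> None:
    """Restores KV state and the text it corresponds to (the proposed sync)."""
    ctx.kv_len = saved_kv_len
    ctx.cached_text = saved_text


def reusable_prefix_len(ctx: ToyCtx, prompt: str) -> int:
    """How much of the prompt the matcher believes is already cached."""
    return common_prefix_len(ctx.cached_text, prompt)


chunk = "Some pre-built chunk text. "
prompt = chunk + "What does the chunk say?"

buggy = ToyCtx()
toy_load_context(buggy, len(chunk))

synced = ToyCtx()
toy_load_context_synced(synced, len(chunk), chunk)

print(reusable_prefix_len(buggy, prompt), buggy.kv_len)
print(reusable_prefix_len(synced, prompt), synced.kv_len)
```

In the unsynced case the matcher sees no reusable prefix even though the KV cache is populated, so the two views of the context disagree. Restoring the text alongside the KV state brings them back in line.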
Refs #83
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d9da684 commit 296886b
1 file changed
Lines changed: 12 additions & 7 deletions
| Hunk | Original lines | New lines | Additions | Deletions |
|---|---|---|---|---|
| 1 | 116-132 | 116-136 | 8 | 4 |
| 2 | 182-190 | 186-195 | 4 | 3 |