Commit baabe82
feat(prefill): batched enabled by default — 7.2× end-to-end on default KV mode
The final missing piece: batched prefill now populates quant_key_cache
(via traits->quantize per kv-head per block) and the k_highres_fp32
circular buffer when either is active. This matches baseline's
self_attn_forward K-cache write logic.
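The circular-buffer write for a whole prefill batch can be sketched as follows. This is a minimal illustration, not the repository's code: the `highres_ring_t` struct, its `window` and `kv_dim` fields, and the `ring_write_batch` helper are hypothetical; only the `k_highres_fp32` name comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ring buffer holding the most recent `window` K rows in FP32. */
typedef struct {
    float *k_highres_fp32; /* [window][kv_dim], one layer */
    size_t window;         /* number of positions kept at full precision */
    size_t kv_dim;         /* n_kv_heads * head_dim */
} highres_ring_t;

/* Copy K rows for positions start_pos .. start_pos + n - 1 into the ring. */
static void ring_write_batch(highres_ring_t *r, size_t start_pos,
                             const float *k_batch, size_t n)
{
    assert(r->window > 0);
    /* Rows older than the last `window` positions would be overwritten
       later in the same batch, so they are skipped. */
    size_t first = n > r->window ? n - r->window : 0;
    for (size_t i = first; i < n; i++) {
        size_t slot = (start_pos + i) % r->window;
        memcpy(r->k_highres_fp32 + slot * r->kv_dim,
               k_batch + i * r->kv_dim,
               r->kv_dim * sizeof(float));
    }
}
```

After the batch, the ring holds the same rows a per-token loop would have left behind, which is the property a batched path needs to match a per-token baseline.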
The batched path also keeps its unconditional FP32 write to s->key_cache,
so its own attention loop (which reads FP32 K) works regardless of KV
quant mode. The extra memory is the same size as the already-allocated
cache (trivial on modern systems).
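A minimal sketch of the combined K write (FP32 always, quantized per kv-head per block when a quantized cache is active). The struct layouts, `write_k_row`, and the toy `toy_q8_quantize` block format (one FP32 scale followed by int8 values) are assumptions for illustration; only `s->key_cache`, `quant_key_cache`, and `traits->quantize` are names taken from the commit message, and their real signatures may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the project's quantization traits and state. */
typedef struct {
    size_t block_size;  /* floats per quantized block */
    size_t block_bytes; /* bytes per quantized block */
    void (*quantize)(const float *src, void *dst, size_t n);
} quant_traits_t;

typedef struct {
    float   *key_cache;       /* FP32 K, [max_seq][kv_dim] */
    uint8_t *quant_key_cache; /* NULL when KV quantization is off */
    size_t   n_kv_heads;
    size_t   head_dim;
} kv_state_t;

/* Toy block format: FP32 scale, then n int8 values (symmetric, amax/127). */
static void toy_q8_quantize(const float *src, void *dst, size_t n)
{
    float amax = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = src[i] < 0.0f ? -src[i] : src[i];
        if (a > amax) amax = a;
    }
    float scale = amax / 127.0f;
    float inv = amax > 0.0f ? 127.0f / amax : 0.0f;
    memcpy(dst, &scale, sizeof scale);
    int8_t *q = (int8_t *)dst + sizeof scale;
    for (size_t i = 0; i < n; i++) {
        float v = src[i] * inv;
        q[i] = (int8_t)(v + (v >= 0.0f ? 0.5f : -0.5f)); /* round to nearest */
    }
}

/* Write one position's K row into both caches. */
static void write_k_row(kv_state_t *s, const quant_traits_t *traits,
                        size_t pos, const float *k)
{
    size_t kv_dim = s->n_kv_heads * s->head_dim;

    /* Unconditional FP32 write: the batched attention loop reads this. */
    memcpy(s->key_cache + pos * kv_dim, k, kv_dim * sizeof(float));

    if (s->quant_key_cache == NULL || traits == NULL)
        return; /* KV quantization inactive */

    assert(s->head_dim % traits->block_size == 0);
    size_t blocks_per_head = s->head_dim / traits->block_size;
    size_t row_bytes = s->n_kv_heads * blocks_per_head * traits->block_bytes;
    uint8_t *dst_row = s->quant_key_cache + pos * row_bytes;

    /* Quantize per kv-head, per block. */
    for (size_t h = 0; h < s->n_kv_heads; h++) {
        for (size_t b = 0; b < blocks_per_head; b++) {
            const float *src = k + h * s->head_dim + b * traits->block_size;
            uint8_t *dst = dst_row + (h * blocks_per_head + b) * traits->block_bytes;
            traits->quantize(src, dst, traits->block_size);
        }
    }
}
```

In a batched prefill, a helper like this would be called once per prompt position, so both caches end up in the state the per-token path would have produced.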
Result: batched prefill auto-activates on ALL supported Llama-family
models under default settings; `-k fp32` is no longer required.
Measured on Apple M1 Pro, 8 threads, ~250-token prompt:
Llama-3.2-1B Q8 (default KV): 42.7s → 5.9s (**7.2× end-to-end**)
Llama-3.2-3B Q8 (default KV): (similar ratio expected)
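As a quick check of the headline figure, assuming both timings are wall-clock seconds for the same prompt and generation length:

```latex
\[
\text{speedup} = \frac{42.7\,\text{s}}{5.9\,\text{s}} \approx 7.24,
\qquad
\text{time saved} = 1 - \frac{5.9}{42.7} \approx 86\%.
\]
```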
Output verified bit-identical to per-token baseline on 4 varied prompts
("Tell me about", "The capital of France", "Hello", "Write a story") and
on Llama-3.2-3B Q8. 11/11 STRICT + 6/6 long-seq tests pass.
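One way such a bit-identical comparison might be done, assuming the compared outputs are FP32 buffers (for example logits) from the two paths. The helper name `bit_identical_f32` is hypothetical and is not taken from the repository's test suite.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True only if both buffers match byte for byte. Unlike ==, this also
   distinguishes +0.0f from -0.0f and treats identical NaN bit patterns
   as equal, which is what "bit-identical" requires. */
static bool bit_identical_f32(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, n * sizeof(float)) == 0;
}
```

A tolerance-based comparison would hide small reordering differences in floating-point sums; a byte-wise check does not.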
Batched prefill bail-out conditions are now:
- non-standard architecture (MoE, Gemma4, Phi-3 fused QKV, DeltaNet)
- delta KV compression (I/P-frame coding)
- quantized V cache (value_quant_bits > 0)
- kv_shared layers
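The bail-out list above can be read as a single predicate. The sketch below is an assumption about shape only: the `arch_t` enum, the `prefill_cfg_t` struct, and every field name except `value_quant_bits` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical architecture tags; the real project may classify differently. */
typedef enum {
    ARCH_LLAMA,
    ARCH_MOE,
    ARCH_GEMMA4,
    ARCH_PHI3_FUSED_QKV,
    ARCH_DELTANET
} arch_t;

typedef struct {
    arch_t arch;
    bool   delta_kv;             /* delta KV compression (I/P-frame coding) */
    int    value_quant_bits;     /* > 0 means the V cache is quantized */
    bool   has_kv_shared_layers; /* any layer shares KV with another */
} prefill_cfg_t;

/* Returns true when none of the four bail-out conditions applies. */
static bool batched_prefill_allowed(const prefill_cfg_t *c)
{
    assert(c != NULL);
    if (c->arch != ARCH_LLAMA)     return false; /* non-standard architecture */
    if (c->delta_kv)               return false; /* delta KV compression */
    if (c->value_quant_bits > 0)   return false; /* quantized V cache */
    if (c->has_kv_shared_layers)   return false; /* kv_shared layers */
    return true;
}
```

Note that a quantized K cache is deliberately absent from the checks, since the commit makes the batched path handle it.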
README v3.2 section updated with new numbers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 59ec23c, commit baabe82
3 files changed
Lines changed: 44 additions & 11 deletions