Commit d07e809
S1: expose aggressive=True in Python API (4-bit + k512 window)

Model("model.gguf", aggressive=True) uses a 4-bit KV cache with a
512-token FP32 window. This is the attention-aware configuration with
the best quality/memory ratio on our measured Pareto frontier.

The full 2-bit aggressive mode (48% memory savings vs 4-bit at the same
quality) requires exposing uniform_2b as a new kv_compress value in
quant.h; that is tracked for the next round. Until then, aggressive mode
uses 4-bit with a wide FP32 window, which gives the best quality on our
curve (+0.6% PPL for 4-bit + k512, an interpolated value rather than a
direct measurement).
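
To make the memory side of the 4-bit + k512 trade-off concrete, here is a rough sketch of the average storage cost per KV element. It assumes the FP32 window covers the 512 most recent tokens and that all older tokens are stored at the low bit width; it also ignores per-block scale and zero-point overhead. The function name `avg_kv_bits` and its parameters are hypothetical and are not part of the project's API.

```python
def avg_kv_bits(context_len: int, window: int = 512,
                low_bits: int = 4, high_bits: int = 32) -> float:
    """Average bits per KV element for a mixed-precision cache.

    Assumption (not confirmed by the commit): the newest `window` tokens
    are kept at `high_bits`, everything older at `low_bits`.
    Quantization metadata (scales, zero points) is ignored.
    """
    if context_len <= 0:
        raise ValueError("context_len must be positive")
    high_tokens = min(context_len, window)   # tokens inside the FP32 window
    low_tokens = context_len - high_tokens   # tokens stored quantized
    total_bits = high_tokens * high_bits + low_tokens * low_bits
    return total_bits / context_len


if __name__ == "__main__":
    for n in (512, 4096, 32768):
        print(n, avg_kv_bits(n), avg_kv_bits(n, low_bits=2))
```

Under these assumptions the window dominates at short contexts (a 512-token context is stored entirely in FP32) and its cost is amortized as the context grows, which is why the relative saving from a lower bit width is larger for long contexts.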

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 8693521 commit d07e809
1 file changed
Lines changed: 20 additions & 7 deletions
Diff (line content not preserved in this capture; hunk ranges only):
@@ -193,16 +193,19 @@ 8 additions, 5 deletions
@@ -212,11 +215,21 @@ 12 additions, 2 deletions