
Commit b72f05f

unamedkr and claude committed
CLI default KV type: uniform_4b → turbo_kv_4b
Variant F validation across two models confirms that turbo_kv_4b beats uniform_4b at the same 4-bit budget on both (perplexity; change vs FP32 in parentheses):

  SmolLM2 135M (FP32 18.62):
    uniform_4b   20.33 (+9.2%)
    turbo_kv_4b  19.70 (+5.8%)  ✅ -3.1% PPL vs uniform_4b
  Llama 3.2 3B (FP32 13.56):
    uniform_4b   14.41 (+6.3%)
    turbo_kv_4b  14.28 (+5.3%)  ✅ -0.9% PPL vs uniform_4b

The smaller model shows the larger relative improvement, consistent with the finer codebook (16 levels vs 15) capturing more of the per-block distribution detail.

Switching the CLI default so users get the better quantization without having to know the type name. uniform_4b remains available via -k uniform_4b.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 61c0f54 commit b72f05f

1 file changed

Lines changed: 5 additions & 5 deletions


tools/quant.c

@@ -12,7 +12,7 @@
  * -P <top_p> Top-p nucleus sampling (default: 0.9)
  * -k <kv_type> KV cache type: fp32, uniform_4b, uniform_2b,
  * polar_3b, polar_4b, turbo_3b, turbo_4b,
- * turbo_kv_1b, turbo_kv_3b, turbo_kv_4b (default: uniform_4b)
+ * turbo_kv_1b, turbo_kv_3b, turbo_kv_4b (default: turbo_kv_4b)
  * -v <vq> Value cache quantization: q4 (4-bit), q2 (2-bit),
  * or fp16 (default: fp16 when -k is set, fp32 otherwise)
  * -j <threads> Number of threads for matmul (default: 4)
@@ -71,7 +71,7 @@ static void print_token(const char* text, void* user_data) {
 
 /* Parse KV type from string */
 static tq_type parse_kv_type(const char* s) {
-    if (!s) return TQ_TYPE_UNIFORM_4B;
+    if (!s) return TQ_TYPE_TURBO_KV_4B;
     if (strcmp(s, "fp32") == 0) return TQ_TYPE_COUNT; /* sentinel for FP32 */
     if (strcmp(s, "uniform_4b") == 0) return TQ_TYPE_UNIFORM_4B;
     if (strcmp(s, "uniform_2b") == 0) return TQ_TYPE_UNIFORM_2B;
@@ -85,8 +85,8 @@ static tq_type parse_kv_type(const char* s) {
     if (strcmp(s, "qjl_1b") == 0) return TQ_TYPE_QJL_1B;
     if (strcmp(s, "mixed_4b8") == 0) return TQ_TYPE_MIXED_4B8;
     if (strcmp(s, "uniform_3b") == 0) return TQ_TYPE_UNIFORM_3B;
-    fprintf(stderr, "Unknown KV type: %s (using uniform_4b)\n", s);
-    return TQ_TYPE_UNIFORM_4B;
+    fprintf(stderr, "Unknown KV type: %s (using turbo_kv_4b)\n", s);
+    return TQ_TYPE_TURBO_KV_4B;
 }
 
 #define QUANT_VERSION "0.2.0"
@@ -145,7 +145,7 @@ int main(int argc, char** argv) {
     int max_tokens = 256;
     float temperature = 0.7f;
     float top_p = 0.9f;
-    tq_type kv_type = TQ_TYPE_UNIFORM_4B;
+    tq_type kv_type = TQ_TYPE_TURBO_KV_4B;
     int n_threads = 4;
     int quant_mode = 0; /* 0 = none (default), 2 = Q2, 4 = Q4, 8 = Q8 */
     int value_quant_bits = 0; /* 0 = FP16/FP32 (default), 4 = Q4, 2 = Q2 */
