
Commit 7d10b63

unamedkr authored and claude committed
fix: support head_dim > TQ_BK (Qwen3.5 head_dim=256)
Qwen3.5 uses head_dim=256 while TQ_BK=128. Two fixes:

1. KV quantize: loop over TQ_BK-sized blocks per head instead of truncating to 128 elements. The remaining 128 dims were being zeroed, causing catastrophic PPL (1.6M → garbage output).

2. Fused attention: guard with head_dim <= TQ_BK. For head_dim=256, fall back to the dequant → FP32 attention path. This is slower but correct; fused multi-block attention is a future optimization.

Before: Qwen3.5 + turbo_kv_4b → PPL 1,663,480 (garbage)
After:  Qwen3.5 + turbo_kv_4b → coherent text at 12.2 tok/s

35/35 tests pass. Llama models (head_dim=128) unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
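For context, a minimal standalone sketch of the failure mode and the blocked fix. Everything below is illustrative: quantize_block is a hypothetical stand-in for traits->quantize, and the toy 1-byte-per-value format stands in for the real quantized block format. Only TQ_BK=128, head_dim=256, and the loop structure come from the commit itself.

#include <stdint.h>
#include <stdio.h>

#define TQ_BK       128     /* quantization block size (from the commit) */
#define HEAD_DIM    256     /* Qwen3.5 head dimension (from the commit) */
#define BLOCK_BYTES TQ_BK   /* toy format: 1 byte per value per block */

/* Hypothetical stand-in for traits->quantize: consumes at most one
 * TQ_BK-sized block per call, which is why a single call with n=256
 * quantized only dims 0..127 and left dims 128..255 as zeros. */
static void quantize_block(const float* src, uint8_t* dst, int n) {
    if (n > TQ_BK) n = TQ_BK;                 /* the old silent truncation */
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[i] * 255.0f);  /* toy 8-bit quantization */
}

int main(void) {
    float key[HEAD_DIM];
    uint8_t cache[((HEAD_DIM + TQ_BK - 1) / TQ_BK) * BLOCK_BYTES] = {0};
    for (int i = 0; i < HEAD_DIM; i++) key[i] = (float)i / HEAD_DIM;

    /* Buggy path: one call for the whole head; upper half never written. */
    quantize_block(key, cache, HEAD_DIM);
    printf("dim 200 after single call:  %d (stuck at zero)\n", (int)cache[200]);

    /* Fixed path, mirroring the commit: one call per TQ_BK block, with
     * the destination advanced by one block's storage per iteration. */
    for (int blk = 0; blk < HEAD_DIM; blk += TQ_BK) {
        int blen = HEAD_DIM - blk;
        if (blen > TQ_BK) blen = TQ_BK;
        quantize_block(key + blk, cache + (blk / TQ_BK) * BLOCK_BYTES, blen);
    }
    printf("dim 200 after blocked loop: %d\n", (int)cache[200]);
    return 0;
}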
1 parent bc44a6a commit 7d10b63

1 file changed (12 additions, 4 deletions): src/engine/tq_transformer.c
@@ -1440,8 +1440,16 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
                 traits->quantize(delta_buf, quant_dst, head_dim);
             }
         } else {
-            /* Non-delta mode: quantize absolute key */
-            traits->quantize(key_src, quant_dst, head_dim);
+            /* Non-delta mode: quantize absolute key.
+             * For head_dim > TQ_BK (e.g. Qwen3.5 head_dim=256),
+             * process multiple TQ_BK-sized blocks per head. */
+            for (int blk = 0; blk < head_dim; blk += TQ_BK) {
+                int blen = head_dim - blk;
+                if (blen > TQ_BK) blen = TQ_BK;
+                traits->quantize(key_src + blk,
+                                 quant_dst + (blk / TQ_BK) * traits->type_size,
+                                 blen);
+            }
         }
     }
 }
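Note the destination arithmetic in the new loop: block i of a head lands at quant_dst + i * traits->type_size, which implies (inferring from the diff alone, not the full source) that traits->type_size is the quantized storage size of one TQ_BK block, so a 256-dim head now occupies two consecutive blocks in the key cache. The blen clamp also keeps the loop correct for head dims that are not exact multiples of TQ_BK.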
@@ -1508,7 +1516,7 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
         float* atth = s->att + (size_t)h * c->max_seq_len;
         int kv_h = h / kv_mul;
 
-        if (use_int_attn && seq_len > int_attn_threshold) {
+        if (use_int_attn && seq_len > int_attn_threshold && head_dim <= TQ_BK) {
             /* Integer Q4xQ8 attention path.
              * Gather quantized key blocks for this KV head across all positions
              * into a contiguous buffer, then call the traits attention function.
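With this guard, a head_dim=256 model never enters the integer Q4xQ8 scoring path: the condition fails for every position, and execution falls through to the existing dequantize → FP32 attention branch, the "slower but correct" fallback described in the commit message.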
@@ -1659,7 +1667,7 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
          * contiguous block, which is cache-efficient.
          */
         if (!needs_post_norm && !k_hr_active && traits->attention != NULL
-            && attn_start == 0) {
+            && attn_start == 0 && head_dim <= TQ_BK) {
             size_t head_block_bytes = s->quant_head_stride;
             size_t pos_stride_bytes = (size_t)cache_n_kv_heads * head_block_bytes;
             uint8_t* layer_base = (uint8_t*)s->quant_key_cache
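The same guard is applied to the fused fast path here, since its gather stride math (quant_head_stride, cache_n_kv_heads * head_block_bytes) appears to assume one quantized block per head per position. Teaching this path to gather multiple blocks per head is the "fused multi-block attention" future optimization the commit message mentions.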
