
Commit a30b976

fix inference crashed on v100 with qwen3.5-0.8b (#4420)
1 parent 4c9e86c commit a30b976

2 files changed: 5 additions & 3 deletions


lmdeploy/turbomind/deploy/source_model/qwen.py

Lines changed: 3 additions & 1 deletion

@@ -240,7 +240,9 @@ def __init__(self, *args, **kwargs):
         self.attn_layer_prefix = 'model.language_model.layers'
         self.tok_embeddings_key = 'model.language_model.embed_tokens.weight'
         self.norm_weight_key = 'model.language_model.norm.weight'
-
+        tie_word_embeddings = self.model_cfg.get('tie_word_embeddings', False)
+        if tie_word_embeddings:
+            self.output_weight_key = self.tok_embeddings_key
         # ---- zero-centered RMSNorm: add 1 to weights during export ----

     def attn_norm(self, i: int):
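
The Python change handles checkpoints that tie their input and output embeddings: when tie_word_embeddings is set in the model config (as it presumably is for the small Qwen model named in the commit title), the checkpoint typically ships no separate lm_head tensor, so the exporter points the output-weight key at the token-embedding weight instead. Below is a minimal standalone sketch of that fallback, not the lmdeploy reader itself; the config path, default key name, and helper name are illustrative assumptions.

# Minimal sketch (assumption, not the lmdeploy reader): pick the tensor key
# used for the output projection, falling back to the embedding weight when
# the checkpoint ties word embeddings and ships no separate lm_head tensor.
import json
import os


def resolve_output_weight_key(model_dir: str) -> str:
    with open(os.path.join(model_dir, 'config.json')) as f:
        model_cfg = json.load(f)
    tok_embeddings_key = 'model.language_model.embed_tokens.weight'
    output_weight_key = 'lm_head.weight'  # hypothetical default key
    if model_cfg.get('tie_word_embeddings', False):
        # tied checkpoints reuse the embedding matrix as the LM head
        output_weight_key = tok_embeddings_key
    return output_weight_key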

src/turbomind/kernels/attention/kernel/decoding_sm70_256.cu

Lines changed: 2 additions & 2 deletions

@@ -12,9 +12,9 @@
 namespace turbomind::attention {

 constexpr int kHeadDim = 256;
-constexpr int kCTA_S = 64;
+constexpr int kCTA_S = 32;
 constexpr int kWARP_S = 16;
-constexpr int kStages = 3;
+constexpr int kStages = 2;

 // kH = Qh%3==0 ? 3 : (Qh%2==0 ? 2 : 1)
 // kH=1 covers Qh ∈ {1,5,7}, kH=2 covers {2,4,8}, kH=3 covers {3,6,9}
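
The CUDA change halves the key/value tile length (kCTA_S 64 → 32) and drops one pipeline stage (kStages 3 → 2) in the head-dim-256 decoding kernel for SM70. V100 allows at most 96 KiB of shared memory per block (only 48 KiB without an opt-in), noticeably less than newer architectures, so the larger tile plausibly failed to launch there. The back-of-envelope below is a rough estimate of my own, assuming the kernel stages kStages tiles of kCTA_S × kHeadDim half-precision elements for both K and V; the real layout may differ.

# Rough shared-memory estimate (assumption: kStages tiles of
# kCTA_S x kHeadDim half-precision elements staged for K and for V).
SIZEOF_HALF = 2  # bytes per half-precision element


def staging_bytes(cta_s: int, stages: int, head_dim: int = 256) -> int:
    per_operand = stages * cta_s * head_dim * SIZEOF_HALF
    return 2 * per_operand  # K and V


old_bytes = staging_bytes(cta_s=64, stages=3)  # 196608 B = 192 KiB
new_bytes = staging_bytes(cta_s=32, stages=2)  # 65536 B  = 64 KiB
print(old_bytes // 1024, new_bytes // 1024)    # 192 64

Under that assumption the old configuration would exceed V100's 96 KiB per-block cap while the new one fits, which is consistent with the crash described in the commit title.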
