2 changes: 1 addition & 1 deletion .github/configs/amd-master.yaml
@@ -1952,7 +1952,7 @@ dsv4-fp4-mi355x-sglang-mtp:
# gpu-mem-util=0.6. TP8 sweeps conc 4-64; DEP8 has a single conc=64
# probe to validate the ROCm DP+EP path.
dsv4-fp4-mi355x-vllm:
- image: vllm/vllm-openai-rocm:v0.22.0
+ image: vllm/vllm-openai-rocm:nightly-09663abde0f50944a8d5ea30120666024b503faa
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: mi355x
13 changes: 7 additions & 6 deletions benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh
@@ -12,11 +12,11 @@ set -eo pipefail
# same ROCm recipe while switching parallelism to vLLM's DP+EP form.
# Image-pin details live in amd-master.yaml.
#
- # --moe-backend triton_unfused is required for the FP4 MoE expert
- # weight format used by deepseek-ai/DeepSeek-V4-Pro. Letting --moe-backend
- # default to auto picks a backend that doesn't register the FP4 scale
- # parameters (w13_weight_scale / w2_weight_scale), so safetensors
- # loading raises KeyError.
+ # Use the AITER MoE backend (VLLM_ROCM_USE_AITER_MOE=1 + --moe-backend aiter)
+ # for the FP4 MoE expert weights of deepseek-ai/DeepSeek-V4-Pro. The AITER
+ # MXFP4 path registers the FP4 scale parameters (w13_weight_scale /
+ # w2_weight_scale), so safetensors loads correctly and decode runs on the
+ # fused AITER experts instead of triton_unfused.
#
# --compilation-config mode=3 with FULL_AND_PIECEWISE cudagraph mode
# enables full CUDA graph capture for improved throughput on MI355X.
@@ -45,6 +45,7 @@ if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
fi

export VLLM_ROCM_USE_AITER=1
+ export VLLM_ROCM_USE_AITER_MOE=1

SERVER_LOG=/workspace/server.log

@@ -75,7 +76,7 @@ vllm serve $MODEL --port $PORT \
--gpu-memory-utilization 0.8 \
--kv-cache-dtype fp8 \
--trust-remote-code \
- --moe-backend triton_unfused \
+ --moe-backend aiter \
--tokenizer-mode deepseek_v4 \
--reasoning-parser deepseek_v4 \
--compilation-config '{"mode":3,"cudagraph_mode":"FULL_AND_PIECEWISE"}' > $SERVER_LOG 2>&1 &
9 changes: 9 additions & 0 deletions perf-changelog.yaml
@@ -4407,3 +4407,12 @@
description:
- "Bump SGLang image from lmsysorg/sglang:deepseek-v4-blackwell (digest sha256:df18bfc4...) to mainline nightly lmsysorg/sglang:nightly-dev-cu13-20260628-da802ddc."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1923

+ - config-keys:
+ - dsv4-fp4-mi355x-vllm
+ description:
+ - "Bump DeepSeek-V4-Pro FP4 MI355X single-node vLLM STP image from vllm/vllm-openai-rocm:v0.22.0 to the latest nightly vllm/vllm-openai-rocm:nightly-09663abde0f50944a8d5ea30120666024b503faa."
+ - "The nightly enables two-stage attention kernels (split-KV decode), which reduce decode attention latency across all concurrency levels."
+ - "Employ the AITER MLA attention backend for the DeepSeek-V4 MLA path."
+ - "Switch the MoE backend from triton_unfused to AITER MoE (VLLM_ROCM_USE_AITER_MOE=1 + --moe-backend aiter) for the FP4 experts."
+ pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1980
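
Reviewer note: the MoE switch depends on two settings landing together, the `VLLM_ROCM_USE_AITER_MOE=1` export and the `--moe-backend aiter` flag. A minimal sketch of a pre-flight check is below. The helper name `check_aiter_moe` is hypothetical and not part of this PR; it only greps the benchmark script text and does not query vLLM itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): confirm that a benchmark script
# carries both AITER MoE settings and no longer passes triton_unfused.
# Comment lines are ignored so the explanatory header cannot cause a match.
check_aiter_moe() {
    local script="$1"
    local body
    # Drop comment-only lines; fail if nothing is left or the file is unreadable.
    body=$(grep -v '^[[:space:]]*#' "$script") || return 1
    # The environment variable must be exported with value 1.
    grep -qE '^[[:space:]]*export VLLM_ROCM_USE_AITER_MOE=1[[:space:]]*$' <<<"$body" || return 1
    # The serve command must select the aiter MoE backend.
    grep -qE -- '--moe-backend[ =]aiter([[:space:]]|\\|$)' <<<"$body" || return 1
    # The old backend must not be passed anywhere in the script.
    ! grep -qE -- '--moe-backend[ =]triton_unfused' <<<"$body"
}
```

For example, `check_aiter_moe benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh && echo "AITER MoE settings present"` could run before a sweep. A text check like this catches a half-applied change but says nothing about whether the pinned image actually supports the backend.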