2 changes: 1 addition & 1 deletion .github/configs/amd-master.yaml
@@ -1979,7 +1979,7 @@ dsv4-fp4-mi355x-vllm:
# above ~conc32 (-37% @ conc32). Image reuses the base entry's v0.22.0 ROCm
# build, which already contains the MTP commit.
dsv4-fp4-mi355x-vllm-mtp:
- image: vllm/vllm-openai-rocm:v0.22.0
+ image: vllm/vllm-openai-rocm:nightly-09663abde0f50944a8d5ea30120666024b503faa

🟡 The block comment immediately above this entry (lines 1978-1980) still reads "Image reuses the base entry's v0.22.0 ROCm build, which already contains the MTP commit." With this bump, the MTP variant is now on a nightly while the base entry dsv4-fp4-mi355x-vllm stays on v0.22.0, so that rationale is stale. Consider replacing those two sentences with a note about the intentional divergence and the new rationale (two-stage attention kernels + AITER MLA) already documented in the PR description and perf-changelog.

Extended reasoning...

What's stale. The trailing sentences of the block comment at .github/configs/amd-master.yaml:1978-1980 claim:

> Image reuses the base entry's v0.22.0 ROCm build, which already contains the MTP commit.

That rationale explained why the two entries could share an image tag. It no longer holds.

Step-by-step proof of the divergence.

1. Base entry dsv4-fp4-mi355x-vllm at line 1955 still pins image: vllm/vllm-openai-rocm:v0.22.0 (unchanged by this PR).
2. This PR changes the MTP variant at line 1982 from vllm/vllm-openai-rocm:v0.22.0 to vllm/vllm-openai-rocm:nightly-09663abde0f50944a8d5ea30120666024b503faa.
3. Therefore the two image strings now differ, and "reuses the base entry's v0.22.0 ROCm build" is factually wrong.

Why the existing wording will mislead. A future reader landing on this recipe will read the block comment, see "reuses the base entry's v0.22.0 ROCm build," and assume the two entries track the same image; for example, when doing a future bump they might touch only one entry and expect the other to follow. The PR description already spells out the real reason for the bump (nightly enables two-stage attention kernels / split-KV decode and the AITER MLA backend for the DSv4 MLA path), and the perf-changelog entry restates it. That rationale belongs in the inline comment now that the images have diverged.

Impact. Documentation-only: no functional change, sweep behavior is unaffected. Filing as nit since it's worth fixing while the change is fresh (the author has the context right now) but does not need to block merge.

Suggested fix. Replace the trailing two sentences of the comment (roughly lines 1978-1980) with something like:

> Previously reused the base entry's v0.22.0 image; bumped to a nightly to pick up two-stage attention kernels (split-KV decode) and the AITER MLA backend for the DSv4 MLA path. Base entry stays pinned to v0.22.0 intentionally.
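
To make the divergence described above easy to confirm, here is a minimal sketch that prints the image pinned under each config key and reports whether two entries differ. It is not part of this PR. The helper names `image_for_key` and `images_diverge` are hypothetical, and the parsing assumes the layout implied by the diff: each entry key sits at column 0 and its `image:` line is indented beneath it.

```shell
# Hypothetical helper: print the image pinned under a top-level config key.
# Assumes "key:" at column 0 followed by an indented "image: <tag>" line.
image_for_key() {
  local key="$1" file="$2"
  awk -v key="$key" '
    # A non-comment line starting at column 0 opens a new top-level entry.
    /^[^ \t#]/ { in_entry = ($0 == (key ":")); next }
    # Inside the requested entry, report the first image line and stop.
    in_entry && $1 == "image:" { print $2; exit }
  ' "$file"
}

# Hypothetical helper: succeed (exit 0) when two entries pin different images.
images_diverge() {
  local file="$1" base="$2" variant="$3"
  [ "$(image_for_key "$base" "$file")" != "$(image_for_key "$variant" "$file")" ]
}

# Example usage (paths and keys taken from this PR):
#   images_diverge .github/configs/amd-master.yaml \
#     dsv4-fp4-mi355x-vllm dsv4-fp4-mi355x-vllm-mtp && echo "images differ"
```

A plain awk scan keeps the check dependency-free; a real YAML parser would be more robust if the file uses anchors, quoting, or a different indentation style.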

model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: mi355x
@@ -12,7 +12,7 @@ set -eo pipefail
# prompts silently regresses the acceptance rate.
#
# All other serving flags mirror the non-MTP MI355X recipe (TP=8,
- # VLLM_ROCM_USE_AITER=1, triton_unfused MoE, FP8 KV cache, mp executor, async
+ # VLLM_ROCM_USE_AITER=1, AITER MoE, FP8 KV cache, mp executor, async
# scheduling, mode=3 FULL_AND_PIECEWISE compilation). See
# dsv4_fp4_mi355x_vllm.sh for per-flag rationale.

@@ -40,6 +40,7 @@ if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
fi

export VLLM_ROCM_USE_AITER=1
+ export VLLM_ROCM_USE_AITER_MOE=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
@@ -74,7 +75,7 @@ vllm serve $MODEL --port $PORT \
--gpu-memory-utilization 0.8 \
--kv-cache-dtype fp8 \
--trust-remote-code \
- --moe-backend triton_unfused \
+ --moe-backend aiter \
--tokenizer-mode deepseek_v4 \
--reasoning-parser deepseek_v4 \
--speculative-config "{\"method\": \"mtp\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \
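
The `--speculative-config` argument in the hunk above is a JSON string assembled with backslash-escaped quotes. As an illustration only (not a change proposed in this PR), the same string could be produced by a small function that also validates the token count. `spec_config_json` is a hypothetical name; the JSON keys are the ones the recipe already passes.

```shell
# Hypothetical helper: build the JSON value for --speculative-config
# without backslash-escaped quotes.
spec_config_json() {
  local num_tokens="$1"
  # Reject empty or non-numeric input so malformed JSON is never emitted.
  case "$num_tokens" in
    ''|*[!0-9]*)
      echo "num_speculative_tokens must be a non-negative integer" >&2
      return 1
      ;;
  esac
  printf '{"method": "mtp", "num_speculative_tokens": %s}' "$num_tokens"
}

# Example usage inside the recipe (other flags omitted):
#   vllm serve "$MODEL" --port "$PORT" \
#     --speculative-config "$(spec_config_json "$NUM_SPEC_TOKENS")"
```

Single-quoting the format string keeps the JSON readable, and the explicit check surfaces an unset `NUM_SPEC_TOKENS` before the server starts.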
9 changes: 9 additions & 0 deletions perf-changelog.yaml
@@ -4407,3 +4407,12 @@
description:
- "Bump SGLang image from lmsysorg/sglang:deepseek-v4-blackwell (digest sha256:df18bfc4...) to mainline nightly lmsysorg/sglang:nightly-dev-cu13-20260628-da802ddc."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1923

+ - config-keys:
+ - dsv4-fp4-mi355x-vllm-mtp
+ description:
+ - "Bump DeepSeek-V4-Pro FP4 MI355X single-node vLLM MTP image from vllm/vllm-openai-rocm:v0.22.0 to the latest nightly vllm/vllm-openai-rocm:nightly-09663abde0f50944a8d5ea30120666024b503faa."
+ - "The nightly enables two-stage attention kernels (split-KV decode), which reduce decode attention latency across all concurrency levels."
+ - "Employ the AITER MLA attention backend for the DeepSeek-V4 MLA path."
+ - "Switch the MoE backend from triton_unfused to AITER MoE (VLLM_ROCM_USE_AITER_MOE=1 + --moe-backend aiter) for the FP4 experts."
+ pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1981
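
The changelog entry pairs two settings for the MoE switch: the `VLLM_ROCM_USE_AITER_MOE=1` environment variable and the `--moe-backend aiter` flag. A recipe could guard that pairing with a preflight check along these lines. This is a sketch under assumptions: `moe_backend_consistent` is a hypothetical helper, and it only mirrors the convention used in this recipe. Whether vLLM itself requires both settings together is not established by this PR.

```shell
# Hypothetical preflight check: when the recipe selects the "aiter" MoE
# backend, require the AITER environment variables it exports alongside it.
# Other backends pass without any check.
moe_backend_consistent() {
  local backend="$1"
  if [ "$backend" = "aiter" ]; then
    [ "${VLLM_ROCM_USE_AITER:-0}" = "1" ] &&
      [ "${VLLM_ROCM_USE_AITER_MOE:-0}" = "1" ]
  fi
}

# Example usage before launching the server:
#   moe_backend_consistent aiter || {
#     echo "aiter MoE backend selected but AITER env vars are not set" >&2
#     exit 1
#   }
```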