28 changes: 28 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -11848,6 +11848,34 @@ qwen3.5-fp8-h100-sglang-agentic:
- { tp: 8, ep: 8, offloading: none, conc-list: [1, 2, 4, 8, 12, 14, 16] }
- { tp: 8, ep: 8, offloading: hicache, conc-list: [12, 14, 16, 20, 24, 28, 32, 42] }

minimaxm3-fp4-b200-dynamo-vllm:
  image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41
  model: nvidia/MiniMax-M3-NVFP4
  model-prefix: minimaxm3
  runner: b200-multinode
  precision: fp4
  framework: dynamo-vllm
  multinode: true
  disagg: true
  scenarios:
    fixed-seq-len:
    - isl: 8192
      osl: 1024
      search-space:
      - conc-list: [2048]
        prefill:
          num-worker: 2
          tp: 2
          ep: 2
          dp-attn: true
          additional-settings:
          - "CONFIG_FILE=recipes/vllm/minimax-m3/b200-fp4/8k1k/2p1d-dep2-dep8-8k1k.yaml"
        decode:
          num-worker: 1
          tp: 8
          ep: 8
          dp-attn: true

minimaxm3-fp8-b300-dynamo-vllm:
  image: vllm/vllm-openai:minimax-m3-0618-x86_64-cu130
  model: MiniMaxAI/MiniMax-M3-MXFP8
85 changes: 85 additions & 0 deletions benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3/b200-fp4/8k1k/2p1d-dep2-dep8-8k1k.yaml
@@ -0,0 +1,85 @@
name: "minimax-m3-vllm-disagg-b200-2p1d-fp4-dep2-dep8-8k1k"

model:
  path: "nvidia/MiniMax-M3-NVFP4"
  container: "vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41"
  precision: "fp4"
Comment on lines +3 to +6

🔴 The launcher's model-path alias key and the recipe's model.path don't match for this new b200-dgxc + minimaxm3-fp4 pairing: runners/launch_b200-dgxc.sh:74 exports SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4", but the new recipe at line 4 uses path: "nvidia/MiniMax-M3-NVFP4", so srtctl's model_paths lookup misses. Every other analogous case in the tree (b300-nv minimaxm3-fp4, b200 minimaxm2.5-fp4/fp8, b200 dsv4-fp4) has the launcher prefix exactly equal to the recipe path. Fix by changing one side to match the other: either set SRT_SLURM_MODEL_PREFIX="nvidia/MiniMax-M3-NVFP4" for the minimaxm3-fp4 branch on b200-dgxc (mirroring runners/launch_b300-nv.sh:52), or change the new recipe's model.path to "minimax-m3-nvfp4".

Extended reasoning...

The mismatch

runners/launch_b200-dgxc.sh:71-74 (pre-existing from PR #1932) sets up the minimaxm3-fp4 model resolution:

elif [[ $MODEL_PREFIX == "minimaxm3" && $PRECISION == "fp4" ]]; then
    # NVFP4 checkpoint, pre-staged on the b200-dgxc scratch tree.
    export MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4"
    export SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4"

The launcher then writes srtslurm.yaml (around line 157):

model_paths:
  "minimax-m3-nvfp4": "/scratch/fsw/models/MiniMax-M3-NVFP4"

But the new recipe at benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3/b200-fp4/8k1k/2p1d-dep2-dep8-8k1k.yaml:4 uses:

model:
  path: "nvidia/MiniMax-M3-NVFP4"

srtctl looks up "nvidia/MiniMax-M3-NVFP4" in model_paths — the only registered alias is "minimax-m3-nvfp4", so the lookup misses.
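
The miss can be reproduced with a small sketch. It assumes srtctl resolves model.path by exact string match against the keys of model_paths. The `lookup_alias` helper and the temporary file are hypothetical stand-ins for illustration, not srtctl's actual implementation.

```bash
# Hypothetical stand-in for the alias lookup; srtctl's real logic may differ.
# Assumes each model_paths entry has the form:   "alias": "/local/path"

cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
model_paths:
  "minimax-m3-nvfp4": "/scratch/fsw/models/MiniMax-M3-NVFP4"
EOF

# Print the local path registered for alias $1 in file $2, or return 1 if none.
lookup_alias() {
    # -F treats the alias as a fixed string, so "/" and "." are not regex syntax.
    line="$(grep -F "\"$1\": " "$2")" || return 1
    # Drop everything up to the opening quote of the value, then the closing quote.
    line="${line#*\": \"}"
    printf '%s\n' "${line%\"}"
}

# The alias the launcher registers resolves to the pre-staged checkpoint.
lookup_alias "minimax-m3-nvfp4" "$cfg"

# The value the recipe actually asks for is not a registered key.
lookup_alias "nvidia/MiniMax-M3-NVFP4" "$cfg" \
    || echo "no alias registered for nvidia/MiniMax-M3-NVFP4"

rm -f "$cfg"
```

Under that exact-match assumption, only a key that is byte-for-byte equal to the recipe's model.path can resolve to the scratch directory.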

Why this PR is the trigger

The pre-existing lines 71-74 were previously only exercised through the single-node code path (the else branch at the bottom of the launcher), which never touches SRT_SLURM_MODEL_PREFIX — it mounts $MODEL_PATH directly via --container-mounts and sets MODEL=$MODEL_PATH. So the mismatch was benign.

This PR adds the new elif at launch_b200-dgxc.sh:116-121, which for the first time routes minimaxm3-fp4 through the srtctl / srt-slurm code path, the path that actually consumes SRT_SLURM_MODEL_PREFIX as an alias key in srtslurm.yaml. The pre-existing, previously latent mismatch therefore becomes load-bearing in this PR.

Cross-check with every other b200/b300 case

| Launcher case | SRT_SLURM_MODEL_PREFIX | Recipe model.path | Match? |
|---|---|---|---|
| b300-nv minimaxm3-fp4 (launch_b300-nv.sh:52) | nvidia/MiniMax-M3-NVFP4 | nvidia/MiniMax-M3-NVFP4 | Yes |
| b300-nv minimaxm3-fp8 (launch_b300-nv.sh:55) | MiniMaxAI/MiniMax-M3-MXFP8 | MiniMaxAI/MiniMax-M3-MXFP8 | Yes |
| b200-dgxc minimaxm2.5-fp4 | minimax-m2.5-nvfp4 | minimax-m2.5-nvfp4 | Yes |
| b200-dgxc minimaxm2.5-fp8 | minimax-m2.5-fp8 | minimax-m2.5-fp8 | Yes |
| b200-dgxc dsv4-fp4 | deepseek-v4-pro | deepseek-v4-pro | Yes |
| b200-dgxc minimaxm3-fp4 (this PR) | minimax-m3-nvfp4 | nvidia/MiniMax-M3-NVFP4 | No |

Every other pairing in the tree matches exactly; the new b200-dgxc minimaxm3-fp4 case is the sole outlier. The b300 minimaxm3-fp4 case in particular is instructive because the new b200 recipe is a direct port of the b300 4p2d-dep2-dep8-8k1k recipe (per the PR description), so it inherits nvidia/MiniMax-M3-NVFP4 — which matches on b300 but not on b200.
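
The cross-check in the table above could be automated along these lines. This is only a sketch: the pairs are copied by hand from the table rather than parsed out of the launchers and recipes, and `check_pairs` is a hypothetical helper name.

```bash
# Rows copied from the table above: case|launcher prefix|recipe model.path
pairs='b300-nv minimaxm3-fp4|nvidia/MiniMax-M3-NVFP4|nvidia/MiniMax-M3-NVFP4
b300-nv minimaxm3-fp8|MiniMaxAI/MiniMax-M3-MXFP8|MiniMaxAI/MiniMax-M3-MXFP8
b200-dgxc minimaxm2.5-fp4|minimax-m2.5-nvfp4|minimax-m2.5-nvfp4
b200-dgxc minimaxm2.5-fp8|minimax-m2.5-fp8|minimax-m2.5-fp8
b200-dgxc dsv4-fp4|deepseek-v4-pro|deepseek-v4-pro
b200-dgxc minimaxm3-fp4|minimax-m3-nvfp4|nvidia/MiniMax-M3-NVFP4'

# Read "case|prefix|path" rows from stdin; report each mismatch, return 1 if any.
check_pairs() {
    status=0
    while IFS='|' read -r name prefix path; do
        if [ "$prefix" != "$path" ]; then
            echo "MISMATCH: $name: prefix=$prefix path=$path"
            status=1
        fi
    done
    return "$status"
}

printf '%s\n' "$pairs" | check_pairs || echo "at least one pairing is inconsistent"
```

With the rows as listed, the only pairing reported is the b200-dgxc minimaxm3-fp4 case.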

Step-by-step proof of the failure

  1. CI dispatches minimaxm3-fp4-b200-dynamo-vllm on the b200-multinode runner.
  2. launch_b200-dgxc.sh runs with IS_MULTINODE=true, MODEL_PREFIX=minimaxm3, PRECISION=fp4, FRAMEWORK=dynamo-vllm.
  3. Line 74 exports SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4" and MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4".
  4. The new elif at lines 116-121 fires, clones srt-slurm, and copies the recipe into recipes/vllm/minimax-m3/b200-fp4/.
  5. The cat > srtslurm.yaml <<EOF block writes model_paths: { "minimax-m3-nvfp4": "/scratch/fsw/models/MiniMax-M3-NVFP4" }.
  6. srtctl apply -f $CONFIG_FILE is invoked; srtctl parses the recipe and reads model.path: "nvidia/MiniMax-M3-NVFP4".
  7. srtctl checks model_paths for the key "nvidia/MiniMax-M3-NVFP4" — not present.
  8. Outcome A: srtctl errors on unknown alias and the job fails immediately at srtctl apply. Outcome B: srtctl treats the unmatched value as a HuggingFace hub identifier and attempts to download nvidia/MiniMax-M3-NVFP4 from the hub on every job invocation, negating the pre-staging that the comment on launch_b200-dgxc.sh:72 explicitly relies on.

Either outcome fails the full-sweep check: the job errors out at model resolution, or the HF pull fills the container filesystem or times out the runner. The PR is labeled full-sweep-fail-fast-no-canary, so this will surface as a fail-fast failure.
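
A launcher-side guard could turn either outcome into an immediate, readable failure before srtctl apply runs. The sketch below is not part of this PR. `recipe_model_path` and `check_alias` are hypothetical helpers, and the extraction assumes the first `path:` key in the recipe is model.path.

```bash
# Print the value of the first `path:` key in recipe file $1.
# Assumption: that first key is model.path, as in the recipe added by this PR.
recipe_model_path() {
    sed -n 's/^[[:space:]]*path:[[:space:]]*"\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/\1/p' "$1" \
        | head -n 1
}

# Return 0 when launcher alias $1 equals the model.path of recipe file $2.
check_alias() {
    want="$(recipe_model_path "$2")"
    if [ "$1" != "$want" ]; then
        echo "alias '$1' does not match recipe model.path '$want'" >&2
        return 1
    fi
}

# Minimal stand-in recipe containing only the fields the guard reads.
recipe="$(mktemp)"
cat > "$recipe" <<'EOF'
name: "example"
model:
  path: "nvidia/MiniMax-M3-NVFP4"
EOF

check_alias "minimax-m3-nvfp4" "$recipe" || echo "would abort before srtctl apply"
check_alias "nvidia/MiniMax-M3-NVFP4" "$recipe" && echo "alias and recipe agree"
rm -f "$recipe"
```

A real guard would call `check_alias "$SRT_SLURM_MODEL_PREFIX" "$CONFIG_FILE"` and exit non-zero on failure. Whether `$CONFIG_FILE` is resolvable at that point in the launcher is an assumption.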

Fix

One-liner in either direction. The b300 side of the tree is the reference implementation, so the least surprising change is to line 74 of runners/launch_b200-dgxc.sh:

 elif [[ $MODEL_PREFIX == "minimaxm3" && $PRECISION == "fp4" ]]; then
     # NVFP4 checkpoint, pre-staged on the b200-dgxc scratch tree.
     export MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4"
-    export SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4"
+    export SRT_SLURM_MODEL_PREFIX="nvidia/MiniMax-M3-NVFP4"

This also matches runners/launch_b300-nv.sh:52 verbatim, keeping the two clusters consistent for the same model+precision.
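
As a rough illustration of the fixed state, the sketch below renders only the model_paths fragment from the two exported variables. The launcher's actual heredoc writes more than this, and `render_model_paths` is a hypothetical name used here for illustration.

```bash
# Values as they would be exported by the fixed minimaxm3-fp4 branch.
export MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4"
export SRT_SLURM_MODEL_PREFIX="nvidia/MiniMax-M3-NVFP4"

# Emit the model_paths mapping: alias key first, local checkpoint path second.
render_model_paths() {
    cat <<EOF
model_paths:
  "${SRT_SLURM_MODEL_PREFIX}": "${MODEL_PATH}"
EOF
}

render_model_paths
```

The rendered key is then the same string as the recipe's model.path, so an exact-match lookup would resolve to the pre-staged scratch directory.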


resources:
  gpu_type: "b200"
  gpus_per_node: 8
  prefill_nodes: 1
  decode_nodes: 1
  prefill_workers: 2
  decode_workers: 1
  gpus_per_prefill: 2
  gpus_per_decode: 8

dynamo:
  install: true
  version: 1.3.0.dev20260614

frontend:
  type: dynamo
  enable_multiple_frontends: false

backend:
  type: vllm
  connector: null

  prefill_environment:
    VLLM_FLOAT32_MATMUL_PRECISION: high
    UCX_TLS: "cuda_copy,rc"

  decode_environment:
    VLLM_FLOAT32_MATMUL_PRECISION: high
    UCX_TLS: "cuda_copy,rc"

  vllm_config:
    prefill:
      no-enable-flashinfer-autotune: true
      tensor-parallel-size: 1
      data-parallel-size: 2
      data-parallel-rpc-port: 13345
      enable-expert-parallel: true
      trust-remote-code: true
      no-enable-prefix-caching: true
      kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
      attention-config: '{"backend": "FLASHINFER", "use_trtllm_attention": true, "indexer_kv_dtype": "fp8"}'
      kv-cache-dtype: fp8
      block-size: 128
      gpu-memory-utilization: 0.90
      max-model-len: 9472
      language-model-only: true
      stream-interval: 32

    decode:
      no-enable-flashinfer-autotune: true
      tensor-parallel-size: 1
      data-parallel-size: 8
      data-parallel-rpc-port: 13345
      enable-expert-parallel: true
      trust-remote-code: true
      no-enable-prefix-caching: true
      kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
      attention-config: '{"backend": "FLASHINFER", "use_trtllm_attention": true, "indexer_kv_dtype": "fp8"}'
      kv-cache-dtype: fp8
      block-size: 128
      gpu-memory-utilization: 0.90
      max-model-len: 9472
      language-model-only: true
      stream-interval: 32
      max-num-seqs: 1024 # One DP8 decode worker provides 8 DP ranks.
      max-num-batched-tokens: 16384
      max-cudagraph-capture-size: 4096

health_check:
  max_attempts: 360
  interval_seconds: 10

benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "2048"
  req_rate: "inf"
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -4407,3 +4407,11 @@
  description:
    - "Bump SGLang image from lmsysorg/sglang:deepseek-v4-blackwell (digest sha256:df18bfc4...) to mainline nightly lmsysorg/sglang:nightly-dev-cu13-20260628-da802ddc."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1923

- config-keys:
    - minimaxm3-fp4-b200-dynamo-vllm
  description:
    - "Add MiniMax-M3 NVFP4 B200 Dynamo-vLLM disaggregated 8k1k configuration at concurrency 2048."
    - "Port the B300 4P2D DEP2/DEP8 recipe to a B200 2P1D topology using one prefill node and one decode node."
    - "Use the b200-multinode runner and vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41; omit max-cudagraph-capture-size and max-num-batched-tokens from prefill."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1982
6 changes: 6 additions & 0 deletions runners/launch_b200-dgxc.sh
@@ -113,6 +113,12 @@ if [[ "$IS_MULTINODE" == "true" ]]; then
        git checkout aflowers/vllm-gb200-v0.20.0
        mkdir -p recipes/vllm/deepseek-v4
        cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4" recipes/vllm/deepseek-v4
    elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm3" && $PRECISION == "fp4" ]]; then
        git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
        cd "$SRT_REPO_DIR" || exit 1
        git checkout main
        mkdir -p recipes/vllm/minimax-m3/b200-fp4
        cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3/b200-fp4" recipes/vllm/minimax-m3/b200-fp4
    elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" && $PRECISION == "fp4" ]]; then
        git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
        cd "$SRT_REPO_DIR" || exit 1