SemiAnalysisAI · Oseltamivir · Jun 23, 2026
diff --git a/.github/workflows/collectivex-experimental.yml b/.github/workflows/collectivex-experimental.yml
diff --git a/.github/workflows/collectivex-sweep.yml b/.github/workflows/collectivex-sweep.yml
@@ -0,0 +1,210 @@
+# CollectiveX Sweep — one structured run instead of thousands of dispatches.
+#
+# Shape (mirrors the InferenceX CI tracker): setup -> sweep (a MATRIX job = "a job with other jobs
+# in it") -> aggregate (the collector "at the end"). The matrix unit is a SHARD = one allocation that
+# sweeps many cases sharing (sku, backend, mode, resource) — generate_matrix's own grouping, chunked
+# so no cell exceeds the job budget. Each cell emits a handful of per-case JSONs; the aggregate job
+# collects every shard into ONE line-delimited file (results/aggregate/*.ndjson) so there aren't
+# thousands of individual result files. For full parity, run backend=all (one combined matrix)
+# or once per backend (deepep / uccl / flashinfer / deepep-hybrid / nccl-ep, plus deepep_v2).
+name: CollectiveX Sweep
+on:
+  workflow_dispatch:
+    inputs:
+      backend:
+        description: "EP library to sweep — 'all' = every backend in ONE combined matrix run (recommended)"
+        type: choice
+        default: all
+        options: [all, deepep, uccl, flashinfer, deepep-hybrid, nccl-ep]
+      deepep_v2:
+        description: DeepEP V2 from-source kernels (kernel_gen=v2; only for a single-backend deepep run — 'all' already includes a deepep-v2 variant)
+        type: boolean
+        default: false
+      suites:
+        description: "'all' or comma-list of suite names"
+        type: string
+        default: all
+      only_sku:
+        description: Restrict to one SKU (h100-dgxc|h200|b300|b200-dgxc|gb200|gb300|mi355x); blank = all
+        type: string
+        default: ''
+      min_nodes:
+        description: Keep only shards with >= this tray count (2 = rack-scale EP8 only; blank = all)
+        type: string
+        default: ''
+      max_nodes:
+        description: Keep only shards with <= this tray count (1 = single-tray EP4 only; blank = all)
+        type: string
+        default: ''
+      max_cases:
+        description: Max cases per shard cell before chunking into another GHA job (128 = no chunking for current suites)
+        type: string
+        default: '128'
+
+concurrency:
+  group: cx-sweep-${{ github.ref }}-${{ inputs.backend }}-${{ inputs.deepep_v2 }}-${{ inputs.only_sku }}
+  cancel-in-progress: false
+
+jobs:
+  # ---- setup: resolve the suites into the shard matrix (the "pending jobs" node) ----
+  setup:
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.gen.outputs.matrix }}
+      n: ${{ steps.gen.outputs.n }}
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0
+        with: { clean: true }
+      - run: pip install --quiet pyyaml
+      - id: gen
+        working-directory: experimental/CollectiveX
+        run: |
+          set -euo pipefail
+          # backend='all' or a comma-list -> ONE combined multi-backend matrix; else a single backend.
+          case "${{ inputs.backend }}" in
+            all|*,*) bk="--backends ${{ inputs.backend }}" ;;
+            deepep)  bk="" ;;
+            *)       bk="--backend ${{ inputs.backend }}" ;;
+          esac
+          v2=""; [ "${{ inputs.deepep_v2 }}" = "true" ] && v2="--deepep-v2"
+          os=""; [ -n "${{ inputs.only_sku }}" ] && os="--only-sku ${{ inputs.only_sku }}"
+          mn=""; [ -n "${{ inputs.min_nodes }}" ] && mn="--min-nodes ${{ inputs.min_nodes }}"
+          xn=""; [ -n "${{ inputs.max_nodes }}" ] && xn="--max-nodes ${{ inputs.max_nodes }}"
+          # full matrix (with cases) -> artifact for the cells; slim (no cases) -> the strategy output.
+          python3 sweep_matrix.py --suites "${{ inputs.suites }}" --max-cases "${{ inputs.max_cases }}" $bk $v2 $os $mn $xn --out matrix_full.json >/dev/null
+          SLIM=$(python3 -c "import json;m=json.load(open('matrix_full.json'));print(json.dumps({'include':[{k:v for k,v in x.items() if k!='cases'} for x in m['include']]}))")
+          echo "matrix=$SLIM" >> "$GITHUB_OUTPUT"
+          echo "n=$(python3 -c "import json;print(len(json.load(open('matrix_full.json'))['include']))")" >> "$GITHUB_OUTPUT"
+          python3 -c "import json;m=json.load(open('matrix_full.json'));print('shard-cells:',len(m['include']),'cases:',sum(x['n'] for x in m['include']))"
+      - uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
+        with:
+          name: cxsweep-matrix-${{ github.run_id }}
+          path: experimental/CollectiveX/matrix_full.json
+          if-no-files-found: error
+
+  # ---- sweep: ONE matrix cell per shard (the parent job with child jobs) ----
+  sweep:
+    needs: setup
+    if: ${{ fromJSON(needs.setup.outputs.n) > 0 }}
+    strategy:
+      fail-fast: false
+      max-parallel: 10            # don't saturate the ~20-runner fleet; cells queue as slots free
+      matrix: ${{ fromJSON(needs.setup.outputs.matrix) }}
+    # h200 label spans two clusters; pin to the validated dgxc pool (mirrors collectivex-experimental).
+    runs-on: ${{ matrix.sku == 'h200' && 'h200-dgxc' || matrix.sku }}
+    timeout-minutes: 350
+    env:
+      CX_BENCH: ${{ matrix.backend }}
+      CX_DEEPEP_V2: ${{ matrix.deepep_v2 && '1' || '' }}
+      CX_NODES: ${{ matrix.nodes }}
+      CX_SHARD_FILE: results/.shard_${{ matrix.id }}.json
+      COLLECTIVEX_SOURCE_SHA: ${{ github.sha }}
+      # Consolidated shards run a whole build-group (up to ~74 cases) + one from-source build in ONE
+      # slurm allocation, so the launcher's default 45-min --time is too short. 120 min gives headroom;
+      # the allocation releases early when the shard finishes, so short shards don't waste it.
+      CX_TIME: '120'
+      CX_NODELIST: ${{ matrix.sku == 'mi355x' && 'mia1-p01-g10,mia1-p01-g15' || '' }}
+      CX_STAGE_DIR: ${{ matrix.sku == 'gb200' && '/mnt/lustre01/users-public/sa-shared/cx-stage' || '' }}
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0
+        with: { clean: true }
+      - uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
+        with:
+          name: cxsweep-matrix-${{ github.run_id }}
+          path: experimental/CollectiveX
+      - name: Extract this shard's cases (stdlib only — no runner deps)
+        working-directory: experimental/CollectiveX
+        run: |
+          set -euo pipefail
+          python3 -c "
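+          import json, os; os.makedirs('results', exist_ok=True)  # results/*.json is gitignored, so the dir may be absent on a clean checkout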
+          m=json.load(open('matrix_full.json'))
+          s=[x for x in m['include'] if x['id']=='${{ matrix.id }}']
+          assert s, 'shard ${{ matrix.id }} not in matrix'
+          s=s[0]
+          json.dump({'id':s['id'],'sku':s['sku'],'backend':s['backend'],'nodes':s['nodes'],'deepep_v2':s['deepep_v2'],'cases':s['cases']}, open('results/.shard_${{ matrix.id }}.json','w'))
+          print('shard ${{ matrix.id }}:', len(s['cases']), 'cases')
+          "
+      - name: Sweep shard ${{ matrix.id }} (${{ matrix.n }} cases, one allocation)
+        env:
+          RUNNER_NAME: ${{ runner.name }}
+        run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"
+      - name: Shard summary
+        if: always()
+        run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY" || true
+      - name: Upload shard results
+        if: always()
+        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
+        with:
+          name: cxshard-${{ matrix.id }}-${{ github.run_id }}
+          path: experimental/CollectiveX/results/*.json   # glob skips the hidden .shard_*.json
+          if-no-files-found: warn
+
+  # ---- aggregate: collect every shard into ONE ndjson (the "result aggregator at the end") ----
+  aggregate:
+    needs: sweep
+    if: always()
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0
+        with: { clean: true }
+      - uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
+        with:
+          pattern: cxshard-*-${{ github.run_id }}
+          path: _shards
+          merge-multiple: true
+      - name: Aggregate shards -> one ndjson
+        working-directory: experimental/CollectiveX
+        run: |
+          set -euo pipefail
+          tag="${{ inputs.backend }}${{ inputs.deepep_v2 && '-v2' || '' }}"
+          python3 aggregate_results.py --in-dir ../../_shards --out "results/aggregate/collectivex_${tag}_${{ github.run_id }}.ndjson"
+          {
+            echo "## CollectiveX sweep aggregate (${tag})"
+            echo '```'
+            wc -l results/aggregate/*.ndjson 2>/dev/null || echo "no ndjson"
+            echo '```'
+          } >> "$GITHUB_STEP_SUMMARY"
+      - name: Upload aggregate
+        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
+        with:
+          name: cxsweep-aggregate-${{ inputs.backend }}${{ inputs.deepep_v2 && '-v2' || '' }}-${{ github.run_id }}
+          path: experimental/CollectiveX/results/aggregate/*.ndjson
+          if-no-files-found: warn
+
+  update-frontend-snapshot:
+    name: Update InferenceX-app snapshot
+    needs: aggregate
+    if: always() && needs.aggregate.result == 'success'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Trigger CollectiveX snapshot update
+        env:
+          FRONTEND_PAT: ${{ secrets.INFX_FRONTEND_PAT }}
+        run: |
+          set -euo pipefail
+          tmp="$(mktemp -d)"
+          trap 'rm -rf "$tmp"' EXIT
+          git clone --quiet --depth 1 --branch collectivex \
+            "https://x-access-token:${FRONTEND_PAT}@github.com/SemiAnalysisAI/InferenceX-app.git" \
+            "$tmp/app"
+          cd "$tmp/app"
+          git pull --rebase origin collectivex
+          mkdir -p .github
+          {
+            echo "source_run_id=${{ github.run_id }}"
+            echo "source_sha=${{ github.sha }}"
+            echo "source_workflow=${{ github.workflow }}"
+            echo "source_run_url=https://github.com/SemiAnalysisAI/InferenceX/actions/runs/${{ github.run_id }}"
+            echo "triggered_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+          } > .github/collectivex-source-run.env
+
+          git config user.name "InferenceX Data Bot"
+          git config user.email "actions@users.noreply.github.com"
+          git add .github/collectivex-source-run.env
+          if git diff --cached --quiet; then
+            echo "CollectiveX source-run marker is already current."
+            exit 0
+          fi
+          git commit -m "chore: trigger CollectiveX data update for ${{ github.run_id }}"
+          git push origin HEAD:collectivex
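The setup job's full/slim split and the per-cell shard lookup in the workflow above can be sketched in standalone Python. This is an illustrative sketch under stated assumptions, not the code in `sweep_matrix.py`: the helper names `slim_matrix` and `extract_shard` and the sample matrix are hypothetical, and only the field names (`include`, `id`, `sku`, `backend`, `nodes`, `deepep_v2`, `n`, `cases`) are taken from the workflow.

```python
import json


def slim_matrix(full):
    """Return a copy of the matrix with the bulky 'cases' list removed from each cell."""
    return {
        "include": [
            {key: value for key, value in cell.items() if key != "cases"}
            for cell in full["include"]
        ]
    }


def extract_shard(full, shard_id):
    """Return the single cell whose 'id' equals shard_id; raise KeyError if absent."""
    matches = [cell for cell in full["include"] if cell["id"] == shard_id]
    if not matches:
        raise KeyError(f"shard {shard_id} not in matrix")
    return matches[0]


# Hypothetical two-shard matrix; real ids and case payloads come from sweep_matrix.py.
full = {
    "include": [
        {"id": "s0", "sku": "b300", "backend": "deepep", "nodes": 1,
         "deepep_v2": False, "n": 2, "cases": [{"name": "a"}, {"name": "b"}]},
        {"id": "s1", "sku": "gb200", "backend": "uccl", "nodes": 2,
         "deepep_v2": False, "n": 1, "cases": [{"name": "c"}]},
    ]
}

slim = slim_matrix(full)          # what would go to the strategy output
shard = extract_shard(full, "s1")  # what one matrix cell would read back
print(json.dumps(slim))
print("shard", shard["id"], "has", len(shard["cases"]), "cases")
```

The slim form keeps the job output passed to `strategy.matrix` small, while each cell recovers its own case list from the uploaded `matrix_full.json` artifact.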
diff --git a/experimental/CollectiveX/.gitignore b/experimental/CollectiveX/.gitignore
@@ -0,0 +1,22 @@
+# in-container nccl-tests build cache
+.nccl-tests/
+# python
+__pycache__/
+*.pyc
+# generated run artifacts: captured env embeds hostnames / GPU UUIDs / NIC GUIDs,
+# so keep results out of git (CI uploads them as workflow artifacts instead).
+# Sanitized headline numbers live in CONTAINERS.md.
+results/*.json
+results/plots/
+results/raw_*.txt
+results/raw_*.txt.stderr
+# superseded SSH-provenance result JSONs moved aside so plot_ep's recursive glob
+# won't double-load them; same hostname/UUID sensitivity as results/.
+_ssh_v4_archive/
+# running local-only reflection log (not a committed artifact)
+notes.md
+goal.md
+# superseded seeded-runtime GHA results (canonical counterpart exists); kept out of the plot glob
+_seeded_archive/
+# newest-good-per-config kept in results/; superseded runs moved here (out of the plot glob)
+_superseded/
diff --git a/experimental/CollectiveX/CONTAINERS.md b/experimental/CollectiveX/CONTAINERS.md
@@ -0,0 +1,75 @@
+# CollectiveX — container & library versions
+
+One **multi-arch, digest-pinned** container is used for all NVIDIA SKUs, so B200
+(x86_64) and GB200 (aarch64) share a single reference and the cross-architecture
+comparison is truly same-image. Set in `runtime/common.sh` (`cx_default_image`).
+
+## Default container (all NVIDIA SKUs)
+
+- **Image:** import by tag **`lmsysorg/sglang:v0.5.11-cu130`** (multi-arch OCI index). Expected index digest, recorded for provenance/verification: `sha256:061fb71f838e82000a1768c159654d526c2f17ebe751c21e7fc48ca53c8ef975`.
+- **Multi-arch manifest list:** linux/amd64 + linux/arm64; `enroot import` on each host pulls the matching arch.
+- **Import by TAG, not digest.** enroot builds its anonymous Docker Hub token scope from the *tag* and succeeds (no creds needed — same as the serving launchers). A bare `repo@sha256:` ref makes enroot prompt for a password and **hang** in non-interactive CI; a combined `tag@sha256:` ref 400s. `cx_ensure_squash` therefore imports by tag with `</dev/null` (a missing token fails fast instead of hanging). First import is multi-GB (~minutes); subsequent runs reuse the staged squash.
+- **Why v0.5.11-cu130 (chosen):** it's the newest cu130 release **pre-staged on BOTH clusters** — B200 `/home/sa-shared/containers/` (amd64 squash) and GB200 `/mnt/lustre01/users-public/sa-shared/` (arm64 squash), same filename — so neither side imports at all. (Shared cu130 multi-arch squashes across both clusters: v0.5.8.post1, v0.5.9, v0.5.11 — v0.5.11 is newest.) `v0.5.12-cu130` is staged on B200 but **not** GB200: its 62 layers overflow enroot's overlay-based squash creation on the GB200 kernel (`enroot-mksquashovlfs: failed to mount overlay … Invalid argument`), so it can't be the shared default.
+- **DeepEP: NOT bundled** here → `run_in_container.sh` builds it via `rebuild-deepep` at job setup (CX_BENCH=deepep). The NCCL path needs no DeepEP.
+- **nccl-tests build:** in-container (login nodes have no `nvcc`), `CX_NCCL_HOME=/usr` (system `nccl.h` in `/usr/include`), `CX_CUDA_HOME=/usr/local/cuda`. cu130 lineage ⇒ CUDA 13; confirm exact NCCL/torch on first run and append below.
+
+## Audited reference (cu130 lineage)
+
+Live audit of the sibling DeepSeek-V4 image `lmsysorg/sglang:deepseek-v4-grace-blackwell` (aarch64) on GB200, 2026-06-23 — the multi-arch `v0.5.11-cu130` should match closely (same cu130 base); reconfirm on first run:
+
+| Component | Version |
+|---|---|
+| OS / arch | Ubuntu 24.04.3, aarch64 |
+| CUDA (`nvcc`) | 13.0 (V13.0.88) |
+| NCCL (system `/usr/include/nccl.h`) | 2.28.3; torch-bundled 2.27.7 |
+| PyTorch | 2.9.1+cu130 |
+| DeepEP | bundled in *that* image; **not** in the multi-arch default |
+| NVSHMEM | `libnvshmem_host.so.3` present |
+| OpenMPI / gcc / make | 4.1.6 / 13.3.0 / 4.3 |
+| GPU / driver | GB200, 580.126.20 |
+
+**Version caveat:** the nccl-tests binary links **system NCCL** (2.28.x), while torch/DeepEP use the **bundled** NCCL (2.27.x). Record both in provenance (env_capture does); don't compare an nccl-tests curve against a DeepEP run as if NCCL were identical.
+
+## Bundled-DeepEP reference images (not the default)
+
+If a bundled DeepEP is needed before `rebuild-deepep` is wired on the multi-arch image, these arch-specific images bundle it (pin by digest):
+
+- B200 (amd64): `lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4aa9ecf59451002b49ba00cae58042de9e2a96378bbd21b404dd62c7b` (pre-staged on B200)
+- GB200 (arm64): `lmsysorg/sglang:deepseek-v4-grace-blackwell@sha256:4f583347d7ff08aef7e16dbb4985b2a7c147ff49a0c261d5e27b8f5f41719368` (staged on GB200 Lustre)
+
+Select via `CX_IMAGE=…@sha256:…` on the launch script.
+
+## AMD container (MI355X) — MoRI EP
+
+AMD CDNA4 cannot run the CUDA multi-arch image; MI355X uses a ROCm image that
+bundles **MoRI** (AMD's EP dispatch/combine library). Set in `cx_default_image`
+for `mi355x*` (also `mi350x*`/`mi325x*`/`mi300x*`).
+
+- **Image:** `rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0227-2` (single-arch ROCm 7.2.0 runtime; from the AMD master serving config). **Not digest-pinned yet** — record the digest here and pin once validated on the runner, like the NVIDIA image.
+- **MoRI:** bundled in-image (build tag `mori-0227`). `tests/ep_mori.py` follows the upstream `ROCm/mori` `tests`/`examples` dispatch+combine path; capture the exact MoRI commit (`MORI_COMMIT` env → provenance) on first run.
+- **Squash is NODE-LOCAL** (`/var/lib/squash`), not a shared FS, so `launch_mi355x-amds.sh` imports via `srun` on the allocated node (the NVIDIA adapters import on the login node onto shared FS). pyxis flags `--container-writable --container-remap-root` (matches the AMD serving launcher); workspace is bind-mounted directly (no `CX_STAGE_DIR`).
+- **Transport:** intra-node **XGMI** (8× MI355X). Two backends wired: `CX_BENCH=mori` (MoRI EP dispatch/combine) and `CX_BENCH=nccl` (collective primitives via **rccl-tests**, the ROCm nccl-tests fork — built in-container with `make` against `/opt/rocm`/`amdclang++`/`librccl`; same `<op>_perf` binaries + output format as nccl-tests, so `run_nccl.py` parses it unchanged).
+- **Validated on MI355X** (on-node via `salloc`+`srun`, nodes `mia1-p01-g10`/`g15`): `salloc` → enroot import (anonymous auth + tag, 24 layers → ~60 GB node-local squash) → torchrun → 8-rank Gloo + MoRI shmem → `EpDispatchCombineConfig`/dispatch/combine **numerically correct** (combine within tol, `max_rel ~2e-3`, ~85 µs round-trip at the decode shape). Three ionic_rdma-fabric constraints, all handled in `tests/ep_mori.py`:
+  - **RDMA MR size ceiling (~4 GiB).** MoRI registers the *entire* symmetric heap as one RDMA MR at init — even single-node (no disable-RDMA knob exists; only `MORI_DISABLE_P2P`, which forces the opposite). On these ionic NICs a 6 GiB MR fails (`RegisterRdmaMemoryRegion … errno 22 EINVAL`) while 2 GiB registers. Heap is held at **`MORI_SHMEM_HEAP_SIZE=2G`** (override `CX_MORI_HEAP_SIZE`). The reference test's hardcoded `6G` is exactly why it can't run as-is here.
+  - **Buffer sizing.** `max_num_inp_token_per_rank` is bounded (512 at the decode shape) so dispatch/combine buffers fit the 2 GiB heap. Much larger token counts would need a heap past the MR ceiling — out of reach on this fabric for now.
+  - **Teardown.** MoRI's shmem teardown asserts (`CheckStatusValid` → SIGABRT) when the op is destroyed after `shmem_finalize()`; `tests/ep_mori.py`'s `finalize()` hard-exits after writing results to avoid it.
+
+  Still TODO: capture the exact MoRI commit + a version table (ROCm/torch/RCCL) into provenance, and digest-pin the image.
+
+## Cluster access / QOS
+
+- **B200** (`slurm-login-slinky`): account `benchmark`, **only `gpu-2_qos`** → partition `gpu-2` only (shared with the serving sweep). `gpu-1`/`all` (idle) need `gpu-1_qos`/`all_qos`, not associated with this account.
+- **GB200** (`watchtower`): account `benchmark`, qos `normal`, partition `batch` (`AllowQos=ALL`); idle capacity available. Runner workspace is **not** compute-visible → set `CX_STAGE_DIR` to a Lustre path (the launcher rsyncs there).
+
+## First real results (Milestone-0 spike, on the DeepSeek-V4 images)
+
+nccl-tests (system NCCL 2.28.3), all correctness-passed, peak bus-bw:
+
+| op | B200 8× (NVLink island, x86_64), GB/s | GB200 4× (NVL72 MNNVL, aarch64), GB/s |
+|---|---|---|
+| all_reduce | 835 | 689 |
+| all_gather | 653 | 658 |
+| reduce_scatter | 667 | 661 |
+| alltoall | 638 | 666 |
+
+(B200 vs GB200 carry distinct `comparison_key`s by topology-class, so they are labelled-distinct, not silently merged. Re-run on the multi-arch default to refresh under one image.)
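The import-by-tag rule in CONTAINERS.md (import with the tag, keep the digest only for verification) can be illustrated with a small string helper. This is a sketch under stated assumptions: the function name `split_image_ref` is hypothetical and is not part of `runtime/common.sh` or `cx_ensure_squash`, and it only parses the reference; it does not call enroot or contact a registry.

```python
def split_image_ref(ref):
    """Split 'repo:tag[@sha256:...]' into (tag_ref, digest_or_None)."""
    tag_ref, sep, digest = ref.partition("@")
    # The tag, if any, follows a ':' in the final path component
    # (a ':' earlier in the string could be a registry port).
    if ":" not in tag_ref.rsplit("/", 1)[-1]:
        raise ValueError(f"no tag in {ref!r}; a bare repo@sha256: ref cannot be imported by tag")
    return tag_ref, (digest if sep else None)


# Tag and index digest as recorded in CONTAINERS.md.
pinned = ("lmsysorg/sglang:v0.5.11-cu130"
          "@sha256:061fb71f838e82000a1768c159654d526c2f17ebe751c21e7fc48ca53c8ef975")
tag_ref, digest = split_image_ref(pinned)
print(tag_ref)  # reference to hand to the importer
print(digest)   # value to compare against the recorded digest afterwards
```

Rejecting a reference with no tag mirrors the failure mode described above, where a bare `repo@sha256:` reference makes the import prompt for a password in non-interactive CI.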