
feat(utils.sh): add ezpz_load_modules_{aurora,sunspot,polaris} #131

Open
saforem2 wants to merge 1 commit into `main` from `feat/load-modules-functions`

Conversation

@saforem2 (Owner) commented May 5, 2026

Summary

Adds three symmetric module-loader functions to `bin/utils.sh`:

  • `ezpz_load_modules_aurora`
  • `ezpz_load_modules_sunspot`
  • `ezpz_load_modules_polaris`

Each loads everything its system's canonical environment module would
have pulled in (the framework module on Aurora/Sunspot, the conda
module on Polaris) except the framework/conda module itself — so
a user can stand up their own `uv venv` on top of the system stack
without dragging in the prebuilt PyTorch / Python.

Pairs naturally with `ezpz tar-env` + `ezpz yeet`:

```bash
ezpz_load_modules_polaris # or _aurora / _sunspot
uv venv --python=$(which python3)
source .venv/bin/activate
uv pip install ...
ezpz tar-env
ezpz yeet .venv.tar.gz
```

Implementation notes

  • Aurora loads `oneapi/release/2025.3.1 hdf5 pti-gpu` (matches
    `ezpz_setup_xpu`) and exports XPU runtime envvars + Aurora-specific
    `FI_MR_CACHE_MONITOR=userfaultfd` (matches
    `ezpz_setup_conda_aurora`).
  • Sunspot is the same minus the FI_MR_CACHE_MONITOR (matches
    `ezpz_setup_conda_sunspot`).
  • Polaris loads the same module deps the conda module declares
    (`PrgEnv-gnu`, `craype-x86-milan`, `cray-hdf5-parallel`, `cudnn`,
    `gcc-native`) and exports the matching envvar set (CUDA paths,
    NCCL/TensorRT lib paths, ALCF proxy, MPI/JAX hints).
    Reverse-engineered from
    `/soft/modulefiles/conda/2025-09-25.lua` to stay in sync with the
    canonical module's behavior.
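
The notes above suggest a shape roughly like the following sketch. The module names and the NCCL/TensorRT treatment come from this PR's discussion, but the exact `CUDA_HOME` location and the variable subset shown here are illustrative assumptions, not the actual implementation:

```shell
# Rough sketch of ezpz_load_modules_polaris; the CUDA_HOME path is an
# illustrative guess, not necessarily the path the PR uses.
ezpz_load_modules_polaris() {
    module use /soft/modulefiles
    module load PrgEnv-gnu craype-x86-milan cray-hdf5-parallel cudnn gcc-native

    # Compiler shims for building C/C++ extensions.
    export CC=gcc-14 CXX=g++-14

    # CUDA toolkit paths (location assumed for illustration).
    export CUDA_HOME="/soft/compilers/cudatoolkit/cuda-12.9.1"
    export CUDA_PATH="${CUDA_HOME}"
    export CUDA_TOOLKIT_BASE="${CUDA_HOME}"
    export PATH="${CUDA_HOME}/bin:${PATH}"
    export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH:-}"

    # MPI/JAX hints.
    export MPICH_GPU_SUPPORT_ENABLED=1
    export MPI4JAX_USE_CUDA_MPI=1
}
```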

Test plan

  • `bash -n src/ezpz/bin/utils.sh` — syntax check
  • `source src/ezpz/bin/utils.sh && ezpz_load_modules_aurora` on a real Aurora login node — verify modules load and `which python3` resolves
  • Same on Sunspot
  • Same on Polaris
  • End-to-end on each system: `ezpz_load_modules_` → `uv venv --python=$(which python3)` → install `ezpz` → `ezpz test`

Summary by Sourcery

Add helper functions to load system module stacks for ALCF Aurora, Sunspot, and Polaris without activating their framework/conda environments, enabling custom virtualenvs on top of the canonical stacks.

New Features:

  • Introduce ezpz_load_modules_aurora to load Aurora’s oneAPI and related modules while configuring XPU runtime environment variables.
  • Introduce ezpz_load_modules_sunspot to load Sunspot’s oneAPI-based module stack and matching XPU runtime environment variables.
  • Introduce ezpz_load_modules_polaris to load Polaris’s CUDA, HDF5, and toolchain modules and set corresponding CUDA, NCCL, TensorRT, proxy, and MPI/JAX environment variables.

Symmetric to ezpz_setup_xpu, but covers the full ALCF module stack
each system needs for an externally-managed Python (e.g. fresh
`uv venv`) instead of the prebuilt frameworks/conda env.

For each system, loads everything the canonical environment module
would have loaded as deps, but NOT the framework/conda module
itself — so a user can do:

    ezpz_load_modules_aurora           # or _sunspot / _polaris
    uv venv --python=$(which python3)
    source .venv/bin/activate
    uv pip install ...

without dragging in the prebuilt PyTorch / Python.

- aurora / sunspot: load the same oneAPI + HDF5 + PTI stack as
  ezpz_setup_xpu, set the matching XPU runtime envvars (ZE flat
  hierarchy, CCL launcher, oneAPI device selector). Aurora also
  exports FI_MR_CACHE_MONITOR=userfaultfd for parity with
  ezpz_setup_conda_aurora.

- polaris: load PrgEnv-gnu, craype-x86-milan, cray-hdf5-parallel,
  cudnn, gcc-native — the same deps the polaris conda module pulls
  in. Mirror the conda module's envvar set: CC/CXX, CUDA paths +
  PATH/LD_LIBRARY_PATH, NCCL + TensorRT lib paths, ALCF
  http(s)_proxy, MPI/JAX hints. Reverse-engineered from
  /soft/modulefiles/conda/2025-09-25.lua so we stay in sync with
  the canonical module's behavior.
Copilot AI review requested due to automatic review settings May 5, 2026 14:03

Copilot AI left a comment


Copilot wasn't able to review any files in this pull request.



@sourcery-ai bot (Contributor) commented May 5, 2026

Reviewer's Guide

Adds three system-specific helper functions to load ALCF Aurora, Sunspot, and Polaris module stacks (mirroring their canonical frameworks/conda modules minus the actual framework/conda activation) so users can build custom Python virtualenvs atop the standard HPC environments, including matching runtime environment variables.

Sequence diagram for ezpz_load_modules_polaris and venv creation

```mermaid
sequenceDiagram
  actor User
  participant Shell
  participant ModuleSystem
  participant Environment

  User->>Shell: ezpz_load_modules_polaris
  Shell->>ModuleSystem: module use /soft/modulefiles
  ModuleSystem-->>Shell: Module search path updated
  Shell->>ModuleSystem: module load PrgEnv-gnu craype-x86-milan cray-hdf5-parallel cudnn gcc-native
  ModuleSystem-->>Shell: Modules loaded
  Shell->>Environment: export CC CXX
  Shell->>Environment: export CUDA_HOME CUDA_PATH CUDA_TOOLKIT_BASE
  Shell->>Environment: update PATH LD_LIBRARY_PATH for CUDA
  Shell->>Environment: set TORCH_CUDA_ARCH_LIST FLASHINFER_CUDA_ARCH_LIST
  Shell->>Environment: set NCCL and TensorRT paths
  Shell->>Environment: set http_proxy https_proxy
  Shell->>Environment: set MPICH_GPU_SUPPORT_ENABLED MPI4JAX_USE_CUDA_MPI
  Shell->>Environment: set XLA_FLAGS XLA_PYTHON_CLIENT_PREALLOCATE

  User->>Shell: uv venv --python=$(which python3)
  Shell->>ModuleSystem: resolve python3 from loaded stack
  ModuleSystem-->>Shell: /path/to/system/python3
  Shell->>Shell: create .venv with system python
  User->>Shell: source .venv/bin/activate
  Shell->>Environment: activate virtualenv
```

Flow diagram for selecting ezpz_load_modules function by system

```mermaid
flowchart TD
  Start([Start])
  S["Select target system"]
  A[Aurora]
  Su[Sunspot]
  P[Polaris]

  Start --> S
  S --> A
  S --> Su
  S --> P

  A --> A1["Call ezpz_load_modules_aurora"]
  A1 --> A2["module load oneapi/release/2025.3.1 hdf5 pti-gpu"]
  A2 --> A3["Export XPU runtime envvars"]
  A3 --> A4["Export FI_MR_CACHE_MONITOR=userfaultfd (if unset)"]

  Su --> Su1["Call ezpz_load_modules_sunspot"]
  Su1 --> Su2["module load oneapi/release/2025.3.1 hdf5 pti-gpu"]
  Su2 --> Su3["Export XPU runtime envvars"]

  P --> P1["Call ezpz_load_modules_polaris"]
  P1 --> P2["module use /soft/modulefiles"]
  P2 --> P3["module load PrgEnv-gnu craype-x86-milan cray-hdf5-parallel cudnn gcc-native"]
  P3 --> P4["Export CC CXX CUDA_* paths"]
  P4 --> P5["Export NCCL TensorRT HTTP proxy vars"]
  P5 --> P6["Export MPICH GPU JAX XLA envvars"]

  A4 --> V["uv venv --python=$(which python3) and activate"]
  Su3 --> V
  P6 --> V
  V --> End([Custom venv on system stack])
```

File-Level Changes

Change: Introduce `ezpz_load_modules_aurora` to load Aurora's oneAPI/HDF5/PTI-GPU stack and XPU runtime environment without activating the frameworks module.
  • Add an `ezpz_load_modules_aurora` function that loads the `oneapi/release/2025.3.1`, `hdf5`, and `pti-gpu` modules.
  • Export XPU-related environment variables (`ZE_FLAT_DEVICE_HIERARCHY`, `CCL_PROCESS_LAUNCHER`, `CCL_OP_SYNC`, `ONEAPI_DEVICE_SELECTOR`, `TORCH_CPP_LOG_LEVEL`).
  • Set the Aurora-specific `FI_MR_CACHE_MONITOR` default to `userfaultfd` to match the existing conda setup.
Files: `src/ezpz/bin/utils.sh`

Change: Introduce `ezpz_load_modules_sunspot` to mirror Aurora's module and XPU runtime setup for Sunspot, without the frameworks module or the Aurora-specific FI settings.
  • Add an `ezpz_load_modules_sunspot` function that loads the `oneapi/release/2025.3.1`, `hdf5`, and `pti-gpu` modules.
  • Export the same core XPU-related environment variables as Aurora (excluding `FI_MR_CACHE_MONITOR`).
Files: `src/ezpz/bin/utils.sh`

Change: Introduce `ezpz_load_modules_polaris` to reproduce the Polaris conda module's dependencies and environment variables without loading the conda module itself.
  • Add an `ezpz_load_modules_polaris` function that adjusts `MODULEPATH` and loads `PrgEnv-gnu`, `craype-x86-milan`, `cray-hdf5-parallel/1.14.3.5`, `cudnn/9.13.0`, and `gcc-native/14.2`.
  • Set compiler shim variables `CC`/`CXX` to the system `gcc-14`/`g++-14` for C/C++ extension builds.
  • Configure CUDA-related variables and `PATH`/`LD_LIBRARY_PATH` to point at CUDA 12.9.1 (including CUPTI), and set `TORCH_CUDA_ARCH_LIST` and `FLASHINFER_CUDA_ARCH_LIST` for SM 8.0.
  • Add NCCL and TensorRT paths to `PATH`/`LD_LIBRARY_PATH` to align with the Polaris conda module versions.
  • Export the ALCF HTTP/HTTPS proxy and MPI/JAX-related environment variables (`MPICH_GPU_SUPPORT_ENABLED`, `MPI4JAX_USE_CUDA_MPI`, `XLA_FLAGS`, `XLA_PYTHON_CLIENT_PREALLOCATE`).
Files: `src/ezpz/bin/utils.sh`
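
The "set default to `userfaultfd`" item above presumably uses the standard default-preserving export idiom, which fills in the value only when the variable is unset or empty:

```shell
# Keep any value the user already set; otherwise default to userfaultfd.
export FI_MR_CACHE_MONITOR="${FI_MR_CACHE_MONITOR:-userfaultfd}"
```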



@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a6f9b86c63


Comment thread: `src/ezpz/bin/utils.sh`

```bash
export CUDA_PATH="${CUDA_HOME}"
export CUDA_TOOLKIT_BASE="${CUDA_HOME}"
export PATH="${CUDA_HOME}/bin:${PATH}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH:-}"
```

P2: Avoid exporting an empty `LD_LIBRARY_PATH` entry

When LD_LIBRARY_PATH is unset, this assignment expands to a trailing : (empty path element), which makes the dynamic loader search the current working directory for shared libraries. In ezpz_load_modules_polaris, that can lead to unexpected library resolution (or loading unintended .so files) on clean shells before launching jobs, causing hard-to-debug runtime behavior. Build this string conditionally so no empty segment is introduced when the variable is initially absent.
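
One way to build the string conditionally, as this comment suggests; the helper name `prepend_ld_library_path` is hypothetical and not part of the PR:

```shell
# Prepend a directory to LD_LIBRARY_PATH without ever emitting an empty
# element (an empty element makes the loader search the current directory).
prepend_ld_library_path() {
    local dir="$1"
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        export LD_LIBRARY_PATH="${dir}:${LD_LIBRARY_PATH}"
    else
        export LD_LIBRARY_PATH="${dir}"
    fi
}
```

On a clean shell, `prepend_ld_library_path "${CUDA_HOME}/lib64"` would then set `LD_LIBRARY_PATH` to exactly that directory, with no trailing `:`.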


@sourcery-ai bot (Contributor) left a comment


Hey - I've found 1 issue and left some high-level feedback:

  • Aurora and Sunspot share identical module loads and XPU-related environment exports; consider factoring this into a single helper (e.g., ezpz_load_modules_xpu_common) to avoid future drift between the two stacks.
  • In ezpz_load_modules_polaris, adding ${_nccl}/include to PATH is unusual for header directories and may be unintended; you may want to export it via CPATH/C_INCLUDE_PATH instead or drop it if not needed.
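
The first suggestion could look something like this sketch. The helper name follows the reviewer's proposal, and the specific variable values are hypothetical placeholders (the PR names these variables but their values are not shown on this page):

```shell
# Hypothetical factoring of the shared Aurora/Sunspot logic.
# Variable values below are placeholders, not taken from the PR.
ezpz_load_modules_xpu_common() {
    module load oneapi/release/2025.3.1 hdf5 pti-gpu
    export ZE_FLAT_DEVICE_HIERARCHY="${ZE_FLAT_DEVICE_HIERARCHY:-FLAT}"
    export CCL_PROCESS_LAUNCHER="${CCL_PROCESS_LAUNCHER:-pmix}"
    export ONEAPI_DEVICE_SELECTOR="${ONEAPI_DEVICE_SELECTOR:-level_zero:*}"
}

ezpz_load_modules_aurora() {
    ezpz_load_modules_xpu_common
    # Aurora-only FI setting, kept out of the shared helper.
    export FI_MR_CACHE_MONITOR="${FI_MR_CACHE_MONITOR:-userfaultfd}"
}

ezpz_load_modules_sunspot() {
    ezpz_load_modules_xpu_common
}
```

Factoring this way keeps the two systems from drifting: any change to the shared XPU setup happens in one place, and only the Aurora-specific FI export lives outside the common helper.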

Comment thread: `src/ezpz/bin/utils.sh`

```bash
local _nccl="/soft/libraries/nccl/nccl_2.28.3-1+cuda12.9_x86_64"
local _trt="/soft/libraries/trt/TensorRT-10.13.3.9.Linux.x86_64-gnu.cuda-12.9"
export PATH="${_nccl}/include:${PATH}"
export LD_LIBRARY_PATH="${_nccl}/lib:${_trt}/lib:${LD_LIBRARY_PATH}"
```

Contributor

issue (bug_risk): Guard `LD_LIBRARY_PATH` expansion to avoid issues under `set -u`

Here `LD_LIBRARY_PATH` is referenced without guarding for the unset case, unlike earlier where `"${LD_LIBRARY_PATH:-}"` was used. Under `set -u` this will cause the function to fail. Please align this line with the earlier pattern:

```bash
export LD_LIBRARY_PATH="${_nccl}/lib:${_trt}/lib:${LD_LIBRARY_PATH:-}"
```
