feat(grpo): context-parallel (CP) loss alignment and reduction by RUFFY-369 · Pull Request #66 · NousResearch/torchtitan

RUFFY-369 · 2026-04-02T10:58:11Z

Summary

This PR introduces Context-Parallel (CP) loss alignment for GRPO. When training with context parallelism (CP), it synchronizes the GRPO loss and advantage reductions across the ranks of the CP mesh so that the backward pass stays mathematically correct.

Technical Context

In a Context Parallel configuration, a single sequence is split across multiple GPUs to manage memory. A loss reduction that only sees the tokens of its local shard does not account for the segments held by other ranks, so each rank normalizes differently and the resulting gradients are inconsistent.

This implementation introduces a specialized CP-alignment layer that:

- Hooks into the ContextParallel dispatcher to correctly scale loss values by the CP degree.
- Synchronizes Advantage and KL-divergence terms across the CP mesh before they are used for weight updates.
- Ensures bit-exact gradient parity between CP and non-CP training runs.
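To make the reduction issue concrete, here is a short worked sketch. The symbols are illustrative and not taken from the PR: P is the CP degree, S_p is the set of token positions held by rank p, m_t in {0, 1} is the loss mask, and l_t is the per-token GRPO loss. It assumes the loss is a masked mean over tokens and that gradients are averaged across CP ranks; the PR text confirms neither.

```latex
\begin{aligned}
s_p &= \sum_{t \in S_p} m_t\,\ell_t, \qquad
n_p = \sum_{t \in S_p} m_t, \qquad
N = \sum_{p=1}^{P} n_p \\
\mathcal{L} &= \frac{\sum_{p=1}^{P} s_p}{N}
  \quad \text{(target: masked mean over the full sequence)} \\
\frac{1}{P}\sum_{p=1}^{P} \frac{s_p}{n_p} &\neq \mathcal{L}
  \quad \text{in general, unless all } n_p \text{ are equal} \\
\mathcal{L} &= \frac{1}{P}\sum_{p=1}^{P} \underbrace{\frac{P\,s_p}{N}}_{\text{per-rank loss}}
\end{aligned}
```

Under these assumptions each rank only needs the all-reduced token count N, and the factor P in the per-rank loss cancels the 1/P introduced by gradient averaging. This is one plausible reading of "scale loss values by the CP degree", not a description of the PR's actual code.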

Key Changes

- torchtitan/distributed/context_parallel.py: Added _enable_context_parallel_dispatcher and loss reduction hooks for GRPO.
- torchtitan/grpo/grpo_step.py: Updated the loss computation logic to be sequence-mesh aware.
- torchtitan/grpo/utils.py: Implemented masked_mean and masked_sum primitives that correctly handle CP-mesh boundaries.
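The idea behind CP-aware masked reductions can be illustrated with a minimal, framework-free sketch. It is not the code from torchtitan/grpo/utils.py: the function names below other than masked_sum and masked_mean are hypothetical, the signatures are assumptions, and the CP ranks are simulated with plain Python lists. In a real run the `all_reduce_sum` stand-in would be a sum all-reduce over the CP process group (for example via `torch.distributed.all_reduce`).

```python
from typing import List, Sequence, Tuple


def masked_sum(values: Sequence[float], mask: Sequence[int]) -> Tuple[float, int]:
    """Local partial result: (sum of unmasked values, number of unmasked tokens)."""
    total = sum(v for v, m in zip(values, mask) if m)
    count = sum(1 for m in mask if m)
    return total, count


def all_reduce_sum(partials: List[Tuple[float, int]]) -> Tuple[float, int]:
    """Stand-in for a sum all-reduce across the CP mesh (simulated, no GPUs)."""
    return sum(p[0] for p in partials), sum(p[1] for p in partials)


def masked_mean(shard_values: List[Sequence[float]],
                shard_masks: List[Sequence[int]]) -> float:
    """Masked mean over the full sequence: reduce sums and counts, then divide once."""
    partials = [masked_sum(v, m) for v, m in zip(shard_values, shard_masks)]
    total, count = all_reduce_sum(partials)
    return total / count if count else 0.0


def naive_mean_of_local_means(shard_values: List[Sequence[float]],
                              shard_masks: List[Sequence[int]]) -> float:
    """The pitfall: each shard divides by its own count before averaging."""
    local_means = []
    for v, m in zip(shard_values, shard_masks):
        total, count = masked_sum(v, m)
        local_means.append(total / count if count else 0.0)
    return sum(local_means) / len(local_means)


# One sequence of 4 tokens split across 2 simulated CP ranks.
# The last token is padding, so the shards hold 2 and 1 valid tokens.
values = [[1.0, 2.0], [3.0, 4.0]]
masks = [[1, 1], [1, 0]]

aligned = masked_mean(values, masks)              # (1 + 2 + 3) / 3
naive = naive_mean_of_local_means(values, masks)  # (1.5 + 3.0) / 2
print(aligned, naive)
```

The two results differ whenever shards hold different numbers of valid tokens, which is the usual case once padding or prompt masking is involved.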

Modernization & Compatibility

This PR also updates the code base so that it runs on PyTorch 2.5.1 and later.

- Backward Compatible: Uses try...except and version guards to remain fully compatible with the existing PyTorch 2.3/2.4 baseline in the dev-updated-again fork.
- Experimental Namespace Support: Aligns with the refactored torch.distributed.tensor.experimental._attention namespace in PyTorch 2.5+.
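The try...except and version-guard pattern mentioned above could look roughly like the following sketch. The helper names `import_first_available` and `version_at_least` are hypothetical and not from the PR. The first module path is the one named in this description; the second, older path is an assumption about where the module lived before the refactor and should be checked against the installed PyTorch.

```python
import importlib
import re
from types import ModuleType
from typing import Optional, Sequence, Tuple

# Candidate locations, newest first. The second entry is an ASSUMED legacy path.
CP_ATTENTION_CANDIDATES = (
    "torch.distributed.tensor.experimental._attention",
    "torch.distributed._tensor.experimental.attention",
)


def import_first_available(candidates: Sequence[str]) -> Optional[ModuleType]:
    """Return the first importable module from `candidates`, or None."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:  # also covers ModuleNotFoundError
            continue
    return None


def version_at_least(version: str, minimum: Tuple[int, ...]) -> bool:
    """Compare a version string such as '2.5.1+cu121' against a minimum tuple."""
    core = version.split("+")[0]  # drop local build suffix
    parts = []
    for piece in core.split(".")[: len(minimum)]:
        match = re.match(r"\d+", piece)  # leading digits only, e.g. '0a0' -> 0
        parts.append(int(match.group()) if match else 0)
    while len(parts) < len(minimum):
        parts.append(0)
    return tuple(parts) >= tuple(minimum)


if __name__ == "__main__":
    module = import_first_available(CP_ATTENTION_CANDIDATES)
    if module is None:
        print("No context-parallel attention module found; CP hooks disabled.")
    else:
        print(f"Using {module.__name__}")
```

Resolving the module by probing import paths avoids hard-coding a version cutoff; the version check is kept as a separate helper for cases where behavior, not just location, changed between releases.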

Verification Results (vast.ai)

- Hardware Profile: Verified on a vast.ai cluster with 2x RTX 3090 GPUs (24GB VRAM).
- Scale: Tested with CP degree 2, verifying that sequence-split training produces weights identical to those from non-CP (unsplit sequence) training.
- Tests: Successfully ran scripts/verify_grpo_2gpu.sh, confirming that CP-mesh synchronization completes without deadlocks.
- Cluster Stability: Verified that CP collectives for loss alignment do not introduce memory fragmentation on 24GB VRAM units.


RUFFY-369 added 14 commits March 31, 2026 12:32

[infra] feat: asynchronous vLLM weight syncing with background threading

d6af7f3

Refactor: Sync production-grade Async Weight Updater and Job Config f…

45374bf

…rom integration branch

Integration: Sync refined Asynchronous Weight Update injection from i…

2efbe59

…ntegration branch

Test: Include 2-GPU smoke test script

c68e348

Test: Include GRPO smoke test config

7fa62f6

Modernize for PT 2.5.1 (RECONSTRUCTED): Consolidated 2x RTX 3090 mode…

9d09187

…rnization baseline

chore(grpo): purify PyTorch 2.5.1 baseline from vllm-async logic

0493e4c

feat(grpo): re-inject Compute-Parallel alignment to sequence loss sca…

429766a

…ling

chore(grpo): sanitize cp-alignment branch for upstream compatibility

8d0e5da

chore(grpo): remove redundant modernization patch file for clean PR

236e3d9

chore(grpo): remove redundant .rej file residue

db3e793

chore(grpo): systematic sanitization of CP alignment infrastructure

64e8983

- Purged AI-generated Unicode separators and ASCII decorative boxes. - Removed conversational fillers and redundant documentation artifacts. - Standardized indentation and modernized technical documentation.

surgical sanitization: removed developer artifacts in data_handling.py

12acffe

Revert "surgical sanitization: removed developer artifacts in data_ha…

a9bc073

…ndling.py" This reverts commit 12acffe.

RUFFY-369 wants to merge 14 commits into NousResearch:dev-updated-again from RUFFY-369:infra/grpo-cp-alignment
