
Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers

Denoising diffusion models exhibit remarkable generative capabilities, but they remain challenging to train due to their inherent stochasticity: high-variance gradient estimates lead to slow convergence. Previous work has shown that magnitude preservation helps stabilize training in the U-Net architecture. This work explores whether that effect extends to the Diffusion Transformer (DiT) architecture. To that end, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, a novel conditioning method that uses learned rotations instead of the traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by $\sim$12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN while requiring $\sim$5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control.
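
Because a rotation is an orthogonal transform, it leaves activation norms untouched, which is what makes it attractive as a magnitude-friendly alternative to the scale-and-shift of AdaLN. As a rough sketch of the general idea only (the parameterization below is an assumption, not necessarily the one used in this work), one could predict one angle per channel pair from the conditioning vector and apply a 2D rotation to each pair:

import torch
import torch.nn as nn

class RotationModulation(nn.Module):
    """Sketch: condition a DiT block by rotating pairs of feature channels
    with angles predicted from the conditioning vector. Rotations are
    orthogonal, so the per-token activation norm is preserved exactly."""

    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        assert hidden_dim % 2 == 0, "needs an even number of channels"
        # Hypothetical parameterization: one angle per channel pair.
        self.to_angles = nn.Linear(cond_dim, hidden_dim // 2)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, hidden_dim), cond: (batch, cond_dim)
        theta = self.to_angles(cond).unsqueeze(1)   # (batch, 1, hidden_dim // 2)
        cos, sin = torch.cos(theta), torch.sin(theta)
        x1, x2 = x[..., 0::2], x[..., 1::2]         # split into channel pairs
        out = torch.empty_like(x)
        out[..., 0::2] = cos * x1 - sin * x2        # 2D (Givens) rotation per pair
        out[..., 1::2] = sin * x1 + cos * x2
        return out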

Fig 1. DiT-S/4 samples without (left) and with (right) magnitude preserving layers.

This project builds upon key concepts from the following research papers:

  • Peebles & Xie (2023) explore the application of transformer architectures to diffusion models, achieving state-of-the-art performance on various generation tasks;
  • Karras et al. (2024) introduce the idea of preserving the magnitude of features during the diffusion process, enhancing the stability and quality of generated outputs.
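
For context, the magnitude-preserving recipe of Karras et al. (2024) removes explicit normalization layers and instead keeps weight vectors at unit norm and rescales nonlinearities so that activations retain roughly unit magnitude. A simplified sketch of two such building blocks follows; the reference EDM2 implementation differs in details (for example, the normalized weights are also copied back into the parameters during training, the so-called forced weight normalization):

import torch
import torch.nn as nn
import torch.nn.functional as F

def mp_silu(x: torch.Tensor) -> torch.Tensor:
    # SiLU rescaled so that a unit-magnitude input yields a roughly
    # unit-magnitude output (constant from Karras et al., 2024).
    return F.silu(x) / 0.596

class MPLinear(nn.Module):
    """Sketch of a magnitude-preserving linear layer: every output
    neuron's weight vector is normalized to unit norm on each forward
    pass, so the layer cannot amplify activations by growing its weights."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight / self.weight.norm(dim=1, keepdim=True).clamp_min(1e-4)
        return F.linear(x, w)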

🚧 Code Status: Work in Progress

We're actively developing this repo. Contributions and feedback are welcome!

Training

python train.py --data-path /path/to/data --results-dir /path/to/results --model DiT-S/2 --num-steps 400_000 <magnitude-preservation flags>
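
For example, a run that enables several of the magnitude-preservation flags documented below (paths are placeholders, and the flag combination is only illustrative) could look like:

python train.py --data-path /path/to/data --results-dir /path/to/results --model DiT-S/2 --num-steps 400_000 \
    --use-cosine-attention --use-weight-normalization --use-mp-residual --use-mp-silu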

Magnitude Preservation Flags

Customize the training process by enabling the following flags:

  • --use-cosine-attention - Controls weight growth in attention layers.
  • --use-weight-normalization - Applies magnitude preservation in linear layers.
  • --use-forced-weight-normalization - Controls weight growth in linear layers.
  • --use-mp-residual - Enables magnitude preservation in residual connections (see the sketch after this list).
  • --use-mp-silu - Uses a magnitude-preserving version of SiLU nonlinearity.
  • --use-no-layernorm - Disables transformer layer normalization.
  • --use-mp-pos-enc - Activates magnitude-preserving positional encoding.
  • --use-mp-embedding - Uses magnitude-preserving embeddings.
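
To make the residual flag concrete: following the magnitude-preserving sum of Karras et al. (2024), a residual connection x + f(x) can be replaced by a weighted blend that is rescaled so the output keeps unit magnitude when both branches have it (assuming roughly uncorrelated branches). A minimal sketch, not necessarily this repository's exact implementation:

import torch

def mp_sum(a: torch.Tensor, b: torch.Tensor, t: float = 0.3) -> torch.Tensor:
    # Blend skip branch `a` and residual branch `b` with weight `t`,
    # then rescale so unit-magnitude inputs give a unit-magnitude output.
    return ((1 - t) * a + t * b) / ((1 - t) ** 2 + t ** 2) ** 0.5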

Sampling

python sample.py --result-dir /path/to/results/<dir> --class-label <class label>

Citation

@misc{bill2025exploringmagnitudepreservationrotation,
      title={Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers}, 
      author={Eric Tillman Bill and Cristian Perez Jensen and Sotiris Anagnostidis and Dimitri von Rütte},
      year={2025},
      eprint={2505.19122},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.19122}, 
}