adding NVIDIA_TF32_OVERRIDE=0 to test_numerics.py #3014
francesco-bertolotti wants to merge 1 commit into
Greptile Summary
This PR adds NVIDIA_TF32_OVERRIDE=0 to the test_numerics.py invocation.
Confidence Score: 4/5
Safe to merge: the change is a one-line addition of a well-understood environment variable that forces FP32 precision, consistent with how other numerics-sensitive tests in the same script are already configured.
The change is minimal and follows an established pattern in the file. The only gap is a missing inline comment explaining the rationale, which the test_mhc.py line directly below already has. No functional risk is introduced.
No files require special attention. test_cuda_graphs.py shares the same determinism flags but does not get NVIDIA_TF32_OVERRIDE=0; this may be intentional but is worth a quick sanity check.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[test.sh] --> B[test_numerics.py NVIDIA_TF32_OVERRIDE=0 NEW]
A --> C[test_cuda_graphs.py no NVIDIA_TF32_OVERRIDE]
A --> D[test_mhc.py NVIDIA_TF32_OVERRIDE=0 + inline comment]
B -->|disables TF32 on Ampere+| E[full FP32 precision]
D -->|same effect| E
  python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_custom_recipe.xml $TE_PATH/tests/pytorch/test_custom_recipe.py || test_fail "test_custom_recipe.py"
  python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_deferred_init.xml $TE_PATH/tests/pytorch/test_deferred_init.py || test_fail "test_deferred_init.py"
- PYTORCH_JIT=0 NVTE_TORCH_COMPILE=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 NVTE_FUSED_ATTN=0 python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_numerics.xml $TE_PATH/tests/pytorch/test_numerics.py || test_fail "test_numerics.py"
+ PYTORCH_JIT=0 NVTE_TORCH_COMPILE=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 NVTE_FUSED_ATTN=0 NVIDIA_TF32_OVERRIDE=0 python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_numerics.xml $TE_PATH/tests/pytorch/test_numerics.py || test_fail "test_numerics.py"
The test_mhc.py line directly below carries an inline comment explaining why NVIDIA_TF32_OVERRIDE=0 is needed. Adding a similar comment here would keep the rationale self-documented and help future readers understand why this flag is required for numerical tests.
- PYTORCH_JIT=0 NVTE_TORCH_COMPILE=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 NVTE_FUSED_ATTN=0 NVIDIA_TF32_OVERRIDE=0 python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_numerics.xml $TE_PATH/tests/pytorch/test_numerics.py || test_fail "test_numerics.py"
+ # Disable TF32 path to fully align with the pytorch reference implementation's precision (avoids layer norm numerical mismatches on Ampere+)
+ PYTORCH_JIT=0 NVTE_TORCH_COMPILE=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 NVTE_FUSED_ATTN=0 NVIDIA_TF32_OVERRIDE=0 python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_numerics.xml $TE_PATH/tests/pytorch/test_numerics.py || test_fail "test_numerics.py"
I do not know if it helps, but these are the tests that fail without the TF32 env flag:
FAILED tests/pytorch/test_numerics.py::test_linear_accuracy[True-True-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=0. Maximum difference at location [50] with -3.5490684509277344 vs -3.555588245391...
FAILED tests/pytorch/test_numerics.py::test_linear_accuracy[True-True-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=0. Maximum difference at location [61] with 2.6603221893310547 vs 2.65149784088134...
FAILED tests/pytorch/test_numerics.py::test_linear_accuracy[True-False-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=0. Maximum difference at location [50] with -3.5490684509277344 vs -3.555588245391...
FAILED tests/pytorch/test_numerics.py::test_linear_accuracy[True-False-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=0. Maximum difference at location [61] with 2.6603221893310547 vs 2.65149784088134...
FAILED tests/pytorch/test_numerics.py::test_linear_accuracy[False-True-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=0. Maximum difference at location [50] with -3.5490684509277344 vs -3.555588245391...
FAILED tests/pytorch/test_numerics.py::test_linear_accuracy[False-True-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=0. Maximum difference at location [61] with 2.6603221893310547 vs 2.65149784088134...
FAILED tests/pytorch/test_numerics.py::test_linear_accuracy[False-False-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=0. Maximum difference at location [50] with -3.5490684509277344 vs -3.555588245391...
FAILED tests/pytorch/test_numerics.py::test_linear_accuracy[False-False-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=0. Maximum difference at location [61] with 2.6603221893310547 vs 2.65149784088134...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-True-True-LayerNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=113. Maximum difference at location [0, 165] with 0.04553138092160225 vs 0.0458293...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-True-True-LayerNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=11. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-True-True-RMSNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=56. Maximum difference at location [0, 209] with 0.0030347853899002075 vs 0.003319...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-True-True-RMSNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=28. Maximum difference at location [1, 421] with -0.017969021573662758 vs -0.01827...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-True-False-LayerNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=22. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-True-False-LayerNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=11. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-True-False-RMSNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=56. Maximum difference at location [0, 209] with 0.0030347853899002075 vs 0.003319...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-True-False-RMSNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=28. Maximum difference at location [1, 421] with -0.017969021573662758 vs -0.01827...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-False-True-LayerNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=22. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-False-True-LayerNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=11. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-False-True-RMSNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=56. Maximum difference at location [0, 209] with 0.0030347853899002075 vs 0.003319...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-False-True-RMSNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=28. Maximum difference at location [1, 421] with -0.017969021573662758 vs -0.01827...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-False-False-LayerNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=22. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-False-False-LayerNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=11. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-False-False-RMSNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=56. Maximum difference at location [0, 209] with 0.0030347853899002075 vs 0.003319...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[True-False-False-RMSNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=28. Maximum difference at location [1, 421] with -0.017969021573662758 vs -0.01827...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-True-True-LayerNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=22. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-True-True-LayerNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=11. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-True-True-RMSNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=56. Maximum difference at location [0, 209] with 0.0030347853899002075 vs 0.003319...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-True-True-RMSNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=28. Maximum difference at location [1, 421] with -0.017969021573662758 vs -0.01827...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-True-False-LayerNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=22. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-True-False-LayerNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=11. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-True-False-RMSNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=56. Maximum difference at location [0, 209] with 0.0030347853899002075 vs 0.003319...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-True-False-RMSNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=28. Maximum difference at location [1, 421] with -0.017969021573662758 vs -0.01827...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-False-True-LayerNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=22. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-False-True-LayerNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=11. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-False-True-RMSNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=56. Maximum difference at location [0, 209] with 0.0030347853899002075 vs 0.003319...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-False-True-RMSNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=28. Maximum difference at location [1, 421] with -0.017969021573662758 vs -0.01827...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-False-False-LayerNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=22. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-False-False-LayerNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=11. Maximum difference at location [0, 21] with 0.029382700100541115 vs 0.02909549...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-False-False-RMSNorm-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=56. Maximum difference at location [0, 209] with 0.0030347853899002075 vs 0.003319...
FAILED tests/pytorch/test_numerics.py::test_layernorm_linear_accuracy[False-False-False-RMSNorm-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=28. Maximum difference at location [1, 421] with -0.017969021573662758 vs -0.01827...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-True-LayerNorm-relu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=14. Maximum difference at location [0, 99] with -0.07819076627492905 vs -0.1000889...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-True-LayerNorm-relu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=7. Maximum difference at location [0, 99] with -0.07819075882434845 vs -0.10008895...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-True-LayerNorm-reglu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=105. Maximum difference at location [75] with -0.04723335802555084 vs -0.014243453...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-True-LayerNorm-reglu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=22. Maximum difference at location [0] with -1.3210333585739136 vs -1.299375057220...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-True-RMSNorm-relu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=50. Maximum difference at location [100] with 0.18327626585960388 vs 0.38112670183...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-True-RMSNorm-relu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=121. Maximum difference at location [1, 15] with -0.0071725500747561455 vs 0.01547...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-True-RMSNorm-reglu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=351. Maximum difference at location [77] with -0.04859239235520363 vs -0.076162673...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-True-RMSNorm-reglu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=263. Maximum difference at location [56] with 0.03622261434793472 vs 0.12370993196...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-False-LayerNorm-relu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=14. Maximum difference at location [0, 99] with -0.07819076627492905 vs -0.1000889...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-False-LayerNorm-relu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=7. Maximum difference at location [0, 99] with -0.07819075882434845 vs -0.10008895...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-False-LayerNorm-reglu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=105. Maximum difference at location [75] with -0.04723335802555084 vs -0.014243453...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-False-LayerNorm-reglu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=22. Maximum difference at location [0] with -1.3210333585739136 vs -1.299375057220...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-False-RMSNorm-relu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=50. Maximum difference at location [100] with 0.18327626585960388 vs 0.38112670183...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-False-RMSNorm-relu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=121. Maximum difference at location [1, 15] with -0.0071725500747561455 vs 0.01547...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-False-RMSNorm-reglu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=351. Maximum difference at location [77] with -0.04859239235520363 vs -0.076162673...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[True-False-RMSNorm-reglu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=263. Maximum difference at location [56] with 0.03622261434793472 vs 0.12370993196...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-True-LayerNorm-relu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=14. Maximum difference at location [0, 99] with -0.07819076627492905 vs -0.1000889...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-True-LayerNorm-relu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=7. Maximum difference at location [0, 99] with -0.07819075882434845 vs -0.10008895...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-True-LayerNorm-reglu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=105. Maximum difference at location [75] with -0.04723335802555084 vs -0.014243453...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-True-LayerNorm-reglu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=22. Maximum difference at location [0] with -1.3210333585739136 vs -1.299375057220...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-True-RMSNorm-relu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=50. Maximum difference at location [100] with 0.18327626585960388 vs 0.38112670183...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-True-RMSNorm-relu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=121. Maximum difference at location [1, 15] with -0.0071725500747561455 vs 0.01547...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-True-RMSNorm-reglu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=351. Maximum difference at location [77] with -0.04859239235520363 vs -0.076162673...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-True-RMSNorm-reglu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=263. Maximum difference at location [56] with 0.03622261434793472 vs 0.12370993196...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-False-LayerNorm-relu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=14. Maximum difference at location [0, 99] with -0.07819076627492905 vs -0.1000889...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-False-LayerNorm-relu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=7. Maximum difference at location [0, 99] with -0.07819075882434845 vs -0.10008895...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-False-LayerNorm-reglu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=105. Maximum difference at location [75] with -0.04723335802555084 vs -0.014243453...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-False-LayerNorm-reglu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=22. Maximum difference at location [0] with -1.3210333585739136 vs -1.299375057220...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-False-RMSNorm-relu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=50. Maximum difference at location [100] with 0.18327626585960388 vs 0.38112670183...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-False-RMSNorm-relu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=121. Maximum difference at location [1, 15] with -0.0071725500747561455 vs 0.01547...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-False-RMSNorm-reglu-small-1-dtype0] - AssertionError: Outputs not close enough in tensor at idx=351. Maximum difference at location [77] with -0.04859239235520363 vs -0.076162673...
FAILED tests/pytorch/test_numerics.py::test_layernorm_mlp_accuracy[False-False-RMSNorm-reglu-small-2-dtype0] - AssertionError: Outputs not close enough in tensor at idx=263. Maximum difference at location [56] with 0.03622261434793472 vs 0.12370993196...
PR split from #3013
I have added NVIDIA_TF32_OVERRIDE=0 to the test_numerics.py invocation; without it, tests fail because of small numerical mismatches with layer norms. The same has already been done for test_mhc.py.
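For readers less familiar with the `VAR=value command` form used in the script: the flag is set as a prefix on the pytest command, so it should apply to that one invocation only and leave the other test lines untouched. Below is a minimal sketch of that shell behavior. It uses `sh -c` as a stand-in for the pytest command, which is an assumption made purely for illustration (no GPU or Transformer Engine is involved), and the variable names `with_flag` and `without_flag` are hypothetical.

```shell
#!/bin/sh
# Sketch only: `sh -c` stands in for the real pytest invocation.

# Start from a clean state so an exported value in the caller's
# environment does not leak into the demonstration.
unset NVIDIA_TF32_OVERRIDE

# With the prefix: the child process sees NVIDIA_TF32_OVERRIDE=0.
with_flag=$(NVIDIA_TF32_OVERRIDE=0 sh -c 'printf "%s" "${NVIDIA_TF32_OVERRIDE:-unset}"')

# Without the prefix: the next command does not inherit the flag.
without_flag=$(sh -c 'printf "%s" "${NVIDIA_TF32_OVERRIDE:-unset}"')

echo "with prefix:    NVIDIA_TF32_OVERRIDE=$with_flag"
echo "without prefix: NVIDIA_TF32_OVERRIDE=$without_flag"
```

This is why each numerics-sensitive line in the script has to carry the flag itself, and why a line without it (such as the test_cuda_graphs.py one mentioned in the review summary) keeps the default behavior.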