
Commit eb0ffd7

rel-notes + version update (#166)
* rel-notes + version update
* update version
* update changelog wording
1 parent: 50e041a

3 files changed: 15 additions & 2 deletions


CHANGELOG.md

Lines changed: 12 additions & 0 deletions
@@ -1,6 +1,18 @@
 ## Latest Changes
 
+## 0.6.1 (2025-09-04)
+
+### Added
+- [Torch/JAX] Support for variable leading batch dimensions in triangle multiplicative update
+- [Torch/JAX] Triangle attention kernel support for additional input configs: all hidden_dim <= 32 and divisible by 4 for tf32/fp32, and all hidden_dim <= 128 and divisible by 8 for bf16/fp16. In the rare case that the kernel does not support an input config, it falls back to torch instead of erroring out.
+- [Torch/JAX] Tuned config for RTX PRO 6000 GPUs for triangle multiplicative update.
+- [JAX] vmap support for triangle multiplicative update and triangle attention
+- [Torch] Improved error reporting on import failure, with traceback information for the stack trace
+
+### Bug fix
+- [Torch/JAX] Fixed an illegal memory access stemming from int32 indexing for longer sequences in triangle multiplicative update and attention with pair bias.
 ## 0.6.0 (2025-08-11)
+- [JAX] Moved to using nondiff_argnums instead of nondiff_argnames to be compatible with older JAX versions
 
 ### Added
 - [Torch] New feature: Added `cuet.attention_pair_bias` (support for caching the pair bias tensor and further kernel acceleration coming soon; there may be API-related changes in the next release)
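
The input-config rule in the triangle attention entry above is mechanical, so it can be spelled out directly. The helper below is illustrative only; it is not part of the library, which applies this rule and the torch fallback internally:

```python
def kernel_supports(hidden_dim: int, dtype: str) -> bool:
    """Encode the 0.6.1 triangle attention support rule quoted above."""
    if dtype in ("tf32", "fp32"):
        return hidden_dim <= 32 and hidden_dim % 4 == 0
    if dtype in ("bf16", "fp16"):
        return hidden_dim <= 128 and hidden_dim % 8 == 0
    return False

assert kernel_supports(32, "fp32")       # kernel path
assert not kernel_supports(64, "fp32")   # unsupported config: library falls back to torch
assert kernel_supports(128, "bf16")      # kernel path
```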
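
A minimal sketch of the new [JAX] vmap entry, assuming the `cuex` alias for `cuequivariance_jax` used elsewhere in this commit; the exact argument list of `triangle_multiplicative_update` is an assumption here, so treat this as a shape-level illustration rather than the documented API:

```python
import jax
import jax.numpy as jnp
import cuequivariance_jax as cuex  # aliased as `cuex` in the docstrings


def update_one(pair):
    # pair: (seq, seq, hidden); hidden_dim must be a multiple of 32.
    # The keyword below is an assumed signature, not taken from this commit.
    return cuex.triangle_multiplicative_update(pair, direction="outgoing")


x = jnp.zeros((8, 64, 64, 128), dtype=jnp.bfloat16)  # leading batch axis of 8
y = jax.vmap(update_one)(x)  # 0.6.1 adds vmap support for this primitive
```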

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.6.1rc2
+0.6.1

cuequivariance_torch/cuequivariance_torch/primitives/triangle.py

Lines changed: 2 additions & 1 deletion
@@ -70,7 +70,7 @@ def triangle_attention(
     Notes:
         (1) Context is saved for the backward pass. You don't need to save it manually.
         (2) Kernel precision (fp32, bf16, fp16) is based on the input dtypes. For tf32, set it from torch's global scope.
-        (3) **Limitation**: Full FP32 is not supported for the backward pass. Please set `torch.backends.cuda.matmul.allow_tf32=True`.
+        (3) The triangle attention kernel supports all hidden_dim <= 32 and divisible by 4 for tf32/fp32, and all hidden_dim <= 128 and divisible by 8 for bf16/fp16. In the rare case that the kernel does not support an input config, it falls back to torch instead of erroring out.
 
     Example:
         >>> import torch

@@ -195,6 +195,7 @@ def triangle_multiplicative_update(
         (3) **Limitation**: Currently only supports hidden_dim values that are multiples of 32.
         (4) We have moved away from the default round-towards-zero (RZ) implementation to round-to-nearest (RN) for better tf32 accuracy in cuex.triangle_multiplicative_update. In rare circumstances, this may cause minor differences in the observed results.
         (5) When using torch.compile, use `cuequivariance_ops_torch.init_triton_cache()` to initialize the Triton cache before calling the torch-compiled triangle multiplicative update.
+        (6) Although the example demonstrates the most common case of one batch dimension, the API supports a variable number of leading batch dimensions.
 
     Example:
         >>> import torch
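
As a usage note for docstring item (2) above: precision selection is driven by the tensors you pass in, while tf32 is an opt-in global switch. `torch.backends.cuda.matmul.allow_tf32` is the standard PyTorch flag that the replaced docstring line also referenced:

```python
import torch

# bf16/fp16 kernels are selected simply by passing bf16/fp16 tensors.
# For fp32 inputs, opt into the tf32 path from torch's global scope:
torch.backends.cuda.matmul.allow_tf32 = True
```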
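
Docstring item (5) in sketch form; `init_triton_cache` is named in the docstring itself, while the top-level `cuet.triangle_multiplicative_update` export and the `torch.compile` wrapping are assumptions for illustration:

```python
import torch
import cuequivariance_ops_torch
import cuequivariance_torch as cuet

# Per docstring note (5): warm the Triton cache before the compiled call.
cuequivariance_ops_torch.init_triton_cache()
update_compiled = torch.compile(cuet.triangle_multiplicative_update)
```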
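
Docstring item (6) means calls need not be reshaped down to a single batch dimension first. A sketch under assumed argument names; only the shape flexibility is taken from the note:

```python
import torch
import cuequivariance_torch as cuet

# One batch dimension (the documented example case) ...
x1 = torch.randn(2, 64, 64, 128, device="cuda")
# ... and two leading batch dimensions, supported as of 0.6.1.
x2 = torch.randn(2, 3, 64, 64, 128, device="cuda")

# `direction="outgoing"` is an assumed keyword for illustration.
out1 = cuet.triangle_multiplicative_update(x1, direction="outgoing")
out2 = cuet.triangle_multiplicative_update(x2, direction="outgoing")
```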
