fix(dynamics): mask stale neighbor matrix entries before COO conversion#54
Merged
laserkelvin merged 6 commits intoNVIDIA:mainfrom Apr 10, 2026
Merged
Conversation
When neighbor counts shrink between Verlet skin rebuilds, the upstream kernel leaves stale atom indices in unused neighbor_matrix slots. The COO converter's fill_value mask picks these up as valid edges, causing phantom edges that corrupt energy/force calculations. Before calling get_neighbor_list_from_neighbor_matrix, overwrite slots beyond num_neighbors[i] with fill_value in-place. Cache the column range tensor to avoid per-step allocation.
Collaborator
Author
|
/ok to test d8824dd |
Contributor
Greptile SummaryThis PR fixes phantom COO edges caused by stale entries in the neighbor matrix when neighbor counts shrink between Verlet skin rebuilds. The fix in Important Files Changed
Reviews (5): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile |
Match the pattern used by all other internal buffers (_neighbor_matrix, _num_neighbors, _neighbor_shifts, _rebuild_flags) to prevent AttributeError if the attribute is inspected before the first _alloc_output_tensors call.
Collaborator
Author
|
/ok to test 0f6d192 |
Collaborator
Author
|
/ok to test d767402 |
…name The merge with main brought attribute renames (_neighbor_shifts -> _neighbor_matrix_shifts, edge_index -> neighbor_list) that left two references using the old names, causing an AttributeError in _update_edges_group and TestStaleCOOEntries.
Collaborator
Author
|
/ok to test 352a160 |
dallasfoster
approved these changes
Apr 10, 2026
Collaborator
Author
|
/ok to test 57883de |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
_col_rangetensor to avoid per-step allocation in_update_edges_groupTestStaleCOOEntriesregression test covering the shrinking-neighbors scenarioProblem
When
NeighborListHookuses a Verlet skin for selective rebuilds, the upstream warp kernel (_update_neighbor_matrix) only writes slots0..num_neighbors[i]-1in the neighbor matrix. If an atom's neighbor count decreases between rebuilds, the old entries inneighbor_matrix[i, new_count:old_count]are never overwritten withfill_value— they retain stale atom indices from the previous build.get_neighbor_list_from_neighbor_matrixusesneighbor_matrix != fill_valueto identify valid entries, so these stale indices pass the mask and appear as phantom edges in the COO output. This causes energy/force mismatches during MD compared to ASE + model.Fix
In
_update_edges_group, before callingget_neighbor_list_from_neighbor_matrix, create a boolean maskcol_range >= num_neighbors[i]and overwrite stale slots withfill_valuein-place. Thecol_rangetensor is cached in_alloc_output_tensorsalongside the neighbor matrix buffers.The in-place overwrite is safe because the next rebuild will rewrite all valid slots via
atomic_add.Test
TestStaleCOOEntriessets up a 3-atom line system, runs an initial rebuild, then moves an atom far enough to trigger a Verlet skin rebuild with fewer neighbors. It verifies:edge_indexnum_neighborsbuffer