Skip to content

fix(dynamics): mask stale neighbor matrix entries before COO conversion#54

Merged
laserkelvin merged 6 commits intoNVIDIA:mainfrom
laserkelvin:fix-stale-coo-neighbors
Apr 10, 2026
Merged

fix(dynamics): mask stale neighbor matrix entries before COO conversion#54
laserkelvin merged 6 commits intoNVIDIA:mainfrom
laserkelvin:fix-stale-coo-neighbors

Conversation

@laserkelvin
Copy link
Copy Markdown
Collaborator

Summary

  • Fix stale edges in COO neighbor list output when neighbor counts shrink between Verlet skin rebuilds
  • Cache _col_range tensor to avoid per-step allocation in _update_edges_group
  • Add TestStaleCOOEntries regression test covering the shrinking-neighbors scenario

Problem

When NeighborListHook uses a Verlet skin for selective rebuilds, the upstream warp kernel (_update_neighbor_matrix) only writes slots 0..num_neighbors[i]-1 in the neighbor matrix. If an atom's neighbor count decreases between rebuilds, the old entries in neighbor_matrix[i, new_count:old_count] are never overwritten with fill_value — they retain stale atom indices from the previous build.

get_neighbor_list_from_neighbor_matrix uses neighbor_matrix != fill_value to identify valid entries, so these stale indices pass the mask and appear as phantom edges in the COO output. This causes energy/force mismatches during MD compared to ASE + model.

Fix

In _update_edges_group, before calling get_neighbor_list_from_neighbor_matrix, create a boolean mask col_range >= num_neighbors[i] and overwrite stale slots with fill_value in-place. The col_range tensor is cached in _alloc_output_tensors alongside the neighbor matrix buffers.

The in-place overwrite is safe because the next rebuild will rewrite all valid slots via atomic_add.

Test

TestStaleCOOEntries sets up a 3-atom line system, runs an initial rebuild, then moves an atom far enough to trigger a Verlet skin rebuild with fewer neighbors. It verifies:

  1. No stale edges remain in the COO edge_index
  2. Per-atom COO edge counts match the internal num_neighbors buffer

When neighbor counts shrink between Verlet skin rebuilds, the upstream
kernel leaves stale atom indices in unused neighbor_matrix slots. The
COO converter's fill_value mask picks these up as valid edges, causing
phantom edges that corrupt energy/force calculations.

Before calling get_neighbor_list_from_neighbor_matrix, overwrite slots
beyond num_neighbors[i] with fill_value in-place. Cache the column
range tensor to avoid per-step allocation.
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@laserkelvin laserkelvin added the bug Something isn't working label Apr 8, 2026
@laserkelvin
Copy link
Copy Markdown
Collaborator Author

/ok to test d8824dd

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR fixes phantom COO edges caused by stale entries in the neighbor matrix when neighbor counts shrink between Verlet skin rebuilds. The fix in _update_edges_group masks all slots ≥ num_neighbors[i] with fill_value in-place before passing the matrix to get_neighbor_list_from_neighbor_matrix, and caches _col_range to avoid per-step allocation. A regression test covering the shrinking-neighbor scenario is included.

Important Files Changed

Filename Overview
nvalchemi/hooks/neighbor_list.py Core fix: in-place stale-entry masking added to _update_edges_group before COO conversion; _col_range properly allocated in _alloc_output_tensors and initialised to None in __init__.
test/dynamics/test_neighbor_list_hook.py New TestStaleCOOEntries class correctly validates the shrinking-neighbor scenario and verifies COO edge counts against the internal _num_neighbors buffer.

Reviews (5): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile

Comment thread nvalchemi/hooks/neighbor_list.py
Match the pattern used by all other internal buffers (_neighbor_matrix,
_num_neighbors, _neighbor_shifts, _rebuild_flags) to prevent AttributeError
if the attribute is inspected before the first _alloc_output_tensors call.
@laserkelvin
Copy link
Copy Markdown
Collaborator Author

/ok to test 0f6d192

@laserkelvin laserkelvin requested a review from zubatyuk April 8, 2026 15:19
@laserkelvin
Copy link
Copy Markdown
Collaborator Author

/ok to test d767402

@laserkelvin laserkelvin enabled auto-merge April 9, 2026 02:47
…name

The merge with main brought attribute renames (_neighbor_shifts ->
_neighbor_matrix_shifts, edge_index -> neighbor_list) that left two
references using the old names, causing an AttributeError in
_update_edges_group and TestStaleCOOEntries.
@laserkelvin
Copy link
Copy Markdown
Collaborator Author

/ok to test 352a160

@laserkelvin laserkelvin added this pull request to the merge queue Apr 10, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Apr 10, 2026
@laserkelvin laserkelvin enabled auto-merge April 10, 2026 05:15
@laserkelvin
Copy link
Copy Markdown
Collaborator Author

/ok to test 57883de

@laserkelvin laserkelvin added this pull request to the merge queue Apr 10, 2026
Merged via the queue into NVIDIA:main with commit bb2e1d5 Apr 10, 2026
8 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants