Skip to content

Model Refactor#66

Merged
dallasfoster merged 18 commits intoNVIDIA:mainfrom
dallasfoster:dallasf/model-refactor
Apr 10, 2026
Merged

Model Refactor#66
dallasfoster merged 18 commits intoNVIDIA:mainfrom
dallasfoster:dallasf/model-refactor

Conversation

@dallasfoster
Copy link
Copy Markdown
Collaborator

ALCHEMI Toolkit Pull Request

Description

Refactor the model subsystem to unify ModelCard and ModelConfig into a single ModelConfig with capability (frozen) and
runtime (mutable) fields, replace ComposableModelWrapper with a pipeline-based composition system (PipelineModelWrapper),
remove the model registry, and add first-class AIMNet2 support.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Performance improvement
  • Documentation update
  • Refactoring (no functional changes)
  • CI/CD or infrastructure change

Related Issues

Changes Made

  • Unified ModelConfig: Merged ModelCard (capability declaration) and ModelConfig (runtime control) into a single
    ModelConfig class. Capability fields use frozenset to signal immutability; runtime fields (active_outputs, gradient_keys)
    remain mutable. All model wrappers updated to the new interface.
  • Pipeline composition: Replaced ComposableModelWrapper with PipelineModelWrapper, PipelineGroup, and PipelineStep.
    Groups define mini-pipelines with independent derivative strategies; the + operator provides simple additive composition.
    Supports dependent model chains (e.g., AIMNet2 charges wired into Ewald via wire mappings).
  • Removed model registry: Deleted nvalchemi/models/registry.py and associated tests.
  • New AIMNet2Wrapper: Added nvalchemi/models/aimnet2.py with full AIMNet2 model support.
  • New nvalchemi.neighbors module: Standalone compute_neighbors() function for one-shot neighbor list construction outside
    dynamics loops.
  • New nvalchemi/models/_utils.py: Extracted shared utilities (autograd derivative computation, stress/force helpers) used
    across model wrappers.
  • Updated all model wrappers (DemoModelWrapper, DFTD3ModelWrapper, EwaldModelWrapper, LennardJonesModelWrapper,
    MACEWrapper, PMEModelWrapper) to use the new ModelConfig interface and BaseModelMixin contract.
  • Updated NeighborListHook to work with the refactored ModelConfig.neighbor_config.
  • Reorganized tests: Split monolithic test_physical_models.py into per-model test files (test_ewald.py, test_pme.py,
    test_dftd3.py, test_aimnet2.py, test_pipeline.py). Added test_base.py for ModelConfig and BaseModelMixin unit tests.
  • Updated all examples and docs to use the new API.

Testing

  • Unit tests pass locally (make pytest)
  • Linting passes (make lint)
  • New tests added for new functionality meets coverage expectations?

Checklist

  • I have read and understand the Contributing Guidelines
  • I have updated the CHANGELOG.md
  • I have performed a self-review of my code
  • I have added docstrings to new functions/classes
  • I have updated the documentation (if applicable)

Additional Notes

Breaking changes for downstream users:

  • ModelCard no longer exists — use ModelConfig with outputs, autograd_outputs, etc. instead of supports_* /
    forces_via_autograd booleans.
  • ComposableModelWrapper removed — migrate to PipelineModelWrapper or use the model1 + model2 operator.
  • model_config.compute_forces, compute_stresses, etc. replaced by model_config.active_outputs = {"energy", "forces", ...}.
  • Model registry (register_model, list_foundation_models, etc.) removed.

Tip

This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor is it an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@dallasfoster dallasfoster requested a review from laserkelvin April 9, 2026 23:11
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 9, 2026

Greptile Summary

Large architectural refactor unifying ModelCard+ModelConfig into a single ModelConfig, replacing ComposableModelWrapper with a pipeline-based composition system, removing the model registry, and adding first-class AIMNet2Wrapper support. The previous round of P0/P1 issues has been addressed: weights_only=True is now used in PipelineModelWrapper.load, NeighborConfig.skin is correctly forwarded, _configure_sub_models no longer permanently mutates sub-model configs, standalone AIMNet2 stress is implemented via the affine strain trick, and positions/cell are restored after strain. The remaining unfixed issues from prior threads (requires_grad_(True) on non-leaf scaled_pos in AIMNet2Wrapper.adapt_input, and the PipelineStep docstring wire example) are unchanged. New comments in this round are all P2 quality suggestions.

Important Files Changed

Filename Overview
nvalchemi/models/base.py Unified ModelConfig (capability+runtime fields) and BaseModelMixin interface; skin forwarding to NeighborListHook fixed; no new issues found
nvalchemi/models/pipeline.py New PipelineModelWrapper with grouped autograd/direct execution; weights_only=True in load(), skin propagated, _configure_sub_models no longer permanently mutates sub-models — previous P0/P1 issues addressed
nvalchemi/models/aimnet2.py New AIMNet2Wrapper with external MATRIX neighbor list; stress implemented via affine strain trick; requires_grad_(True) on non-leaf scaled_pos still raises at runtime when stress is active (tracked in previous thread)
nvalchemi/models/_utils.py New shared utilities (autograd_forces, autograd_stresses, prepare_strain, sum_outputs); implementation looks correct
nvalchemi/neighbors.py New compute_neighbors one-shot API; shares _write_neighbor_data_to_batch helper with NeighborListHook; looks clean
nvalchemi/hooks/neighbor_list.py Updated to accept skin parameter; refactored to share _write_neighbor_data_to_batch with neighbors.py; no issues
nvalchemi/models/ewald.py Updated to new ModelConfig interface; required_inputs correctly declares charges; no issues
nvalchemi/models/mace.py Updated to new ModelConfig interface; adapt_output docstring incorrectly claims key renaming when keys are identical; no functional issues
nvalchemi/models/init.py Lazy imports for all wrappers; PipelineModelWrapper/PipelineGroup/PipelineStep added; registry removed cleanly
test/models/test_base.py Comprehensive new tests for ModelConfig, BaseModelMixin, and _utils; good coverage of edge cases
test/models/test_composable.py Migrated from ComposableModelWrapper to PipelineModelWrapper tests; covers construction, synthesis, forward shapes, and edge cases
nvalchemi/models/lj.py Updated to new ModelConfig interface; Warp-kernel virial negation and buffer reuse look correct

Reviews (9): Last reviewed commit: "fix skin dropping" | Re-trigger Greptile

Comment thread nvalchemi/models/pipeline.py Outdated
Comment on lines +143 to +157
outputs = {"energy", "forces", "stress", "charges"}
if self._is_nse:
outputs.add("spin_charges")

self.model_config = ModelConfig(
outputs=frozenset(outputs),
autograd_outputs=frozenset({"forces", "stress"}),
autograd_inputs=frozenset({"positions"}),
required_inputs=frozenset({"charge"}),
optional_inputs=frozenset(),
supports_pbc=True,
needs_pbc=False,
neighbor_config=None, # AIMNet2 manages its own neighbor list
active_outputs=outputs,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 "stress" advertised in outputs but standalone stress computation silently does nothing

"stress" is included in model_config.outputs (and in autograd_outputs) and in active_outputs, meaning output_data() returns it and adapt_output will include a "stress" key in every return value. However, the forward() implementation at lines 502–506 does:

if compute_stresses:
    # Stresses require a displacement tensor — not yet implemented
    # for standalone AIMNet2.
    pass

The key is never populated, so callers receive output["stress"] = None with no warning. Any downstream code that checks if "stress" in outputs or iterates over the dict will silently consume a None tensor, which can cause cryptic failures later (e.g. in sum_outputs).

Either remove "stress" from outputs/autograd_outputs until standalone stress is implemented, or emit a warnings.warn when the key is requested but cannot be satisfied:

# Remove stress from advertised capabilities until implemented
outputs = {"energy", "forces", "charges"}

Comment thread nvalchemi/models/pipeline.py Outdated
If the number of models doesn't match the saved config, or
if a group requires a derivative_fn that wasn't provided.
"""
checkpoint = torch.load(path, weights_only=False)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 security weights_only=False allows arbitrary pickle execution

torch.load(..., weights_only=False) deserializes arbitrary Python objects, making it possible for a malicious checkpoint to execute arbitrary code. Since the saved format only needs primitive Python types (dict, list, str, bool, int), weights_only=True can be used for the config/metadata portion:

checkpoint = torch.load(path, weights_only=True)

If that is not sufficient for all saved dtypes, consider splitting into two loads — one weights_only=True for config metadata and one restricted call for weights — or using torch.serialization.add_safe_globals to allowlist only the needed types.

Comment thread nvalchemi/models/pipeline.py Outdated
Copy link
Copy Markdown
Collaborator

@laserkelvin laserkelvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This is a large, well-structured refactor that unifies the model subsystem around ModelConfig (replacing the ModelCard/ModelConfig split), replaces ComposableModelWrapper with a more powerful PipelineModelWrapper supporting autograd groups and output wiring, and adds first-class AIMNet2 support. The architecture is a clear improvement: frozen capability fields + mutable runtime fields in ModelConfig, the pipeline group/step abstraction for composing models, and shared autograd utilities. The test coverage for base.py and pipeline.py is excellent, including analytical force correctness checks. However, there are broken AIMNet2 tests and stale documentation that need fixing before merge.

Issues Found

  1. [critical] test/models/test_aimnet2.py:L121-129,L145-151 -- All 14 AIMNet2 tests fail because _make_wrapper() and test_import_guard reference m._AIMNET_AVAILABLE which no longer exists. The module now uses @OptionalDependency.AIMNET.require instead of a module-level boolean flag.
    Fix: Rewrite _make_wrapper() to monkey-patch the OptionalDependency.AIMNET availability check instead of the removed _AIMNET_AVAILABLE flag. For test_import_guard, patch the decorator to simulate the import guard behavior.

  2. [critical] docs/modules/models.rst:L18,L88-95 -- References ModelCard (line 18) and nvalchemi.models.composable.ComposableModelWrapper (lines 88-95), both of which are deleted in this PR. Sphinx build will produce broken cross-references.
    Fix: Replace ModelCard with ModelConfig on line 18. Replace the "Composition" section with entries for PipelineModelWrapper, PipelineStep, PipelineGroup under nvalchemi.models.pipeline. Add AIMNet2Wrapper under a new ML Potentials subsection.

  3. [warning] docs/userguide/about/intro.md:L114,L150 -- Refers to ModelCard (line 114) and ComposableModelWrapper (line 150) in the intro guide table.
    Fix: Update to reference ModelConfig and PipelineModelWrapper.

  4. [warning] docs/index.md:L49 -- "standardized ModelCard interface" should be "standardized ModelConfig interface".

  5. [warning] nvalchemi/dynamics/base.py:L1394-1395 -- The instance variable name remains self.model_card but now points to a ModelConfig object. This naming mismatch is confusing.
    Fix: Consider renaming to self.model_config with a deprecation property for model_card.

  6. [warning] nvalchemi/models/pipeline.py _check_wiring() -- Accumulates available keys and logs them but performs no actual validation that downstream models' required inputs are satisfied by upstream outputs. The method name implies checking but it's purely informational.
    Fix: Either rename to _log_wiring() or add actual validation.

Downstream Impact

  • Dynamics: compute() rewritten with output detachment and flexible key mapping. FusedStage masked_update fix removes post_update call. All 936 dynamics tests pass.
  • Neighbor hooks: Refactored to delegate to shared _write_neighbor_data_to_batch(). Functionally equivalent.
  • __init__.py: Lazy imports updated. No remaining references to ModelCard, ComposableModelWrapper, or registry in source code.
  • Examples: 07_composable_model_composition.py updated for + operator. New 08_aimnet2_ewald_pipeline.py demonstrates autograd pipeline. Both look correct.

Test and Example Results

  • test_base.py: All PASS
  • test_pipeline.py: All PASS (including analytical force verification)
  • test_aimnet2.py: All 14 FAIL -- AttributeError: module has no attribute '_AIMNET_AVAILABLE'
  • test_composable.py: All 27 migration tests PASS
  • test_ewald.py: All PASS
  • test_pme.py: All PASS
  • test/dynamics/: All 936 non-slow tests PASS

User Documentation

Must update before merge:

  1. docs/modules/models.rst -- Remove ModelCard and ComposableModelWrapper; add pipeline and AIMNet2 classes
  2. docs/userguide/about/intro.md L114, L150 -- Replace ModelCard and ComposableModelWrapper
  3. docs/index.md L49 -- Replace ModelCard with ModelConfig
  4. Consider adding a doc page for nvalchemi.neighbors

Verdict

Request Changes -- The architecture and implementation quality are high, but all 14 AIMNet2 tests are broken and multiple doc pages reference deleted types. Fix the test helper, update the stale docs, and this is ready to merge.

Copy link
Copy Markdown
Collaborator

@laserkelvin laserkelvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left inline comments as well as the ones in my earlier comment

Comment thread docs/models/index.md
.. model-capability-table::
:category: ml
```{list-table}
:header-rows: 1
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can reinstate this later, but it was kind of nice to be able to auto-generate this table

Comment thread docs/userguide/models.md Outdated
Comment thread examples/advanced/README.rst
Comment thread nvalchemi/models/_utils.py Outdated
Combined output dict.
"""
additive = additive_keys or {"energy", "forces", "stress"}
result: ModelOutputs = OrderedDict()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope for this PR, but I imagine using TensorDict instead might be better than just the OrderedDict

Comment thread nvalchemi/models/base.py
Comment thread nvalchemi/models/base.py
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 10, 2026

Tip:

Greploop — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.

Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.

@dallasfoster
Copy link
Copy Markdown
Collaborator Author

/ok to test 4f069de

Comment thread nvalchemi/models/aimnet2.py
Comment on lines +303 to +305
# Enable grad on positions if any autograd output is active.
if self.model_config.autograd_outputs & self.model_config.active_outputs:
data.positions.requires_grad_(True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 requires_grad_(True) on non-leaf scaled_pos raises RuntimeError

When standalone stress computation is requested, forward() calls prepare_strain and replaces data["positions"] with scaled_pos before calling adapt_input. scaled_pos is a non-leaf tensor (derived from displacement.requires_grad_(True)), so calling requires_grad_(True) on it from inside adapt_input raises RuntimeError: you can only change requires_grad flags of leaf variables.

Reproduction path: standalone AIMNet2Wrapper with "stress" in active_outputs and a batch with a cell.

The fix is to enable gradients on positions before prepare_strain is applied — matching the pattern used by PipelineModelWrapper._run_autograd_group: enable requires_grad on positions first, then apply prepare_strain so strain is built on top of the leaf tensor, then remove the requires_grad_(True) call from adapt_input.

Comment thread nvalchemi/models/pipeline.py Outdated
Comment on lines +418 to +429
def _configure_sub_models(self) -> None:
"""Adjust sub-model configs based on group autograd strategy."""
for group in self.groups:
if group.use_autograd:
for step in group.steps:
# Remove derivative keys from sub-model active_outputs —
# the group will compute these via autograd.
new_active = set(step.model.model_config.active_outputs)
new_active -= {"forces", "stress"}
step.model.model_config = step.model.model_config.model_copy(
update={"active_outputs": new_active}
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 _configure_sub_models permanently mutates sub-model model_config

At pipeline construction time, _configure_sub_models replaces each sub-model's model_config with a copy that has "forces" and "stress" stripped from active_outputs. Because step.model.model_config is replaced directly, the sub-model is permanently changed for the lifetime of the object.

If the same model instance is later reused — in a second pipeline, standalone, or passed to a dynamics engine — it will be missing forces/stress in its active_outputs, silently producing only energies. The second pipeline's _build_model_config will then synthesize an active_outputs that lacks derivatives, with no warning.

The fix is to record the desired active outputs at pipeline build time without mutating the sub-model, for example by storing the per-step override in a dict keyed by step identity instead of patching step.model.model_config.

Comment thread nvalchemi/models/base.py Outdated
@dallasfoster
Copy link
Copy Markdown
Collaborator Author

/ok to test a8987d4

@dallasfoster
Copy link
Copy Markdown
Collaborator Author

/ok to test ff98e22

Comment on lines +310 to +315
neighbor_config = NeighborConfig(
cutoff=max_cutoff,
format=chosen_format,
half_list=sub_neighbor_configs[0].half_list,
max_neighbors=max_neighbors,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 skin dropped when synthesizing pipeline NeighborConfig

_build_model_config constructs the merged NeighborConfig without forwarding skin, so it always defaults to 0.0. make_neighbor_hooks() then calls NeighborListHook(nc, skin=nc.skin) — but nc.skin is 0.0 regardless of what any sub-model specified. Any pipeline whose sub-models set skin > 0 will rebuild the neighbor list every step, silently discarding the Verlet-skin optimization.

@dallasfoster
Copy link
Copy Markdown
Collaborator Author

/ok to test c34cde0

Comment thread nvalchemi/models/pipeline.py
Comment thread nvalchemi/models/pipeline.py
@dallasfoster
Copy link
Copy Markdown
Collaborator Author

/ok to test a290412

@dallasfoster dallasfoster merged commit f9fc2e9 into NVIDIA:main Apr 10, 2026
3 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants