Skip to content

feat(eval): canonicalize mixed-altloc modified residues#263

Open
k-chrispens wants to merge 1 commit into
mainfrom
kmc/canonicalize-altloc-residues
Open

feat(eval): canonicalize mixed-altloc modified residues#263
k-chrispens wants to merge 1 commit into
mainfrom
kmc/canonicalize-altloc-residues

Conversation

@k-chrispens

@k-chrispens k-chrispens commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Adds a processing utility that is necessary for getting certain proteins (6NI6/6NI5) to work properly in the evaluation scripts, since those contain compositional heterogeneity in the form of a modified amino acid.

Summary by CodeRabbit

  • New Features

    • Improved protein structure processing with enhanced alternate location (altloc) residue handling
    • Modified amino acids are now automatically mapped to their canonical parent forms, ensuring proper processing of mixed altloc structures
  • Tests

    • Added comprehensive test coverage for canonicalization functionality

Copilot AI review requested due to automatic review settings June 16, 2026 22:36
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@k-chrispens, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 59 minutes and 39 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f08ef92c-61bb-46f2-9ee5-e623a4a824cf

📥 Commits

Reviewing files that changed from the base of the PR and between 5c4a8a3 and 4b8cc2a.

📒 Files selected for processing (2)
  • src/sampleworks/eval/structure_utils.py
  • tests/eval/test_structure_utils.py
📝 Walkthrough

Walkthrough

Adds two helpers to structure_utils.py: _closest_canonical_amino_acid maps modified residue names (e.g., CSO→CYS) to canonical parents, and canonicalize_mixed_altloc_residues normalizes mixed-altloc positions by renaming modified residues and clearing hetero flags. The canonicalization is applied during reference-structure loading. Tests validate all mapping and mutation-safety behaviors.

Changes

Mixed altloc canonicalization

Layer / File(s) Summary
Canonicalization helpers and supporting imports
src/sampleworks/eval/structure_utils.py
Adds defaultdict, atomworks residue-name conversion imports, _closest_canonical_amino_acid (modified→canonical three-letter name, None for non-amino-acids), and canonicalize_mixed_altloc_residues (copies input, groups atoms by (chain_id, res_id, ins_code), detects res_name disagreement across altlocs, renames modified residues to canonical parent, clears hetero for renamed records).
Reference-structure integration and tests
src/sampleworks/eval/structure_utils.py, tests/eval/test_structure_utils.py
Pipes loaded reference structure through canonicalize_mixed_altloc_residues before downstream processing. TestCanonicalizeMixedAltlocResidues covers _closest_canonical_amino_acid parameter cases (CSO→CYS, MSE→MET, ALA unchanged, HOH→None), canonicalize_mixed_altloc_residues renaming and hetero-clearing at mixed positions, MSE non-mutation, and input immutability.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 A residue mixed up and lost its way,
CSO wanted CYS's role to play.
With a copy, a group, a lookup so keen,
The hetero flag cleared, the altloc serene.
Hoppity-hop, canonical and bright —
Every position stacked just right! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main change: adding canonicalization for mixed-altloc modified residues, which aligns with the PR's core functionality and objectives.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kmc/canonicalize-altloc-residues

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an evaluation-time structure preprocessing utility to handle compositional heterogeneity where modified residues (e.g., CSO) appear as altlocs alongside their canonical forms (e.g., CYS), enabling reference structures like 6NI6/6NI5 to be processed successfully for evaluation.

Changes:

  • Introduces _closest_canonical_amino_acid() and canonicalize_mixed_altloc_residues() in eval/structure_utils.py to canonicalize mixed-altloc modified residues.
  • Wires the canonicalization step into get_reference_atomarraystack() prior to stacking altlocs.
  • Adds unit tests covering mapping behavior and non-mutating semantics.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/sampleworks/eval/structure_utils.py Adds canonicalization utilities and applies them during reference structure loading to prevent altloc stacking failures.
tests/eval/test_structure_utils.py Adds tests validating the canonical mapping and mixed-position renaming behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/sampleworks/eval/structure_utils.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/eval/test_structure_utils.py (1)

11-13: ⚡ Quick win

Avoid coupling tests to private helper internals.

Line 11 and Line 370 test _closest_canonical_amino_acid directly. Prefer validating behavior through canonicalize_mixed_altloc_residues(...) inputs/outputs so tests stay black-box and resilient to internal refactors. As per coding guidelines, "Write black-box tests that verify behavior, not implementation."

Also applies to: 366-371

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/eval/test_structure_utils.py` around lines 11 - 13, Remove the import
of the private function _closest_canonical_amino_acid from the imports section,
and refactor any direct tests of _closest_canonical_amino_acid to instead
validate its behavior indirectly through the public API by testing inputs and
outputs of canonicalize_mixed_altloc_residues. This ensures tests remain
black-box and resilient to internal refactoring of implementation details.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/eval/test_structure_utils.py`:
- Around line 353-355: Add NumPy-style docstrings to the test methods in the
specified ranges. For the _mixed_array() method and all other new test methods
in the affected ranges, add docstrings that follow NumPy style, which should
include a brief summary of what the method does, a Returns section describing
what is returned, and any other relevant sections as needed. Each docstring
should be placed immediately after the method signature and before any
implementation code or comments.

---

Nitpick comments:
In `@tests/eval/test_structure_utils.py`:
- Around line 11-13: Remove the import of the private function
_closest_canonical_amino_acid from the imports section, and refactor any direct
tests of _closest_canonical_amino_acid to instead validate its behavior
indirectly through the public API by testing inputs and outputs of
canonicalize_mixed_altloc_residues. This ensures tests remain black-box and
resilient to internal refactoring of implementation details.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ca247f13-768a-4038-8b42-68911a7e58f6

📥 Commits

Reviewing files that changed from the base of the PR and between b87cc5f and 5c4a8a3.

📒 Files selected for processing (2)
  • src/sampleworks/eval/structure_utils.py
  • tests/eval/test_structure_utils.py

Comment on lines +353 to +355
@staticmethod
def _mixed_array() -> AtomArray:
# (A,10): canonical CYS + modified CSO compositional heterogeneity

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add NumPy-style docstrings to the new test methods.

The new methods in this block are missing docstrings; this repo requires NumPy-style docstrings for every function and class. As per coding guidelines, "Always include NumPy-style docstrings for every function and class."

Also applies to: 370-384

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/eval/test_structure_utils.py` around lines 353 - 355, Add NumPy-style
docstrings to the test methods in the specified ranges. For the _mixed_array()
method and all other new test methods in the affected ranges, add docstrings
that follow NumPy style, which should include a brief summary of what the method
does, a Returns section describing what is returned, and any other relevant
sections as needed. Each docstring should be placed immediately after the method
signature and before any implementation code or comments.

Source: Coding guidelines

@marcuscollins marcuscollins left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this and I think it should ultimately replace the more involved method I previously wrote to handle compositional heterogeneity. But I think there are a couple of problems, mainly that I think this will leave extra atoms (most of which are duplicates of each other and probably have an altloc id that is different, but some will be atoms that are incompatible with the new residue type).

return None if parent == UNKNOWN_AA else parent


def canonicalize_mixed_altloc_residues(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this solve the same problem as resolve_mixed_hetatm_atom_altlocs? If so, IIRC AtomWorks inserts an extra residue, and it looks to me like this might not catch that.

Also, if this does solve that same problem, I'd rather we have one solution, not two which could diverge. I'm not saying my solution was better (it is a bit of a hack, tbh, to have to write a temporary CIF file). But before we put this in, we should unify the two (again, if they are the same)

continue # single consistent res_name means it doesn't have comp het
parent = parent_cache.setdefault(res_name, _closest_canonical_amino_acid(res_name))
if parent is not None and parent != res_name:
out.res_name[i] = parent

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the two residues don't have the same atoms? I see a couple issues here. One is that AtomWorks' parse will insert extra atoms when there's compositional heterogeneity. So, for example in 6NI6, you would end up with two CA, C, O, etc... as well as extra atoms that aren't compatible with the residue type anymore. I think probably we need to handle that and canonicalize the atoms, not just the residue names.

if len(res_names_by_position[position]) == 1:
continue # single consistent res_name means it doesn't have comp het
parent = parent_cache.setdefault(res_name, _closest_canonical_amino_acid(res_name))
if parent is not None and parent != res_name:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you trying to canonicalize all residues, or just make the sequences the same? What if a structure has two different non-canonicals at the same position, and no canonicals?

# form (e.g. CYS) makes map_altlocs_to_stack fail: the conformers disagree on res_name/hetero
# so biotite.stack() rejects them. Canonicalize these residues to get map_altlocs_to_stack to
# work
ref_struct = canonicalize_mixed_altloc_residues(load_structure_with_altlocs(ref_path))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think looking at your code, you should try to replace my method with yours although I'm pretty sure you need to get rid of extra atoms still.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't have time, make an issue and tag with engineering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants