Skip to content

Adding CIF file support#83

Open
vratins wants to merge 7 commits into
mainfrom
dev_cif_support
Open

Adding CIF file support#83
vratins wants to merge 7 commits into
mainfrom
dev_cif_support

Conversation

@vratins

@vratins vratins commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

dataset.py and utils.py now support reading .cif files alongside .pdb:

  • _read_structure() dispatches to biotite's CIF or PDB reader based on file extension, used by parse_asu_with_biotite and compute_normalized_bfactors.

  • compute_normalized_bfactors no longer re-reads the file when bfactors are already available from parse_asu_with_biotite; the shared logic moves to _compute_normalized_bfactors_from_atoms().

  • _parse_pdb_list now prefers a PDB ID's .cif file over .pdb when both exist (checked via a single os.path.isfile, no directory scan).

  • utils.resolve_structure_path() resolves a preferred CIF/PDB path to whichever file actually exists on disk; parse_split_file() now returns preferred .cif paths for callers to resolve lazily.

Summary by CodeRabbit

  • New Features
    • Added support for CIF structure files alongside PDB.
    • When both are available, CIF is preferred for dataset and split-file structure resolution.
  • Bug Fixes
    • More robust handling of missing/unreadable structure files during embedding generation and preprocessing, with more consistent failure behavior.
  • Improvements
    • Faster B-factor normalization by reusing already-parsed structure data and ensuring required atom fields are available.
    • Improved crystal-contact handling to use the resolved structure source.
  • Tests
    • Expanded CIF parsing, structure-selection, and dataset/cache behavior coverage.

Copilot AI review requested due to automatic review settings June 17, 2026 04:32
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@vratins, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 34 minutes and 54 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d1e1e396-1809-4d9b-9b44-4a3129919787

📥 Commits

Reviewing files that changed from the base of the PR and between 6a78f4c and 80d89de.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • src/dataset.py
  • tests/test_dataset.py
📝 Walkthrough

Walkthrough

The PR resolves structure files to CIF or PDB paths and threads the resolved path through dataset parsing, preprocessing, embedding generation, and tests. It also updates normalized B-factor computation to reuse already-parsed atoms.

Changes

CIF Structure Format Support

Layer / File(s) Summary
Resolved structure paths
src/utils.py, tests/test_utils.py
parse_split_file now resolves each entry to an existing structure path, preferring .cif and falling back to .pdb, and the matching split-file tests cover CIF preference, PDB fallback, and missing-file errors.
Dataset parsing and preprocessing
src/dataset.py
src/dataset.py adds suffix-aware structure loading, uses resolved structure paths throughout preprocessing, and computes normalized B-factors from already parsed water atoms.
Embedding scripts use resolved paths
scripts/generate_esm_embeddings.py, scripts/generate_slae_embeddings.py
The embedding scripts now consume struc_path entries and remove the pre-checks that reported missing PDB files before embedding or parsing.
Dataset and parsing behavior tests
tests/conftest.py, tests/test_dataset.py
Shared path-resolution fixtures are added, and dataset tests add CIF parsing coverage, update item assertions, and verify split parsing prefers CIF, falls back to PDB, and preserves requested ordering.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through paths from CIF to PDB,
Sniffed out the right file for the dataset tree.
Water beads sparkle, B-factors align,
Embeddings now follow the resolved line.
Thump, thump — the rabbit says, “All set!” 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding CIF file support alongside existing PDB handling.
Docstring Coverage ✅ Passed Docstring coverage is 86.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dev_cif_support

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds mmCIF (.cif) structure support alongside PDB across dataset loading and related utilities, plus tests and fixtures to validate CIF parsing and path selection behavior.

Changes:

  • Introduces _read_structure() in src/dataset.py to dispatch PDB vs CIF parsing via biotite, and reuses already-parsed water atoms for normalized B-factor computation.
  • Updates dataset list parsing to prefer an existing *_final.cif over *_final.pdb without scanning directories.
  • Adds utils.resolve_structure_path() and adjusts parse_split_file() to emit preferred CIF paths, with new unit/integration tests and a CIF test fixture file.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/dataset.py Adds CIF/PDB dispatch reader, refactors normalized B-factor computation, and prefers CIF in _parse_pdb_list().
src/utils.py Adds resolve_structure_path() and changes parse_split_file() to emit preferred CIF paths.
tests/test_dataset.py Adds CIF parsing integration tests and dataset list parsing tests for CIF preference and filesystem probing behavior.
tests/test_utils.py Adds unit tests for resolve_structure_path() and verifies parse_split_file() doesn’t touch the filesystem.
tests/conftest.py Adds a cif_6eey fixture resolver.
tests/test_files/6eey/6eey_final.cif Adds a real mmCIF test structure file for integration coverage.
Comments suppressed due to low confidence (1)

src/utils.py:156

  • parse_split_file()'s docstring claims it raises ValueError when the file contains only malformed lines, but the function never raises and will just return an empty list. This mismatch can mislead callers and tests; either implement the exception or remove the Raises: section.
    Raises:
        ValueError: If split_file contains only malformed lines
    """

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/utils.py
Comment thread src/dataset.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/dataset.py`:
- Around line 50-60: The _read_structure function uses path.endswith(".cif")
which will fail when path is a pathlib.Path object instead of a string. Convert
the path parameter to a string at the beginning of the function using str(path)
before performing the suffix check with endswith(), or alternatively use
pathlib.Path(path).suffix to extract the file extension in a path-agnostic way.
This ensures the function handles both string and Path-like inputs correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c02ff7f-09e1-4727-9242-456916027cae

📥 Commits

Reviewing files that changed from the base of the PR and between c3b9db6 and b1ef5d2.

📒 Files selected for processing (8)
  • scripts/generate_esm_embeddings.py
  • scripts/generate_slae_embeddings.py
  • src/dataset.py
  • src/utils.py
  • tests/conftest.py
  • tests/test_dataset.py
  • tests/test_files/6eey/6eey_final.cif
  • tests/test_utils.py

Comment thread src/dataset.py Outdated
Copilot AI review requested due to automatic review settings June 24, 2026 01:05

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/utils.py:156

  • The docstring claims parse_split_file() raises ValueError when the split contains only malformed lines, but the implementation never raises (it just returns an empty list). Either implement the raise or remove this Raises: section to keep docs accurate.
    Raises:
        ValueError: If split_file contains only malformed lines
    """

Comment thread src/utils.py Outdated
Comment thread src/utils.py Outdated
Comment thread src/dataset.py

@marcuscollins marcuscollins left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several things that could be improved, particularly in the tests. Please have a look and see if you can make some changes.

Comment thread src/utils.py Outdated
return ins


def resolve_structure_path(path: str | Path) -> Path | None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems a little superfluous. I don't think you need to look for a different file extension if the specified one isn't there. Just raise an error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, created a single stuc_path that should be a cif or pdb, and errors out in the parse function if the file does not exist. Made this change wherever a "pdb_path" is invoked.

Comment thread src/utils.py Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can actually iterate over the file directly by using for line in open(split_file, "r"):, it will manage the file closing for you.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, you should just create a csv or tsv file with the pieces you need. E.g., lines like:

1VME, /path/to/1VME.cif, <<cache_key>>

and then read them in with pandas or something similar.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change for the loop. As for the second comment, do you mean my input split files should be formatted that way? Since the path and cache keys can change depending on the directory we specify, hence I chose to go with a simple text file, and we can specify where to find the structure files (path dir) and where to create the cache.

Comment thread src/utils.py Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify this: parts = line.strip().split("_")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread tests/test_dataset.py Outdated

def test_cif_hydrogen_removed(self, cif_6eey):
"""Hydrogens should be removed from CIF-parsed arrays."""
protein_atoms, water_atoms = parse_asu_with_biotite(cif_6eey)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These could be a test fixtures since they are re-used many times. It would speed up your tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread tests/test_dataset.py
base_dir = tmp_path / "pdbs"
self._write_structure(base_dir, "abcd", [".cif", ".pdb"])

list_file = tmp_path / "list.txt"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you be sure that the tmp directory is cleaned up?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, pytest tmp_path are cleaned up by default

Comment thread tests/test_dataset.py Outdated
base_dir / "keep2" / "keep2_final.cif"
)

def test_parse_does_not_probe_filesystem(self, tmp_path, monkeypatch):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a weird test. Is there some reason we care about this? Remove it if there isn't some very specific concern.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread tests/test_dataset.py Outdated
)

assert [entry["pdb_id"] for entry in dataset.entries] == ["keep1", "keep2"]
assert dataset.entries[0]["pdb_path"] == (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is testing a couple different things:

  1. which ids are retained and which are discarded
  2. a specific set of paths and file name conventions.

In general a unit test should really test one thing. To achieve that here, you could create a method that generates the file paths from the pdb id, and then test that method separately.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed below.

Comment thread tests/test_utils.py Outdated
"""Tests for lazy CIF/PDB path resolution helpers."""

@staticmethod
def _write_structure(base_dir: Path, pdb_id: str, suffixes: list[str]) -> None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these tests are redundant with some in test_dataset.py. That just adds time to your test runs. For instance, a lot of the tests above are implicitly testing resolve_structure_path. Please go through these and minimize them. You can use a test coverage report to determine whether you've decreased your coverage when you remove any tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressing both this and the above comment, I've tried to remove some redundant tests, and consolidate as much as I can while minimizing coverage.

Copilot AI review requested due to automatic review settings June 24, 2026 20:43
@vratins vratins removed the request for review from Copilot June 24, 2026 20:43
Copilot AI review requested due to automatic review settings June 24, 2026 20:52

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.

Comment thread src/dataset.py
Comment thread src/utils.py

parts = line.split("_")
if len(parts) < 2:
for line in open(split_file, "r"):

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marcuscollins I'll defer a decision on this to you -- is this needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants