Adding CIF file support by vratins · Pull Request #83 · diff-use/WaterFlow

vratins · 2026-06-17T04:32:03Z

dataset.py and utils.py now support reading .cif files alongside .pdb:

_read_structure() dispatches to biotite's CIF or PDB reader based on file extension, used by parse_asu_with_biotite and compute_normalized_bfactors.
compute_normalized_bfactors no longer re-reads the file when bfactors are already available from parse_asu_with_biotite; the shared logic moves to _compute_normalized_bfactors_from_atoms().
_parse_pdb_list now prefers a PDB ID's .cif file over .pdb when both exist (checked via a single os.path.isfile, no directory scan).
utils.resolve_structure_path() resolves a preferred CIF/PDB path to whichever file actually exists on disk; parse_split_file() now returns preferred .cif paths for callers to resolve lazily.

Summary by CodeRabbit

New Features
- Added support for CIF structure files alongside PDB.
- When both are available, CIF is preferred for dataset and split-file structure resolution.
Bug Fixes
- More robust handling of missing/unreadable structure files during embedding generation and preprocessing, with more consistent failure behavior.
Improvements
- Faster B-factor normalization by reusing already-parsed structure data and ensuring required atom fields are available.
- Improved crystal-contact handling to use the resolved structure source.
Tests
- Expanded CIF parsing, structure-selection, and dataset/cache behavior coverage.

coderabbitai · 2026-06-17T04:32:11Z

Warning

Review limit reached

@vratins, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 34 minutes and 54 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d1e1e396-1809-4d9b-9b44-4a3129919787

📥 Commits

Reviewing files that changed from the base of the PR and between 6a78f4c and 80d89de.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (2)

src/dataset.py
tests/test_dataset.py

📝 Walkthrough

Walkthrough

The PR resolves structure files to CIF or PDB paths and threads the resolved path through dataset parsing, preprocessing, embedding generation, and tests. It also updates normalized B-factor computation to reuse already-parsed atoms.

Changes

CIF Structure Format Support

Layer / File(s)	Summary
Resolved structure paths `src/utils.py`, `tests/test_utils.py`	`parse_split_file` now resolves each entry to an existing structure path, preferring `.cif` and falling back to `.pdb`, and the matching split-file tests cover CIF preference, PDB fallback, and missing-file errors.
Dataset parsing and preprocessing `src/dataset.py`	`src/dataset.py` adds suffix-aware structure loading, uses resolved structure paths throughout preprocessing, and computes normalized B-factors from already parsed water atoms.
Embedding scripts use resolved paths `scripts/generate_esm_embeddings.py`, `scripts/generate_slae_embeddings.py`	The embedding scripts now consume `struc_path` entries and remove the pre-checks that reported missing PDB files before embedding or parsing.
Dataset and parsing behavior tests `tests/conftest.py`, `tests/test_dataset.py`	Shared path-resolution fixtures are added, and dataset tests add CIF parsing coverage, update item assertions, and verify split parsing prefers CIF, falls back to PDB, and preserves requested ordering.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through paths from CIF to PDB,
Sniffed out the right file for the dataset tree.
Water beads sparkle, B-factors align,
Embeddings now follow the resolved line.
Thump, thump — the rabbit says, “All set!” 🌿

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding CIF file support alongside existing PDB handling.
Docstring Coverage	✅ Passed	Docstring coverage is 86.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dev_cif_support

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

Copilot

Pull request overview

Adds mmCIF (.cif) structure support alongside PDB across dataset loading and related utilities, plus tests and fixtures to validate CIF parsing and path selection behavior.

Changes:

Introduces _read_structure() in src/dataset.py to dispatch PDB vs CIF parsing via biotite, and reuses already-parsed water atoms for normalized B-factor computation.
Updates dataset list parsing to prefer an existing *_final.cif over *_final.pdb without scanning directories.
Adds utils.resolve_structure_path() and adjusts parse_split_file() to emit preferred CIF paths, with new unit/integration tests and a CIF test fixture file.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`src/dataset.py`	Adds CIF/PDB dispatch reader, refactors normalized B-factor computation, and prefers CIF in `_parse_pdb_list()`.
`src/utils.py`	Adds `resolve_structure_path()` and changes `parse_split_file()` to emit preferred CIF paths.
`tests/test_dataset.py`	Adds CIF parsing integration tests and dataset list parsing tests for CIF preference and filesystem probing behavior.
`tests/test_utils.py`	Adds unit tests for `resolve_structure_path()` and verifies `parse_split_file()` doesn’t touch the filesystem.
`tests/conftest.py`	Adds a `cif_6eey` fixture resolver.
`tests/test_files/6eey/6eey_final.cif`	Adds a real mmCIF test structure file for integration coverage.

Comments suppressed due to low confidence (1)

src/utils.py:156

parse_split_file()'s docstring claims it raises ValueError when the file contains only malformed lines, but the function never raises and will just return an empty list. This mismatch can mislead callers and tests; either implement the exception or remove the Raises: section.

    Raises:
        ValueError: If split_file contains only malformed lines
    """

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/dataset.py`:
- Around line 50-60: The _read_structure function uses path.endswith(".cif")
which will fail when path is a pathlib.Path object instead of a string. Convert
the path parameter to a string at the beginning of the function using str(path)
before performing the suffix check with endswith(), or alternatively use
pathlib.Path(path).suffix to extract the file extension in a path-agnostic way.
This ensures the function handles both string and Path-like inputs correctly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c02ff7f-09e1-4727-9242-456916027cae

📥 Commits

Reviewing files that changed from the base of the PR and between c3b9db6 and b1ef5d2.

📒 Files selected for processing (8)

scripts/generate_esm_embeddings.py
scripts/generate_slae_embeddings.py
src/dataset.py
src/utils.py
tests/conftest.py
tests/test_dataset.py
tests/test_files/6eey/6eey_final.cif
tests/test_utils.py

Copilot

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/utils.py:156

The docstring claims parse_split_file() raises ValueError when the split contains only malformed lines, but the implementation never raises (it just returns an empty list). Either implement the raise or remove this Raises: section to keep docs accurate.

    Raises:
        ValueError: If split_file contains only malformed lines
    """

marcuscollins

There are several things that could be improved, particularly in the tests. Please have a look and see if you can make some changes.

marcuscollins · 2026-06-24T16:23:22Z

    return ins


+def resolve_structure_path(path: str | Path) -> Path | None:


This function seems a little superfluous. I don't think you need to look for a different file extension if the specified one isn't there. Just raise an error.

Done, created a single stuc_path that should be a cif or pdb, and errors out in the parse function if the file does not exist. Made this change wherever a "pdb_path" is invoked.

marcuscollins · 2026-06-24T16:26:42Z

You can actually iterate over the file directly by using for line in open(split_file, "r"):, it will manage the file closing for you.

That said, you should just create a csv or tsv file with the pieces you need. E.g., lines like:

1VME, /path/to/1VME.cif, <<cache_key>>

and then read them in with pandas or something similar.

Made the change for the loop. As for the second comment, do you mean my input split files should be formatted that way? Since the path and cache keys can change depending on the directory we specify, hence I chose to go with a simple text file, and we can specify where to find the structure files (path dir) and where to create the cache.

marcuscollins · 2026-06-24T16:28:04Z

Simplify this: parts = line.strip().split("_")

marcuscollins · 2026-06-24T16:32:01Z

+
+    def test_cif_hydrogen_removed(self, cif_6eey):
+        """Hydrogens should be removed from CIF-parsed arrays."""
+        protein_atoms, water_atoms = parse_asu_with_biotite(cif_6eey)


These could be a test fixtures since they are re-used many times. It would speed up your tests.

marcuscollins · 2026-06-24T16:33:37Z

+        base_dir = tmp_path / "pdbs"
+        self._write_structure(base_dir, "abcd", [".cif", ".pdb"])
+
+        list_file = tmp_path / "list.txt"


can you be sure that the tmp directory is cleaned up?

Yes, pytest tmp_path are cleaned up by default

marcuscollins · 2026-06-24T16:38:53Z

+            base_dir / "keep2" / "keep2_final.cif"
+        )
+
+    def test_parse_does_not_probe_filesystem(self, tmp_path, monkeypatch):


This is a weird test. Is there some reason we care about this? Remove it if there isn't some very specific concern.

marcuscollins · 2026-06-24T16:41:49Z

+        )
+
+        assert [entry["pdb_id"] for entry in dataset.entries] == ["keep1", "keep2"]
+        assert dataset.entries[0]["pdb_path"] == (


This is testing a couple different things:

which ids are retained and which are discarded

a specific set of paths and file name conventions.

In general a unit test should really test one thing. To achieve that here, you could create a method that generates the file paths from the pdb id, and then test that method separately.

Addressed below.

marcuscollins · 2026-06-24T16:44:34Z

+    """Tests for lazy CIF/PDB path resolution helpers."""
+
+    @staticmethod
+    def _write_structure(base_dir: Path, pdb_id: str, suffixes: list[str]) -> None:


A lot of these tests are redundant with some in test_dataset.py. That just adds time to your test runs. For instance, a lot of the tests above are implicitly testing resolve_structure_path. Please go through these and minimize them. You can use a test coverage report to determine whether you've decreased your coverage when you remove any tests.

Addressing both this and the above comment, I've tried to remove some redundant tests, and consolidate as much as I can while minimizing coverage.

Copilot

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.

vratins · 2026-06-24T21:08:01Z

-
-            parts = line.split("_")
-            if len(parts) < 2:
+    for line in open(split_file, "r"):


@marcuscollins I'll defer a decision on this to you -- is this needed?

dataset.py and utils.py now support reading .cif files alongside .pdb

629e874

Copilot AI review requested due to automatic review settings June 17, 2026 04:32

vratins temporarily deployed to testing June 17, 2026 04:32 — with GitHub Actions Inactive

Copilot started reviewing on behalf of vratins June 17, 2026 04:32 View session

Copilot AI reviewed Jun 17, 2026

View reviewed changes

Comment thread src/utils.py

Comment thread src/dataset.py Outdated

addressing comments

b1ef5d2

vratins temporarily deployed to testing June 24, 2026 00:31 — with GitHub Actions Inactive

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread src/dataset.py Outdated

addressing comments

ed6cfe0

Copilot AI review requested due to automatic review settings June 24, 2026 01:05

vratins temporarily deployed to testing June 24, 2026 01:05 — with GitHub Actions Inactive

Copilot started reviewing on behalf of vratins June 24, 2026 01:06 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

Comment thread src/utils.py Outdated

Comment thread src/utils.py Outdated

Comment thread src/dataset.py

marcuscollins reviewed Jun 24, 2026

View reviewed changes

addressing CR comments and cleaning up tests

87d72be

vratins temporarily deployed to testing June 24, 2026 20:41 — with GitHub Actions Inactive

Auto-commit ruff fixes [skip ci]

6a78f4c

Copilot AI review requested due to automatic review settings June 24, 2026 20:43

vratins removed the request for review from Copilot June 24, 2026 20:43

Merge branch 'main' into dev_cif_support

e5f7b30

Copilot AI review requested due to automatic review settings June 24, 2026 20:52

vratins had a problem deploying to testing June 24, 2026 20:52 — with GitHub Actions Failure

vratins temporarily deployed to testing June 24, 2026 20:52 — with GitHub Actions Inactive

Copilot started reviewing on behalf of vratins June 24, 2026 20:52 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

fix merge issues and redundant tests

80d89de

vratins temporarily deployed to testing June 24, 2026 21:06 — with GitHub Actions Inactive

		return ins


		def resolve_structure_path(path: str \| Path) -> Path \| None:

Uh oh!

Conversation

vratins commented Jun 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

marcuscollins left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

vratins commented Jun 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading