Skip to content

Add method to find overlapping excerpts#272

Merged
rlskoeser merged 13 commits intodevelopfrom
feature/find-overlapping-excerpts
Apr 21, 2026
Merged

Add method to find overlapping excerpts#272
rlskoeser merged 13 commits intodevelopfrom
feature/find-overlapping-excerpts

Conversation

@rlskoeser
Copy link
Copy Markdown
Collaborator

@rlskoeser rlskoeser commented Apr 16, 2026

Associated Issue(s): resolves #251

Changes in this PR

  • Moved logic for merging grouped excerpts into a method with the hope of making it reusable
  • Added new method for identifying overlapping excerpts; adapted from existing span overlap factor logic

Notes

  • Overlap length / factor logic is adapted from @laurejt span evaluation code
  • Would appreciate input what fields to return and field names (I don't love the _right default but haven't come up with anything better); extra fields have been helpful when testing, should they be optional? removed?

Review checklist

  • check logic for identifying high-confidence overlap
  • weigh in on returned field question above
  • check unit tests - are there important cases I should add? (more than one pair?)

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.82%. Comparing base (074661f) to head (5f9e59d).
⚠️ Report is 15 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #272      +/-   ##
===========================================
+ Coverage    79.76%   79.82%   +0.05%     
===========================================
  Files           23       23              
  Lines         2120     2126       +6     
===========================================
+ Hits          1691     1697       +6     
  Misses         429      429              
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@rlskoeser rlskoeser requested a review from laurejt April 16, 2026 20:59
Copy link
Copy Markdown
Contributor

@laurejt laurejt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added my initial thoughts on merge_excerpts.py, but haven't looked at its unit tests yet.

Some higher-level musings:

  • Perhaps it'd be useful to add an additional field with excerpt length given how often we operate on it.
  • Would it ever be useful to use the map_groups method? I know it's generally slower, but perhaps it could be useful at some point...

Comment thread src/corppa/poetry_detection/merge_excerpts.py Outdated
Comment thread src/corppa/poetry_detection/merge_excerpts.py Outdated
Comment thread src/corppa/poetry_detection/merge_excerpts.py Outdated
Comment thread src/corppa/poetry_detection/merge_excerpts.py Outdated
Comment thread src/corppa/poetry_detection/merge_excerpts.py
Comment thread src/corppa/poetry_detection/merge_excerpts.py
Comment thread src/corppa/poetry_detection/merge_excerpts.py Outdated
Comment thread src/corppa/poetry_detection/merge_excerpts.py Outdated
Comment thread src/corppa/poetry_detection/merge_excerpts.py Outdated
Comment thread src/corppa/poetry_detection/merge_excerpts.py Outdated
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>
@rlskoeser rlskoeser force-pushed the feature/find-overlapping-excerpts branch from 9552772 to ed08cc7 Compare April 20, 2026 19:52
@rlskoeser rlskoeser force-pushed the feature/find-overlapping-excerpts branch from ed08cc7 to 4b83c42 Compare April 20, 2026 20:09
@rlskoeser rlskoeser force-pushed the feature/find-overlapping-excerpts branch from 7763df1 to 6339d5b Compare April 21, 2026 14:45
@rlskoeser
Copy link
Copy Markdown
Collaborator Author

@laurejt I've cleaned up based on your feedback and added the core objects to our sphinx docs (which I also split out into separate files for each top-level module to be more manageable). There are more changes in the PR now because I've cleaned up the docs, so please focus on the identify_overlapping_excerpts method: do you see any problems with the logic? Do you have an opinion on the fields returned? Are there any important cases you think I should cover in the unit tests?

@rlskoeser rlskoeser requested a review from laurejt April 21, 2026 14:52
Copy link
Copy Markdown
Contributor

@laurejt laurejt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good. I think it'd be nice to include poem id fields for identify_overlapping_excerpts's output. See my comments for some additional unit tests that you should consider adding primarily around the ordering logic of results.

Comment on lines +61 to +62
groups of excerpts into merged excerpts. Fields are expected to correspond to
labeled excerpts (:class:`~corppa.poetry_detection.core.LabeledExcerpt`), and the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for there to be a group of unlabeled excerpts that should be merged (i.e., the collective poem_id would be null)?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible but not very likely; I don't think any of the logic here would cause problems in that case. If you want to improve the language please suggest changes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine, just figured I'd ask. It's just a slight type misalignment since a LabeledExcerpt's poem_id field cannot be None

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, technically it could be a mix of labeled and unlabeled excerpts

.explode()
.unique(), # combine in a single list, no repeats
# NOTE: does not handle overlapping spans;
# currently assumes text spans match or first in group is complete
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "complete" here mean is labeled?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what I meant is that it is the full text / longest span of the group but I agree it doesn't make a lot of sense - right now we're only using this on equal spans and I do have a note about that

Comment thread src/corppa/poetry_detection/merge_excerpts.py Outdated
)
)

# what fields are needed here?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including poem ids (when available) could also be useful

Comment thread tests/test_poetry_detection/test_merge_excerpts.py
Comment thread tests/test_poetry_detection/test_merge_excerpts.py
assert merged_ex1.identification_methods == {"manual", "refmatcha"}


### test for identify_overlapping_excerpts
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth adding a couple unit tests for testing the behavior of specifying min_overlap_factor and min_overlap_chars especially the boundaries (i.e., greater than vs greater)

Comment thread tests/test_poetry_detection/test_merge_excerpts.py
Comment thread tests/test_poetry_detection/test_merge_excerpts.py
@laurejt
Copy link
Copy Markdown
Contributor

laurejt commented Apr 21, 2026

@rlskoeser When I build the docs locally, I receive the following warning:

/Users/lauret/cdh-dev/ppa/corppa/README.md:36: WARNING: 'myst' cross-reference target not found: 'LICENSE' [myst.xref_missing]

rlskoeser and others added 3 commits April 21, 2026 13:20
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>
Resolves warning about unresolved link in readme
@rlskoeser
Copy link
Copy Markdown
Collaborator Author

Added missing target for license in the docs, added additional test cases.

@rlskoeser rlskoeser merged commit a0d336c into develop Apr 21, 2026
9 checks passed
@rlskoeser rlskoeser deleted the feature/find-overlapping-excerpts branch April 21, 2026 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants