Add method to find overlapping excerpts#272
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #272 +/- ##
===========================================
+ Coverage 79.76% 79.82% +0.05%
===========================================
Files 23 23
Lines 2120 2126 +6
===========================================
+ Hits 1691 1697 +6
Misses 429 429 🚀 New features to boost your workflow:
|
laurejt
left a comment
There was a problem hiding this comment.
I've added my initial thoughts on merge_excerpts.py, but haven't looked at its unit tests yet.
Some higher-level musings:
- Perhaps it'd be useful to add an additional field with excerpt length given how often we operate on it.
- Would it ever be useful to use the
map_groupsmethod? I know it's generally slower, but perhaps it could be useful at some point...
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>
9552772 to
ed08cc7
Compare
ed08cc7 to
4b83c42
Compare
Assisted-by: Claude:sonnet-4-6 [ClaudeCode]
7763df1 to
6339d5b
Compare
|
@laurejt I've cleaned up based on your feedback and added the core objects to our sphinx docs (which I also split out into separate files for each top-level module to be more manageable). There are more changes in the PR now because I've cleaned up the docs, so please focus on the |
laurejt
left a comment
There was a problem hiding this comment.
This is looking really good. I think it'd be nice to include poem id fields for identify_overlapping_excerpts's output. See my comments for some additional unit tests that you should consider adding primarily around the ordering logic of results.
| groups of excerpts into merged excerpts. Fields are expected to correspond to | ||
| labeled excerpts (:class:`~corppa.poetry_detection.core.LabeledExcerpt`), and the |
There was a problem hiding this comment.
Is it possible for there to be a group of unlabeled excerpts that should be merged (i.e., the collective poem_id would be null)?
There was a problem hiding this comment.
Possible but not very likely; I don't think any of the logic here would cause problems in that case. If you want to improve the language please suggest changes.
There was a problem hiding this comment.
I think it's fine, just figured I'd ask. It's just a slight type misalignment since a LabeledExcerpt's poem_id field cannot be None
There was a problem hiding this comment.
right, technically it could be a mix of labeled and unlabeled excerpts
| .explode() | ||
| .unique(), # combine in a single list, no repeats | ||
| # NOTE: does not handle overlapping spans; | ||
| # currently assumes text spans match or first in group is complete |
There was a problem hiding this comment.
Does "complete" here mean is labeled?
There was a problem hiding this comment.
I think what I meant is that it is the full text / longest span of the group but I agree it doesn't make a lot of sense - right now we're only using this on equal spans and I do have a note about that
| ) | ||
| ) | ||
|
|
||
| # what fields are needed here? |
There was a problem hiding this comment.
Including poem ids (when available) could also be useful
| assert merged_ex1.identification_methods == {"manual", "refmatcha"} | ||
|
|
||
|
|
||
| ### test for identify_overlapping_excerpts |
There was a problem hiding this comment.
I think it's worth adding a couple unit tests for testing the behavior of specifying min_overlap_factor and min_overlap_chars especially the boundaries (i.e., greater than vs greater)
|
@rlskoeser When I build the docs locally, I receive the following warning:
|
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>
Resolves warning about unresolved link in readme
|
Added missing target for license in the docs, added additional test cases. |
Associated Issue(s): resolves #251
Changes in this PR
Notes
_rightdefault but haven't come up with anything better); extra fields have been helpful when testing, should they be optional? removed?Review checklist