Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new user-facing helper to aggregate Bambu transcript-level counts to exon-level counts, returning a new SummarizedExperiment/RangedSummarizedExperiment keyed by unique exon coordinates.
Changes:
- Introduces
summariseByExon()to collapse transcript counts onto unique exonic loci (seqnames/start/end/strand). - Constructs exon-level
GRangesand count matrix output as a newSummarizedExperiment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| # Return as SummarizedExperiment | ||
| return(SummarizedExperiment( | ||
| assays = list(counts = exonCounts), |
There was a problem hiding this comment.
SummarizedExperiment(assays = ...) is passed a base list, but the rest of the codebase consistently uses SimpleList(...) for assays when constructing SummarizedExperiment objects. Using SimpleList here would keep the return value consistent with other constructors and avoid any subtle differences in downstream behavior.
| assays = list(counts = exonCounts), | |
| assays = S4Vectors::SimpleList(counts = exonCounts), |
| exonDt <- data.table( | ||
| TXNAME = names(exonRanges), |
There was a problem hiding this comment.
TXNAME = names(exonRanges) is unlikely to match transcript IDs in rownames(se) because unlist(rowRanges(se), use.names=TRUE) typically produces names like "<TXNAME>.<exon_index>" when the per-exon ranges are unnamed. This will make match(exonDt$TXNAME, txNames) return NA, and sparseMatrix(i=..., j=...) will error or silently drop mappings. Consider deriving transcript IDs explicitly (e.g., via stack(rowRanges(se))$ind, or rep(names(rowRanges(se)), elementNROWS(rowRanges(se)))) and use that value for TXNAME.
| exonDt <- data.table( | |
| TXNAME = names(exonRanges), | |
| exonTxNames <- rep(txNames, lengths(rowRanges(se))) | |
| exonDt <- data.table( | |
| TXNAME = exonTxNames, |
There was a problem hiding this comment.
i think original code is good, names(exonRanges) is simpler
| #' @importFrom GenomicRanges GRanges seqnames start end strand | ||
| #' @importFrom IRanges IRanges | ||
| #' @export | ||
| summariseByExon <- function(se) { |
There was a problem hiding this comment.
This new exported function introduces non-trivial aggregation logic (mapping transcript-by-exon membership and summing counts) but there are no accompanying unit tests. Please add a testthat test that runs summariseByExon() on an existing extdata se object and asserts basic invariants (e.g., output class, column data preserved, and a few known exon-level totals).
Covers end-to-end execution on the bundled fixture SE and verifies the exon-aggregation identity against synthetic Poisson counts, including a per-exon hand-check on the most-shared exon. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
I just asked claude code to review the PR and test the PR, so it created unit tests for this which I have already committed to the summarise-by-exon branch. And here is a summary of the issues identified by claude code, my summary is maybe can ignore the MInor issues, but the blocking issues and worth addressing issues can fix. Prompt I used: Blocking / should fix before merge
Worth addressing
Minor
|
Code reviewFound 8 issues (scored 0-100 for confidence): (JG:threshold is 80, all below threshold)
Lines 13 to 15 in e6c5547
Lines 65 to 81 in e6c5547
Lines 62 to 64 in e6c5547
Lines 76 to 78 in e6c5547
Lines 3 to 11 in e6c5547
Lines 1 to 14 in e6c5547
Lines 62 to 64 in e6c5547
Lines 78 to 80 in e6c5547 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
jonathangoeke
left a comment
There was a problem hiding this comment.
specific comments:
-names are not very informative. do we need names? this should ideally be official exonIds, which we might not track in bambu? align with other summarisefunctions
-mcols: need to have a map of exon to transcripts.
-mcols: add if novel exon or annotated exon (only present in novel transcripts vs at least one annotated transcript)
-provide option which assays to summarise (by default all?)
general comments/combine this PR with other updates later?
- do not add extra file. instead combined with existing file. rename file to summariseExpression, and summariseExpression_utlity?
- make sure that code is consistently implemented, use same style reuse code if possible (e.g same as transcriptToGeneexpression)
- add example in documentation/ user facing functions should be well documented
- add documentation, or create draft issue that this needs to be documented later
- one function to summariseExpression('exons/transcript/start/end')? or individual functions? they should use same structure to avoid uplicated code. e.g define summary-map, summarise assays, summarise ranges, summarise mcol, ...? if possible
- use dplyr to be consistent with major code in bambu
unit test:
-document unit test lines, what is tested? why is this relevant? if the unit test function needs to be clear otherwise not clear if fails and difficult to maintain
- unit test that just reimplement the code are not so useful. It might be helpful to define a manual examples of 2 genes with 1 trancript and with 3 transcripts and multpile exons, and transcripts counts, and define expected output e.g. 5 rows, 3 read counts
-unit test should cover multiple samples (min 2)
-current unit test object has zero count is that meaningful?
Related Issue
Fixes # (issue number)
Type of Change
Description
Add summariseByExon() function (function is found in newly created summariseByExon.R script)
Adds a new exported utility function summariseByExon() that aggregates transcript-level expression from a bambu SummarizedExperiment down to unique exon-level counts.
How it works:
Returns: A RangedSummarizedExperiment where counts are for each unique exon across all overlapping transcripts.
Implementation Details
Code change does not affect main bambu code as it is implemented as a standalone function in summariseByExon.R
Impact of Changes
Briefly describe the impact of the implemented changes.
Generates a new SummarisedExperiment object containing transcript counts at exon level. Does not affect existing bambu codebase
Checklist
Note: Documentation will be updated later