Summarise by exon by ch99l · Pull Request #558 · GoekeLab/bambu

ch99l · 2026-04-15T06:48:56Z

Related Issue

Fixes # (issue number)

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (please specify if the change breaks existing functionality)
- Non breaking change (the feature doesn't change existing functionality)
- Breaking change (the feature that would cause existing functionality to not work as expected)
Documentation update
Performance optimization

Description

Add summariseByExon() function (function is found in newly created summariseByExon.R script)

Adds a new exported utility function summariseByExon() that aggregates transcript-level expression from a bambu SummarizedExperiment down to unique exon-level counts.

How it works:

Unlists the transcript GRangesList to get individual exon-transcript pairs
Defines unique exons by coordinate identity (seqnames:start:end:strand)
Builds a sparse binary membership matrix (unique exons × transcripts) and multiplies it against the transcript count matrix to efficiently sum counts across all transcripts sharing each exon
Returns a RangedSummarizedExperiment with one row per unique exonic locus, preserving colData from the input and carrying GENEID (comma-separated for exons shared across genes) as row metadata

Returns: A RangedSummarizedExperiment where counts are for each unique exon across all overlapping transcripts.

Implementation Details

Code change does not affect main bambu code as it is implemented as a standalone function in summariseByExon.R

Impact of Changes

Briefly describe the impact of the implemented changes.
Generates a new SummarisedExperiment object containing transcript counts at exon level. Does not affect existing bambu codebase

Checklist

My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation (vignettes, man pages).
Note: Documentation will be updated later
I have tested the code on a full dataset, and any differences have been described in the Impact of Changes section.

Copilot

Pull request overview

Adds a new user-facing helper to aggregate Bambu transcript-level counts to exon-level counts, returning a new SummarizedExperiment/RangedSummarizedExperiment keyed by unique exon coordinates.

Changes:

Introduces summariseByExon() to collapse transcript counts onto unique exonic loci (seqnames/start/end/strand).
Constructs exon-level GRanges and count matrix output as a new SummarizedExperiment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-15T06:54:02Z

+
+    # Return as SummarizedExperiment
+    return(SummarizedExperiment(
+        assays   = list(counts = exonCounts),


SummarizedExperiment(assays = ...) is passed a base list, but the rest of the codebase consistently uses SimpleList(...) for assays when constructing SummarizedExperiment objects. Using SimpleList here would keep the return value consistent with other constructors and avoid any subtle differences in downstream behavior.

Suggested change

assays = list(counts = exonCounts),

assays = S4Vectors::SimpleList(counts = exonCounts),

Copilot · 2026-04-15T06:54:02Z

+    exonDt <- data.table(
+        TXNAME   = names(exonRanges),


TXNAME = names(exonRanges) is unlikely to match transcript IDs in rownames(se) because unlist(rowRanges(se), use.names=TRUE) typically produces names like "<TXNAME>.<exon_index>" when the per-exon ranges are unnamed. This will make match(exonDt$TXNAME, txNames) return NA, and sparseMatrix(i=..., j=...) will error or silently drop mappings. Consider deriving transcript IDs explicitly (e.g., via stack(rowRanges(se))$ind, or rep(names(rowRanges(se)), elementNROWS(rowRanges(se)))) and use that value for TXNAME.

Suggested change

exonDt <- data.table(

TXNAME = names(exonRanges),

exonTxNames <- rep(txNames, lengths(rowRanges(se)))

exonDt <- data.table(

TXNAME = exonTxNames,

i think original code is good, names(exonRanges) is simpler

Copilot · 2026-04-15T06:54:02Z

+#' @importFrom GenomicRanges GRanges seqnames start end strand
+#' @importFrom IRanges IRanges
+#' @export
+summariseByExon <- function(se) {


This new exported function introduces non-trivial aggregation logic (mapping transcript-by-exon membership and summing counts) but there are no accompanying unit tests. Please add a testthat test that runs summariseByExon() on an existing extdata se object and asserts basic invariants (e.g., output class, column data preserved, and a few known exon-level totals).

Covers end-to-end execution on the bundled fixture SE and verifies the exon-aggregation identity against synthetic Poisson counts, including a per-exon hand-check on the most-shared exon. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cying111 · 2026-04-15T08:12:22Z

I just asked claude code to review the PR and test the PR, so it created unit tests for this which I have already committed to the summarise-by-exon branch.

And here is a summary of the issues identified by claude code, my summary is maybe can ignore the MInor issues, but the blocking issues and worth addressing issues can fix.

Prompt I used:
help me review this PR: #558, document your footprint inside the folder, and test with small testdata

Blocking / should fix before merge

NAMESPACE and man page not regenerated. The @export tag was added but
NAMESPACE has no export(summariseByExon) and man/ has no
summariseByExon.Rd. Run devtools::document() / roxygen2::roxygenise()
and commit the results — otherwise the function isn't actually exported
from the installed package.
No unit test. tests/testthat/ has a test file per exported function
(e.g. test_transcriptToGeneExpression.R) but none for this one. A minimal
test using the same fixture SE would match project conventions and catch
regressions on the aggregation identity above.
assays = list(counts = …) in the returned SE — the rest of the
codebase uses S4Vectors::SimpleList(...) (see
R/transcriptToGeneExpression.R:47). Works (SE coerces) but inconsistent.
Copilot flagged the same.

Worth addressing

No input validation. The function assumes rowData(se)$GENEID exists
and that assays(se)$counts is present and numeric. Passing anything else
fails with confusing downstream errors. A short
stopifnot(is(se, "RangedSummarizedExperiment"), "counts" %in% assayNames(se), "GENEID" %in% colnames(rowData(se)))
would be enough.
Ignores metadata(se)$incompatibleCounts and $nonuniqueCounts.
transcriptToGeneExpression folds these in before aggregating. Whether
that's desired here is a design call, but at minimum the docstring should
say the exon-level counts do not include incompatible/nonunique reads,
so users aren't surprised that sum(exonCounts) != sum(assayed reads).
No CPM assay. transcriptToGeneExpression returns counts + CPM; this
returns counts only. Fine, but worth being explicit in the doc.
Roxygen @examples missing. Other exported functions carry a runnable
example that loads the fixture SE — easy to add here and gives R CMD check
coverage of the example path.
@import data.table — transcriptToGeneExpression does the same, so
this is just matching existing style, but Bioconductor review usually
prefers @importFrom data.table data.table := to narrow the import.

Minor

Exon rownames are chr:start:end:strand strings — readable but long. Fine
as-is; document the format.
If a transcript's rowRanges has no names, unlist(..., use.names = TRUE)
produces NA names, which would then cascade into exonDt$TXNAME. Not a
real concern for bambu output (txs are always named) but trivially
defensible via the validation step above.
PR description says "changes remain isolated … do not modify core Bambu
code." True for the code, but NAMESPACE/man files do need to change.

jonathangoeke · 2026-04-16T02:33:51Z

Code review

Found 8 issues (scored 0-100 for confidence): (JG:threshold is 80, all below threshold)

NAMESPACE not regenerated -- function is not actually exported (score: 75). @export tag is present but devtools::document() / roxygen2::roxygenise() was not run. No export(summariseByExon) in NAMESPACE and no man/summariseByExon.Rd. The function will not be accessible after library(bambu).

bambu/R/summariseByExon.R

Lines 13 to 15 in e6c5547

    
           #' @importFrom IRanges IRanges 
        
           #' @export 
        
           summariseByExon <- function(se) {

Genomic sort removed during refactoring -- regression (score: 75). Commit b4277ca had result <- result[order(seqnames, start, end)]. Commit dbe2af9 refactored to SummarizedExperiment and silently dropped the sort. Output is now in data.table first-occurrence order, not genomic order. Bioconductor convention expects sorted GRanges in a RangedSummarizedExperiment.

bambu/R/summariseByExon.R

Lines 65 to 81 in e6c5547

    
               # Build GRanges for unique exons 
        
               exonGRanges <- GRanges( 
        
                   seqnames = exonMeta$seqnames, 
        
                   ranges   = IRanges(start = exonMeta$start, end = exonMeta$end), 
        
                   strand   = exonMeta$strand 
        
               ) 
        
               names(exonGRanges) <- exonMeta$exon_id 
        
               mcols(exonGRanges)$GENEID <- exonMeta$GENEID 
        
               # Return as SummarizedExperiment 
        
               return(SummarizedExperiment( 
        
                   assays   = list(counts = exonCounts), 
        
                   rowRanges = exonGRanges, 
        
                   colData  = colData(se) 
        
               )) 
        
           }

Only counts assay propagated; CPM, fullLengthCounts, uniqueCounts silently dropped (score: 50). transcriptToGeneExpression returns both counts and CPM. This function returns only counts without warning. May be intentional but is undocumented divergence from the analogous function.

bambu/R/summariseByExon.R

Lines 62 to 64 in e6c5547

    
           # Aggregate counts: unique_exons x samples 
        
           txCounts   <- assays(se)$counts 
        
           exonCounts <- exonTxMat %*% txCounts

list() instead of SimpleList() for assays (score: 50). Every other SE constructor in the codebase uses S4Vectors::SimpleList. Even this function's own test uses SimpleList. Functionally equivalent due to auto-coercion, but inconsistent with codebase convention.

bambu/R/summariseByExon.R

Lines 76 to 78 in e6c5547

    
           return(SummarizedExperiment( 
        
               assays   = list(counts = exonCounts), 
        
               rowRanges = exonGRanges,

@return claims RangedSummarizedExperiment but class is not imported (score: 50). The @importFrom on line 11 imports only SummarizedExperiment (the constructor). The analogous function avoids this by saying @return A SummarizedExperiment object. Minor documentation pedantry.

bambu/R/summariseByExon.R

Lines 3 to 11 in e6c5547

    
           #' @param se a \code{SummarizedExperiment} object from \code{\link{bambu}} 
        
           #' @return A \code{RangedSummarizedExperiment} with exon-level counts 
        
           #' @details Counts are summed across all transcripts that share the same exon 
        
           #'   (defined by identical seqnames, start, end, and strand). The returned 
        
           #'   counts therefore represent the total evidence attributed to each unique 
        
           #'   exonic locus across all overlapping transcripts. 
        
           #' @import data.table 
        
           #' @importFrom Matrix sparseMatrix 
        
           #' @importFrom SummarizedExperiment assays rowRanges rowData colData SummarizedExperiment

No @examples block for exported function (score: 25). Every other exported function in the package has @examples. BiocCheck will warn. The test file already contains the exact idiom needed.

bambu/R/summariseByExon.R

Lines 1 to 14 in e6c5547

    
           #' Summarise transcript expression to exon-level expression 
        
           #' @title summarise by exon 
        
           #' @param se a \code{SummarizedExperiment} object from \code{\link{bambu}} 
        
           #' @return A \code{RangedSummarizedExperiment} with exon-level counts 
        
           #' @details Counts are summed across all transcripts that share the same exon 
        
           #'   (defined by identical seqnames, start, end, and strand). The returned 
        
           #'   counts therefore represent the total evidence attributed to each unique 
        
           #'   exonic locus across all overlapping transcripts. 
        
           #' @import data.table 
        
           #' @importFrom Matrix sparseMatrix 
        
           #' @importFrom SummarizedExperiment assays rowRanges rowData colData SummarizedExperiment 
        
           #' @importFrom GenomicRanges GRanges seqnames start end strand 
        
           #' @importFrom IRanges IRanges 
        
           #' @export

incompatibleCounts / nonuniqueCounts not included (score: 25). transcriptToGeneExpression adds these before aggregating. However, since incompatible reads cannot be attributed to specific exons, omitting them is arguably correct.

bambu/R/summariseByExon.R

Lines 62 to 64 in e6c5547

    
           # Aggregate counts: unique_exons x samples 
        
           txCounts   <- assays(se)$counts 
        
           exonCounts <- exonTxMat %*% txCounts

colData rownames not realigned to assay column names (score: 0, false positive). Matrix multiplication preserves column names, so colData(se) rownames already match.

bambu/R/summariseByExon.R

Lines 78 to 80 in e6c5547

    
               rowRanges = exonGRanges, 
        
               colData  = colData(se) 
        
           ))

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

jonathangoeke

specific comments:
-names are not very informative. do we need names? this should ideally be official exonIds, which we might not track in bambu? align with other summarisefunctions
-mcols: need to have a map of exon to transcripts.
-mcols: add if novel exon or annotated exon (only present in novel transcripts vs at least one annotated transcript)
-provide option which assays to summarise (by default all?)

general comments/combine this PR with other updates later?

do not add extra file. instead combined with existing file. rename file to summariseExpression, and summariseExpression_utlity?
make sure that code is consistently implemented, use same style reuse code if possible (e.g same as transcriptToGeneexpression)
add example in documentation/ user facing functions should be well documented
add documentation, or create draft issue that this needs to be documented later
one function to summariseExpression('exons/transcript/start/end')? or individual functions? they should use same structure to avoid uplicated code. e.g define summary-map, summarise assays, summarise ranges, summarise mcol, ...? if possible
use dplyr to be consistent with major code in bambu

unit test:
-document unit test lines, what is tested? why is this relevant? if the unit test function needs to be clear otherwise not clear if fails and difficult to maintain

unit test that just reimplement the code are not so useful. It might be helpful to define a manual examples of 2 genes with 1 trancript and with 3 transcripts and multpile exons, and transcripts counts, and define expected output e.g. 5 rows, 3 read counts
-unit test should cover multiple samples (min 2)
-current unit test object has zero count is that meaningful?

ch99l added 2 commits April 14, 2026 15:19

Add summariseByExon function

b4277ca

Return exon-level SummarisedExperiment instead of data.table

dbe2af9

ch99l requested review from SuiYue-2308, Copilot and jonathangoeke April 15, 2026 06:49

Copilot started reviewing on behalf of ch99l April 15, 2026 06:49 View session

Copilot AI reviewed Apr 15, 2026

View reviewed changes

ch99l requested a review from cying111 April 15, 2026 08:01

Add tests for summariseByExon

e6c5547

Covers end-to-end execution on the bundled fixture SE and verifies the exon-aggregation identity against synthetic Poisson counts, including a per-exon hand-check on the most-shared exon. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jonathangoeke reviewed Apr 16, 2026

View reviewed changes

ch99l linked an issue Apr 16, 2026 that may be closed by this pull request

summarise by exon count #563

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Summarise by exon#558

Summarise by exon#558
ch99l wants to merge 3 commits intodevel_pre_v4from
summarise-by-exon

ch99l commented Apr 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 15, 2026

Uh oh!

Copilot AI Apr 15, 2026

Uh oh!

cying111 Apr 16, 2026

Uh oh!

Copilot AI Apr 15, 2026

Uh oh!

cying111 commented Apr 15, 2026 •

edited

Loading

Uh oh!

jonathangoeke commented Apr 16, 2026 •

edited

Loading

Uh oh!

jonathangoeke left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	assays = list(counts = exonCounts),
	assays = S4Vectors::SimpleList(counts = exonCounts),

Conversation

ch99l commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Related Issue

Type of Change

Description

Implementation Details

Impact of Changes

Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

cying111 Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

cying111 commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Blocking / should fix before merge

Worth addressing

Minor

Uh oh!

jonathangoeke commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code review

Uh oh!

jonathangoeke left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ch99l commented Apr 15, 2026 •

edited

Loading

cying111 commented Apr 15, 2026 •

edited

Loading

jonathangoeke commented Apr 16, 2026 •

edited

Loading