Skip to content

Summarise by exon#558

Open
ch99l wants to merge 3 commits intodevel_pre_v4from
summarise-by-exon
Open

Summarise by exon#558
ch99l wants to merge 3 commits intodevel_pre_v4from
summarise-by-exon

Conversation

@ch99l
Copy link
Copy Markdown
Collaborator

@ch99l ch99l commented Apr 15, 2026

Related Issue

Fixes # (issue number)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (please specify if the change breaks existing functionality)
    • Non breaking change (the feature doesn't change existing functionality)
    • Breaking change (the feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance optimization

Description

Add summariseByExon() function (function is found in newly created summariseByExon.R script)

Adds a new exported utility function summariseByExon() that aggregates transcript-level expression from a bambu SummarizedExperiment down to unique exon-level counts.

How it works:

  • Unlists the transcript GRangesList to get individual exon-transcript pairs
  • Defines unique exons by coordinate identity (seqnames:start:end:strand)
  • Builds a sparse binary membership matrix (unique exons × transcripts) and multiplies it against the transcript count matrix to efficiently sum counts across all transcripts sharing each exon
  • Returns a RangedSummarizedExperiment with one row per unique exonic locus, preserving colData from the input and carrying GENEID (comma-separated for exons shared across genes) as row metadata

Returns: A RangedSummarizedExperiment where counts are for each unique exon across all overlapping transcripts.

Implementation Details

Code change does not affect main bambu code as it is implemented as a standalone function in summariseByExon.R

Impact of Changes

Briefly describe the impact of the implemented changes.
Generates a new SummarisedExperiment object containing transcript counts at exon level. Does not affect existing bambu codebase

Checklist

  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation (vignettes, man pages).
    Note: Documentation will be updated later
  • I have tested the code on a full dataset, and any differences have been described in the Impact of Changes section.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new user-facing helper to aggregate Bambu transcript-level counts to exon-level counts, returning a new SummarizedExperiment/RangedSummarizedExperiment keyed by unique exon coordinates.

Changes:

  • Introduces summariseByExon() to collapse transcript counts onto unique exonic loci (seqnames/start/end/strand).
  • Constructs exon-level GRanges and count matrix output as a new SummarizedExperiment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread R/summariseByExon.R

# Return as SummarizedExperiment
return(SummarizedExperiment(
assays = list(counts = exonCounts),
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SummarizedExperiment(assays = ...) is passed a base list, but the rest of the codebase consistently uses SimpleList(...) for assays when constructing SummarizedExperiment objects. Using SimpleList here would keep the return value consistent with other constructors and avoid any subtle differences in downstream behavior.

Suggested change
assays = list(counts = exonCounts),
assays = S4Vectors::SimpleList(counts = exonCounts),

Copilot uses AI. Check for mistakes.
Comment thread R/summariseByExon.R
Comment on lines +21 to +22
exonDt <- data.table(
TXNAME = names(exonRanges),
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TXNAME = names(exonRanges) is unlikely to match transcript IDs in rownames(se) because unlist(rowRanges(se), use.names=TRUE) typically produces names like "<TXNAME>.<exon_index>" when the per-exon ranges are unnamed. This will make match(exonDt$TXNAME, txNames) return NA, and sparseMatrix(i=..., j=...) will error or silently drop mappings. Consider deriving transcript IDs explicitly (e.g., via stack(rowRanges(se))$ind, or rep(names(rowRanges(se)), elementNROWS(rowRanges(se)))) and use that value for TXNAME.

Suggested change
exonDt <- data.table(
TXNAME = names(exonRanges),
exonTxNames <- rep(txNames, lengths(rowRanges(se)))
exonDt <- data.table(
TXNAME = exonTxNames,

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think original code is good, names(exonRanges) is simpler

Comment thread R/summariseByExon.R
#' @importFrom GenomicRanges GRanges seqnames start end strand
#' @importFrom IRanges IRanges
#' @export
summariseByExon <- function(se) {
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new exported function introduces non-trivial aggregation logic (mapping transcript-by-exon membership and summing counts) but there are no accompanying unit tests. Please add a testthat test that runs summariseByExon() on an existing extdata se object and asserts basic invariants (e.g., output class, column data preserved, and a few known exon-level totals).

Copilot uses AI. Check for mistakes.
@ch99l ch99l requested a review from cying111 April 15, 2026 08:01
Covers end-to-end execution on the bundled fixture SE and verifies the
exon-aggregation identity against synthetic Poisson counts, including a
per-exon hand-check on the most-shared exon.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cying111
Copy link
Copy Markdown
Collaborator

cying111 commented Apr 15, 2026

I just asked claude code to review the PR and test the PR, so it created unit tests for this which I have already committed to the summarise-by-exon branch.

And here is a summary of the issues identified by claude code, my summary is maybe can ignore the MInor issues, but the blocking issues and worth addressing issues can fix.

Prompt I used:
help me review this PR: #558, document your footprint inside the folder, and test with small testdata

Blocking / should fix before merge

  1. NAMESPACE and man page not regenerated. The @export tag was added but
    NAMESPACE has no export(summariseByExon) and man/ has no
    summariseByExon.Rd. Run devtools::document() / roxygen2::roxygenise()
    and commit the results — otherwise the function isn't actually exported
    from the installed package.
  2. No unit test. tests/testthat/ has a test file per exported function
    (e.g. test_transcriptToGeneExpression.R) but none for this one. A minimal
    test using the same fixture SE would match project conventions and catch
    regressions on the aggregation identity above.
  3. assays = list(counts = …) in the returned SE — the rest of the
    codebase uses S4Vectors::SimpleList(...) (see
    R/transcriptToGeneExpression.R:47). Works (SE coerces) but inconsistent.
    Copilot flagged the same.

Worth addressing

  1. No input validation. The function assumes rowData(se)$GENEID exists
    and that assays(se)$counts is present and numeric. Passing anything else
    fails with confusing downstream errors. A short
    stopifnot(is(se, "RangedSummarizedExperiment"), "counts" %in% assayNames(se), "GENEID" %in% colnames(rowData(se)))
    would be enough.
  2. Ignores metadata(se)$incompatibleCounts and $nonuniqueCounts.
    transcriptToGeneExpression folds these in before aggregating. Whether
    that's desired here is a design call, but at minimum the docstring should
    say the exon-level counts do not include incompatible/nonunique reads,
    so users aren't surprised that sum(exonCounts) != sum(assayed reads).
  3. No CPM assay. transcriptToGeneExpression returns counts + CPM; this
    returns counts only. Fine, but worth being explicit in the doc.
  4. Roxygen @examples missing. Other exported functions carry a runnable
    example that loads the fixture SE — easy to add here and gives R CMD check
    coverage of the example path.
  5. @import data.tabletranscriptToGeneExpression does the same, so
    this is just matching existing style, but Bioconductor review usually
    prefers @importFrom data.table data.table := to narrow the import.

Minor

  1. Exon rownames are chr:start:end:strand strings — readable but long. Fine
    as-is; document the format.
  2. If a transcript's rowRanges has no names, unlist(..., use.names = TRUE)
    produces NA names, which would then cascade into exonDt$TXNAME. Not a
    real concern for bambu output (txs are always named) but trivially
    defensible via the validation step above.
  3. PR description says "changes remain isolated … do not modify core Bambu
    code." True for the code, but NAMESPACE/man files do need to change.

@jonathangoeke
Copy link
Copy Markdown
Member

jonathangoeke commented Apr 16, 2026

Code review

Found 8 issues (scored 0-100 for confidence): (JG:threshold is 80, all below threshold)

  1. NAMESPACE not regenerated -- function is not actually exported (score: 75). @export tag is present but devtools::document() / roxygen2::roxygenise() was not run. No export(summariseByExon) in NAMESPACE and no man/summariseByExon.Rd. The function will not be accessible after library(bambu).

#' @importFrom IRanges IRanges
#' @export
summariseByExon <- function(se) {

  1. Genomic sort removed during refactoring -- regression (score: 75). Commit b4277ca had result <- result[order(seqnames, start, end)]. Commit dbe2af9 refactored to SummarizedExperiment and silently dropped the sort. Output is now in data.table first-occurrence order, not genomic order. Bioconductor convention expects sorted GRanges in a RangedSummarizedExperiment.

# Build GRanges for unique exons
exonGRanges <- GRanges(
seqnames = exonMeta$seqnames,
ranges = IRanges(start = exonMeta$start, end = exonMeta$end),
strand = exonMeta$strand
)
names(exonGRanges) <- exonMeta$exon_id
mcols(exonGRanges)$GENEID <- exonMeta$GENEID
# Return as SummarizedExperiment
return(SummarizedExperiment(
assays = list(counts = exonCounts),
rowRanges = exonGRanges,
colData = colData(se)
))
}

  1. Only counts assay propagated; CPM, fullLengthCounts, uniqueCounts silently dropped (score: 50). transcriptToGeneExpression returns both counts and CPM. This function returns only counts without warning. May be intentional but is undocumented divergence from the analogous function.

# Aggregate counts: unique_exons x samples
txCounts <- assays(se)$counts
exonCounts <- exonTxMat %*% txCounts

  1. list() instead of SimpleList() for assays (score: 50). Every other SE constructor in the codebase uses S4Vectors::SimpleList. Even this function's own test uses SimpleList. Functionally equivalent due to auto-coercion, but inconsistent with codebase convention.

return(SummarizedExperiment(
assays = list(counts = exonCounts),
rowRanges = exonGRanges,

  1. @return claims RangedSummarizedExperiment but class is not imported (score: 50). The @importFrom on line 11 imports only SummarizedExperiment (the constructor). The analogous function avoids this by saying @return A SummarizedExperiment object. Minor documentation pedantry.

#' @param se a \code{SummarizedExperiment} object from \code{\link{bambu}}
#' @return A \code{RangedSummarizedExperiment} with exon-level counts
#' @details Counts are summed across all transcripts that share the same exon
#' (defined by identical seqnames, start, end, and strand). The returned
#' counts therefore represent the total evidence attributed to each unique
#' exonic locus across all overlapping transcripts.
#' @import data.table
#' @importFrom Matrix sparseMatrix
#' @importFrom SummarizedExperiment assays rowRanges rowData colData SummarizedExperiment

  1. No @examples block for exported function (score: 25). Every other exported function in the package has @examples. BiocCheck will warn. The test file already contains the exact idiom needed.

#' Summarise transcript expression to exon-level expression
#' @title summarise by exon
#' @param se a \code{SummarizedExperiment} object from \code{\link{bambu}}
#' @return A \code{RangedSummarizedExperiment} with exon-level counts
#' @details Counts are summed across all transcripts that share the same exon
#' (defined by identical seqnames, start, end, and strand). The returned
#' counts therefore represent the total evidence attributed to each unique
#' exonic locus across all overlapping transcripts.
#' @import data.table
#' @importFrom Matrix sparseMatrix
#' @importFrom SummarizedExperiment assays rowRanges rowData colData SummarizedExperiment
#' @importFrom GenomicRanges GRanges seqnames start end strand
#' @importFrom IRanges IRanges
#' @export

  1. incompatibleCounts / nonuniqueCounts not included (score: 25). transcriptToGeneExpression adds these before aggregating. However, since incompatible reads cannot be attributed to specific exons, omitting them is arguably correct.

# Aggregate counts: unique_exons x samples
txCounts <- assays(se)$counts
exonCounts <- exonTxMat %*% txCounts

  1. colData rownames not realigned to assay column names (score: 0, false positive). Matrix multiplication preserves column names, so colData(se) rownames already match.

rowRanges = exonGRanges,
colData = colData(se)
))

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Copy link
Copy Markdown
Member

@jonathangoeke jonathangoeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specific comments:
-names are not very informative. do we need names? this should ideally be official exonIds, which we might not track in bambu? align with other summarisefunctions
-mcols: need to have a map of exon to transcripts.
-mcols: add if novel exon or annotated exon (only present in novel transcripts vs at least one annotated transcript)
-provide option which assays to summarise (by default all?)

general comments/combine this PR with other updates later?

  • do not add extra file. instead combined with existing file. rename file to summariseExpression, and summariseExpression_utlity?
  • make sure that code is consistently implemented, use same style reuse code if possible (e.g same as transcriptToGeneexpression)
  • add example in documentation/ user facing functions should be well documented
  • add documentation, or create draft issue that this needs to be documented later
  • one function to summariseExpression('exons/transcript/start/end')? or individual functions? they should use same structure to avoid uplicated code. e.g define summary-map, summarise assays, summarise ranges, summarise mcol, ...? if possible
  • use dplyr to be consistent with major code in bambu

unit test:
-document unit test lines, what is tested? why is this relevant? if the unit test function needs to be clear otherwise not clear if fails and difficult to maintain

  • unit test that just reimplement the code are not so useful. It might be helpful to define a manual examples of 2 genes with 1 trancript and with 3 transcripts and multpile exons, and transcripts counts, and define expected output e.g. 5 rows, 3 read counts
    -unit test should cover multiple samples (min 2)
    -current unit test object has zero count is that meaningful?

@ch99l ch99l linked an issue Apr 16, 2026 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

summarise by exon count

4 participants