Skip to content
Open
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
81 changes: 81 additions & 0 deletions R/summariseByExon.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
#' Summarise transcript expression to exon-level expression
#' @title summarise by exon
#' @param se a \code{SummarizedExperiment} object from \code{\link{bambu}}
#' @return A \code{RangedSummarizedExperiment} with exon-level counts
#' @details Counts are summed across all transcripts that share the same exon
#' (defined by identical seqnames, start, end, and strand). The returned
#' counts therefore represent the total evidence attributed to each unique
#' exonic locus across all overlapping transcripts.
#' @import data.table
#' @importFrom Matrix sparseMatrix
#' @importFrom SummarizedExperiment assays rowRanges rowData colData SummarizedExperiment
#' @importFrom GenomicRanges GRanges seqnames start end strand
#' @importFrom IRanges IRanges
#' @export
summariseByExon <- function(se) {
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new exported function introduces non-trivial aggregation logic (mapping transcript-by-exon membership and summing counts) but there are no accompanying unit tests. Please add a testthat test that runs summariseByExon() on an existing extdata se object and asserts basic invariants (e.g., output class, column data preserved, and a few known exon-level totals).

Copilot uses AI. Check for mistakes.
# Unlist GRangesList: one row per exon-transcript combination
exonRanges <- unlist(rowRanges(se), use.names = TRUE)

# Build a data.table of exon-transcript pairs
txNames <- rownames(se)
exonDt <- data.table(
TXNAME = names(exonRanges),
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TXNAME = names(exonRanges) is unlikely to match transcript IDs in rownames(se) because unlist(rowRanges(se), use.names=TRUE) typically produces names like "<TXNAME>.<exon_index>" when the per-exon ranges are unnamed. This will make match(exonDt$TXNAME, txNames) return NA, and sparseMatrix(i=..., j=...) will error or silently drop mappings. Consider deriving transcript IDs explicitly (e.g., via stack(rowRanges(se))$ind, or rep(names(rowRanges(se)), elementNROWS(rowRanges(se)))) and use that value for TXNAME.

Suggested change
exonDt <- data.table(
TXNAME = names(exonRanges),
exonTxNames <- rep(txNames, lengths(rowRanges(se)))
exonDt <- data.table(
TXNAME = exonTxNames,

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think original code is good, names(exonRanges) is simpler

seqnames = as.character(seqnames(exonRanges)),
start = start(exonRanges),
end = end(exonRanges),
strand = as.character(strand(exonRanges))
)

# Unique exon key: seqnames:start:end:strand
exonDt[, exon_id := paste(seqnames, start, end, strand, sep = ":")]

# Attach GENEID from rowData
geneDt <- data.table(
TXNAME = rownames(se),
GENEID = rowData(se)$GENEID
)
exonDt <- geneDt[exonDt, on = "TXNAME"]

# Collapse metadata per unique exon
exonMeta <- exonDt[, .(
seqnames = seqnames[1],
start = start[1],
end = end[1],
strand = strand[1],
GENEID = paste(sort(unique(GENEID)), collapse = ",")
), by = exon_id]

# Build sparse binary matrix: unique_exons x transcripts
# entry [i, j] = 1 if transcript j contains unique exon i
uniqueExons <- exonMeta$exon_id
exonIdx <- match(exonDt$exon_id, uniqueExons)
txIdx <- match(exonDt$TXNAME, txNames)

exonTxMat <- sparseMatrix(
i = exonIdx,
j = txIdx,
x = 1L,
dims = c(length(uniqueExons), length(txNames)),
dimnames = list(uniqueExons, txNames)
)

# Aggregate counts: unique_exons x samples
txCounts <- assays(se)$counts
exonCounts <- exonTxMat %*% txCounts

# Build GRanges for unique exons
exonGRanges <- GRanges(
seqnames = exonMeta$seqnames,
ranges = IRanges(start = exonMeta$start, end = exonMeta$end),
strand = exonMeta$strand
)
names(exonGRanges) <- exonMeta$exon_id
mcols(exonGRanges)$GENEID <- exonMeta$GENEID

# Return as SummarizedExperiment
return(SummarizedExperiment(
assays = list(counts = exonCounts),
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SummarizedExperiment(assays = ...) is passed a base list, but the rest of the codebase consistently uses SimpleList(...) for assays when constructing SummarizedExperiment objects. Using SimpleList here would keep the return value consistent with other constructors and avoid any subtle differences in downstream behavior.

Suggested change
assays = list(counts = exonCounts),
assays = S4Vectors::SimpleList(counts = exonCounts),

Copilot uses AI. Check for mistakes.
rowRanges = exonGRanges,
colData = colData(se)
))
}
Loading