Skip to content

fix: adjust NDR threshold behavior. Previously, NDR = 1 would overrid…#559

Open
SuiYue-2308 wants to merge 3 commits intodevel_pre_v4from
devel_pre_v4_fix_NDR_1
Open

fix: adjust NDR threshold behavior. Previously, NDR = 1 would overrid…#559
SuiYue-2308 wants to merge 3 commits intodevel_pre_v4from
devel_pre_v4_fix_NDR_1

Conversation

@SuiYue-2308
Copy link
Copy Markdown
Collaborator

@SuiYue-2308 SuiYue-2308 commented Apr 15, 2026

fix: adjust NDR threshold behavior

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Description

Previously, NDR = 1 would override remove.subset and min.sampleNumber. Now, NDR = 0.999 and NDR = 1 produce identical results except for the NDR threshold itself.

Implementation Details

Before, if the read class is a subset or low confidence transcript (can not pass the min.sampleNumber), it's NDR will be assigned to 1, which will be further reported when setting NDR = 1. Now, I set their NDR to NA, and add another check step before report novel transcript (!is.na(NDR)) to make sure the subset and low confidence are not included.

Impact of Changes

Briefly describe the impact of the implemented changes. For example:

  • Results are identical to the previous version when NDR = 0.999.
  • Now "NDR = 1" will not report subset and low confidence transcripts

Checklist

  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have tested the code on a full dataset, and any differences have been described in the Impact of Changes section.

…e `remove.subset` and `min.sampleNumber`. Now, NDR = 0.999 and NDR = 1 produce identical results except for the NDR threshold itself.
@SuiYue-2308 SuiYue-2308 linked an issue Apr 15, 2026 that may be closed by this pull request
@cying111
Copy link
Copy Markdown
Collaborator

cying111 commented Apr 15, 2026

Adding in my PR review comment here (also with assistance of claude code), my summary is issue 1, 3 worth noting, issue 4 is a comment adding, maybe good to have, but also fine to not have

Prompt used:
now help me review this PR: #559

Worth addressing before merge

  1. bambu.R:164 sets NDR <- 1 for mode == "fusion". This fix
    therefore changes fusion-mode behaviour: fusion mode now respects
    remove.subsetTx and min.sampleNumber, where previously it didn't.
    That may be correct (and may be exactly what refine the NDR usage #560 wants) but it's not
    mentioned in the PR description. Either:

    • confirm it's intended and call it out in the PR body / NEWS, or
    • switch fusion mode to a sentinel that preserves the old permissive
      behaviour (e.g. pass NDR = Inf plus a separate remove.subsetTx = FALSE in fusion mode).
  2. No test. The PR's central claim — "NDR = 0.999 and NDR = 1 produce
    identical results except for the NDR threshold itself" — is easy to
    encode as a testthat expectation against the existing fixture
    (seIsoReCombined_SGNex_..._chr9_1_1000000.rds + the annotation GR).
    The table above is effectively that test; porting it would guard against
    a silent regression.

  3. test_setNDR.R line 68–82 (pre-existing). The block
    test_that("setNDR works when no lowConfidenceTranscripts are present", ...)
    computes extendedAnnotations via isore.extendAnnotations(..., NDR = 1, ...)
    but then asserts on the pre-saved extendedAnnotationsExpected fixture
    — the freshly computed extendedAnnotations is never used. Not a
    blocker for this PR, but this was the test most likely to have caught
    the pre-fix bug, and it silently passed because of that oversight. A
    follow-up to rewire that test against the computed object is worth a
    separate commit.

Minor

  1. The new predicate !is.na(x) & x <= NDR is a clean way to kill off NAs,
    but it's worth a brief code comment that NA NDR = "deliberately filtered"
    (subset or scoreless) so a future reader doesn't "simplify" it back.

@SuiYue-2308 SuiYue-2308 requested a review from cying111 April 15, 2026 09:07
@jonathangoeke
Copy link
Copy Markdown
Member

Code review

Top 3 issues by confidence score (lower threshold than usual):

  1. (score: 75) setNDR() includeRef=FALSE branch lacks !is.na() guards. With this PR, transcripts in lowConfidenceTranscripts can have NDR = NA. The includeRef=TRUE branch guards against NA, but the default includeRef=FALSE branch does not — NA <= NDR propagates NA into toAdd/toRemove, then into logical subsetting of the GRangesList, which can produce corrupted/NA entries.

#If reference annotations should be filtered too (note that reference annotations with no read support arn't filtered)
if(includeRef){
toRemove <- (!is.na(mcols(extendedAnnotations)$NDR) & mcols(extendedAnnotations)$NDR > NDR)
toAdd <- !is.na(mcols(metadata(extendedAnnotations)$lowConfidenceTranscripts)$NDR) &
mcols(metadata(extendedAnnotations)$lowConfidenceTranscripts)$NDR <= NDR
} else {
toRemove <- (mcols(extendedAnnotations)$NDR > NDR &
grepl(prefix, mcols(extendedAnnotations)$TXNAME))
toAdd <- (mcols(metadata(extendedAnnotations)$lowConfidenceTranscripts)$NDR <= NDR &
grepl(prefix, mcols(metadata(extendedAnnotations)$lowConfidenceTranscripts)$TXNAME))
}

  1. (score: 50) Fusion mode side effect. bambu.R sets NDR <- 1 for fusion mode. Combined with the new filter at utilityExtend.R:115 (which now drops NDR == NA rows), fusion-mode output may differ from prior behavior in ways that aren't covered by tests.

bambu/R/bambu.R

Lines 161 to 167 in 25aa679

processByChromosome <- TRUE
}
if(mode == "fusion"){
NDR <- 1
fusionMode <- TRUE
}
if(mode == "debug"){

  1. (score: 25) Semantic change to prior intent. Commit 0b47cb8 introduced NDR <- 1 for subset/low-confidence transcripts with the stated purpose "Report all possible transcripts if NDR = 1". This PR reverses that intent. Likely intentional given the PR description, but worth calling out explicitly so downstream users relying on NDR = 1 to recover all candidates know the behavior has changed.

message("NDR will be approximated as: (1 - Transcript Model Prediction Score)")
} else {
combinedTranscripts$NDR <- calculateNDR(combinedTranscripts$maxTxScore, equal)
}
combinedTranscripts$NDR[combinedTranscripts$maxTxScore==-1] <- NA
return(combinedTranscripts)
}

Only #1 would have cleared the default score threshold (80); #2 and #3 are included at the user's request for a lower-threshold review.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Copy link
Copy Markdown
Member

@jonathangoeke jonathangoeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you confirm that you can reproduce results with NDR=1 (previous version) and NDR=1 (including subset and low confidence read classes)? Which parameters need to be selected to reproduce this behaviour?

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts how Novel Discovery Rate (NDR) thresholding behaves during annotation extension so that NDR = 1 no longer unintentionally re-includes subset / low-confidence transcripts (e.g., those failing remove.subset or min.sampleNumber filters).

Changes:

  • Treat transcripts marked as filtered (maxTxScore == -1) as NDR = NA instead of NDR = 1.
  • Exclude NDR = NA transcripts from NDR-threshold filtering while still retaining "equal:compatible" classes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

combinedTranscripts$NDR <- calculateNDR(combinedTranscripts$maxTxScore, equal)
}
combinedTranscripts$NDR[combinedTranscripts$maxTxScore==-1] <- 1
combinedTranscripts$NDR[combinedTranscripts$maxTxScore==-1] <- NA
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing filtered transcripts' NDR from 1 to NA means downstream thresholding logic must be NA-safe. In setNDR() (same file), the includeRef = FALSE branch builds toRemove/toAdd without !is.na(...), so transcripts with NDR = NA in metadata(extendedAnnotations)$lowConfidenceTranscripts can yield NA logical indices and break or mis-subset GRanges/GRangesList objects. Update those conditions to treat NA as FALSE (mirror the includeRef = TRUE branch by adding !is.na(mcols(...)$NDR) guards).

Copilot uses AI. Check for mistakes.
Comment on lines +115 to 116
filterSet <- ((!is.na(rowDataCombined$NDR) & rowDataCombined$NDR <= NDR) | rowDataCombined$readClassType == "equal:compatible")
lowConfidenceTranscripts <- combindRowDataWithRanges(
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change alters the edge-case semantics for NDR = 1 vs NDR = 0.999 by excluding transcripts with NDR = NA (subset / min.sampleNumber-filtered). There isn’t currently a regression test covering this behavior (i.e., that NDR = 1 no longer re-includes subset/low-confidence transcripts, while NDR = 0.999 matches previous results). Consider adding a test that runs isore.extendAnnotations() with NDR = 1 and asserts subset/low-confidence transcripts are still excluded (e.g., via metadata(... )$lowConfidenceTranscripts / counts).

Copilot uses AI. Check for mistakes.
@SuiYue-2308
Copy link
Copy Markdown
Collaborator Author

SuiYue-2308 commented Apr 16, 2026

Did you confirm that you can reproduce results with NDR=1 (previous version) and NDR=1 (including subset and low confidence read classes)? Which parameters need to be selected to reproduce this behaviour?

Hi @jonathangoeke, Yes, we can get the exact same result with previous NDR = 1 by setting opt.discovery = list(remove.subsetTx = FALSE, min.readCount = 1, min.sampleNumber = 0) to include the subset and low confidence.

combinedTranscripts$NDR <- calculateNDR(combinedTranscripts$maxTxScore, equal)
}
combinedTranscripts$NDR[combinedTranscripts$maxTxScore==-1] <- 1
combinedTranscripts$NDR[combinedTranscripts$maxTxScore==-1] <- NA
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a similar comment as copilot, if set NDR equal to NA, then following commands related to NDR need to be NA aware. Need to check on this, and make sure all following commands pass.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than this, different scenarios have been tested, and all works as expected:
testing for isore.extendAnnotations reported set of transcripts are expectedly corrected to have subsetTranscripts kept even if rm.subsetTx = TRUE and all filtering pre-set thresholds
testing for whole bambu shows that quantification results impacted, cause reporting set of transcripts changed at last

Also one thing is that this fix does not only fix the override of return.subsetTx and min.SampleNumber, and also other base filtering, maybe can consider to make this more explicit

@SuiYue-2308
Copy link
Copy Markdown
Collaborator Author

SuiYue-2308 commented Apr 16, 2026

Hi @cying111 @jonathangoeke

Following the comments, I updated this pull request.

  1. Fusion mode: previously, the fusion mode will set NDR = 1, which will include everything. Now I add the following setting to the fusion mode to return the same result with fusion mode.
opt.discovery$remove.subsetTx <- FALSE
opt.discovery$min.readCount <- 1
opt.discovery$min.sampleNumber <- 0

I have tested the fusion mode on a real data set, the rowRanges and the assays are identical, the only thing is that this branch not keep the subset in the metadata(rowRanges(se)). If we want to keep subset need to fix the fusion mode in future.

  1. includeRef
    I have removed this parameter since we never use this.

  2. the third part from @jonathangoeke generated by code agent is not that clear to me.

@cying111
Copy link
Copy Markdown
Collaborator

cying111 commented Apr 16, 2026

Hi @SuiYue-2308 , I think you are right, min.readCount is needed only for the change, but the se object changed in the end,
I think NDR, maxTxScore, maxTxScore.noFit changes are expected because change from 1 to NA, would cause these differences
The rowRanges(se_v4)$metadata vs rowRanges(se_pr)$metadata are details below as well, which I think are expected as well.

Error: Expected se_v4 to equal se_pr.
Differences:
Attributes: < Component “rowRanges”: Attributes: < Component “elementMetadata”: Attributes: < Component “listData”: Component “NDR”: Mean relative difference: 0.1815954 > > >
Attributes: < Component “rowRanges”: Attributes: < Component “elementMetadata”: Attributes: < Component “listData”: Component “maxTxScore”: Mean relative difference: 1.181595 > > >
Attributes: < Component “rowRanges”: Attributes: < Component “elementMetadata”: Attributes: < Component “listData”: Component “maxTxScore.noFit”: Mean relative difference: 1.181595 > > >
Attributes: < Component “rowRanges”: Attributes: < Component “metadata”: Names: 1 string mismatch > >
Attributes: < Component “rowRanges”: Attributes: < Component “metadata”: Length mismatch: comparison on first 2 components > >
Attributes: < Component “rowRanges”: Attributes: < Component “metadata”: Component 2: Lengths: 17, 0

> rowRanges(se_v4)@metadata
$NDRthreshold
[1] 1

$subsetTranscripts
GRangesList object of length 17:
$BambuTx1
GRanges object with 3 ranges and 2 metadata columns:
      seqnames      ranges strand | exon_rank exon_endRank
         <Rle>   <IRanges>  <Rle> | <integer>    <integer>
  [1]        9 14513-16876      - |         3            1
  [2]        9 16969-17166      - |         2            2
  [3]        9 17344-17410      - |         1            3
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$BambuTx3
GRanges object with 3 ranges and 2 metadata columns:
      seqnames      ranges strand | exon_rank exon_endRank
         <Rle>   <IRanges>  <Rle> | <integer>    <integer>
  [1]        9 14517-14940      - |         3            1
  [2]        9 15081-15149      - |         2            2
  [3]        9 15909-16064      - |         1            3
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$BambuTx5
GRanges object with 2 ranges and 2 metadata columns:
      seqnames      ranges strand | exon_rank exon_endRank
         <Rle>   <IRanges>  <Rle> | <integer>    <integer>
  [1]        9 14521-14940      - |         2            1
  [2]        9 15081-15750      - |         1            2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

...
<14 more elements>

$lowConfidenceTranscripts
GRangesList object of length 0:
<0 elements>

> rowRanges(se_pr)@metadata
$NDRthreshold
[1] 1

$lowConfidenceTranscripts
GRangesList object of length 0:
<0 elements>

@SuiYue-2308
Copy link
Copy Markdown
Collaborator Author

Thanks for the code review!

I will merge this pull request now, and we will need to look into the metadata to decide whether we still want to keep the subset and low confidence transcript.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

refine the NDR usage

4 participants