@@ -5,99 +5,212 @@ ITS subregion extraction for fungal metabarcoding at long-read scale.
 As long-read amplicon sequencing (Oxford Nanopore and PacBio HiFi) becomes routine, extracting ITS subregions (ITS1, 5.8S, ITS2, full ITS) reliably at scale can become a throughput and robustness bottleneck. ITSxRust is a Rust-based ITS extractor that follows the standard approach of locating conserved ribosomal flanks using profile-HMMs (via HMMER), while adding long-read–oriented features for reproducible, high-throughput processing.

 ## Features
-- HMMER/profile-HMM–based detection of conserved ribosomal flanks to extract ITS subregions
-- Optional dereplication to reduce redundant HMMER searches
-- Partial-chain fallback: recover subregions using two-anchor pairs when a full four-anchor chain is unavailable
-- Structured failure diagnostics and QC summaries to help understand why reads were skipped or partially recovered
-- Works with FASTA and FASTQ inputs

-## Install
+- **HMMER-based boundary detection**: locates conserved ribosomal flanks (SSU, 5.8S, LSU) using `nhmmer` profile-HMM searches to delimit ITS subregions
+- **Platform presets**: `--preset ont` (tolerant E-values, wider length constraints) and `--preset hifi` (strict thresholds); explicit flags override any preset value
+- **Partial-chain fallback**: when the full 4-anchor chain (SSU→5.8S_start→5.8S_end→LSU) is unavailable, recovers subregions from 2-anchor pairs (e.g., SSU+5.8S_start for ITS1)
+- **Confidence classification**: each extracted read is labelled `confident`, `ambiguous`, or `partial` based on per-anchor score/E-value thresholds; ambiguous reads can be diverted to a separate file with `--write-ambiguous`
+- **Exact dereplication**: `--derep` hashes identical sequences and searches only unique representatives, projecting results back to duplicates
+- **Structured QC output**: `--qc-json` emits a per-sample JSON summary (read counts, skip-reason breakdown, parameters) suitable for MultiQC ingestion; `--anchors-tsv` / `--anchors-jsonl` export per-read anchor coordinates and confidence labels
+- **Multi-region output**: extract ITS1, ITS2, full ITS, or all three simultaneously (`--region all`)
+- **FASTA and FASTQ support**: reads gzipped or uncompressed inputs; preserves quality scores when outputting FASTQ
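The partial-chain fallback described in the feature list can be pictured as a lookup over whichever anchors were found. The Python sketch below is illustrative only and is not ITSxRust's implementation: the `Anchors` container, its field names, and the `recoverable_regions` helper are hypothetical, and pairing 5.8S_end with LSU for ITS2 is an assumption made by analogy with the SSU+5.8S_start example for ITS1.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Anchors(NamedTuple):
    """Hypothetical per-read anchor positions (0-based); None means no hit."""
    ssu_end: Optional[int]     # index just past the SSU flank
    s58_start: Optional[int]   # index of the first 5.8S base
    s58_end: Optional[int]     # index just past the last 5.8S base
    lsu_start: Optional[int]   # index of the first LSU base


def recoverable_regions(a: Anchors) -> Dict[str, Tuple[int, int]]:
    """Return half-open (start, end) spans for each region the anchors delimit."""
    regions: Dict[str, Tuple[int, int]] = {}
    # Two-anchor pair SSU + 5.8S_start delimits ITS1 (pairing stated in the README).
    if a.ssu_end is not None and a.s58_start is not None and a.ssu_end < a.s58_start:
        regions["its1"] = (a.ssu_end, a.s58_start)
    # Two-anchor pair 5.8S_end + LSU delimits ITS2 (assumed by analogy).
    if a.s58_end is not None and a.lsu_start is not None and a.s58_end < a.lsu_start:
        regions["its2"] = (a.s58_end, a.lsu_start)
    # Full ITS is emitted here only when the complete, correctly ordered chain exists.
    if "its1" in regions and "its2" in regions and a.s58_start < a.s58_end:
        regions["full"] = (a.ssu_end, a.lsu_start)
    return regions


# SSU and 5.8S_start found, 5.8S_end and LSU missing: only ITS1 is recoverable.
partial = Anchors(ssu_end=10, s58_start=30, s58_end=None, lsu_start=None)
print(recoverable_regions(partial))  # {'its1': (10, 30)}
```

Returning a dictionary of spans keeps the fallback decision separate from sequence slicing, so the same result could drive FASTA or FASTQ output.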
--output /data/its2_extracted.fasta --region its2 --preset ont
 ```

-Or install into your Cargo bin dir:
+### From source
+
+Requires Rust (stable, edition 2024) and Cargo:

 ```bash
 cargo install --path .
 itsxrust --help
 ```

 ### Dependency: HMMER
-ITSxRust coordinates HMMER searches (e.g., `hmmscan`) to locate ribosomal flanks. Ensure HMMER is available in your environment for typical extraction workflows.

-## Usage
+ITSxRust calls `nhmmer` to search profile-HMMs against input sequences. Install HMMER 3.x and ensure `nhmmer` is on your PATH:
+
+```bash
+conda install -c bioconda hmmer
+```
+
+The Docker image includes HMMER, so no separate installation is needed when using the container.
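When running outside the container, it can help to confirm that `nhmmer` resolves on the PATH before starting a long extraction. The sketch below uses only the Python standard library; `find_nhmmer` is a hypothetical helper name and is not part of ITSxRust.

```python
import shutil


def find_nhmmer(executable: str = "nhmmer") -> str:
    """Return the absolute path of the executable, or raise if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        raise FileNotFoundError(
            f"{executable} was not found on PATH; install HMMER 3.x first"
        )
    return path
```

Calling `find_nhmmer()` at the top of a pipeline script turns a missing dependency into an immediate, readable error.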

-Help:
+### HMM profiles
+
+ITSxRust uses the same fungal HMM profiles as [ITSx](https://microbiology.se/software/itsx/). The file is typically called `F.hmm` (for fungi) and is distributed with ITSx. After installing ITSx, find it with:
--region all --preset ont --qc-json qc_summary.json
+```
+
+The JSON includes total reads, kept/skipped counts with reason-code breakdowns, confidence classification counts, dereplication stats (if `--derep`), and the effective parameters used.
+
+Use `--anchors-tsv` or `--anchors-jsonl` to export per-read anchor hit coordinates, confidence labels, and ambiguous-reason annotations for reads that produced a valid chain.
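A JSONL export like this can be summarised downstream with a few lines of Python. The sketch below tallies one field across records. The README does not document the JSONL schema, so the default key name `confidence` and the helper name `tally_field` are assumptions; pass whatever key the actual file uses.

```python
import json
from collections import Counter
from typing import Iterable


def tally_field(lines: Iterable[str], field: str = "confidence") -> Counter:
    """Count the values of one field across JSONL records, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Records lacking the field are counted under a placeholder label.
        counts[record.get(field, "<missing>")] += 1
    return counts
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `tally_field(open(path))` with a path of your choosing.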
+
+## Inputs / outputs

 **Inputs**
-- FASTA / FASTQ (optionally gzipped)
-- HMM model file (profile-HMMs for ribosomal flanks)
+
+- FASTA or FASTQ, optionally gzipped
+- HMM profile file (e.g., `F.hmm` from ITSx)

 **Outputs**
-- FASTA of extracted regions (ITS1 / ITS2 / full)
-- Optional anchor/boundary outputs (TSV/JSONL) and QC summaries (if enabled)

-## Development
+- Extracted sequences as FASTA or FASTQ (one file per region, or one file for a single region)
+- Optional: per-read anchor coordinates (TSV or JSONL), with confidence and ambiguity annotations
+- Optional: QC summary JSON (`--qc-json`)
+- Optional: ambiguous reads (`--write-ambiguous`) and skipped reads (`--write-skipped`) in separate files