Commit 630875c

Update README; fix --help description
1 parent 0f5cad7 commit 630875c

2 files changed

Lines changed: 152 additions & 39 deletions

README.md

Lines changed: 151 additions & 38 deletions
Original file line number | Diff line number | Diff line change
@@ -5,99 +5,212 @@ ITS subregion extraction for fungal metabarcoding at long-read scale.
55
As long-read amplicon sequencing (Oxford Nanopore and PacBio HiFi) becomes routine, extracting ITS subregions (ITS1, 5.8S, ITS2, full ITS) reliably at scale can become a throughput and robustness bottleneck. ITSxRust is a Rust-based ITS extractor that follows the standard approach of locating conserved ribosomal flanks using profile-HMMs (via HMMER), while adding long-read–oriented features for reproducible, high-throughput processing.
66

77
## Features
8-
- HMMER/profile-HMM–based detection of conserved ribosomal flanks to extract ITS subregions
9-
- Supports long-read workloads (ONT / HiFi) with built-in parameter presets
10-
- Optional dereplication to reduce redundant HMMER searches
11-
- Partial-chain fallback: recover subregions using two-anchor pairs when a full four-anchor chain is unavailable
12-
- Structured failure diagnostics and QC summaries to help understand why reads were skipped or partially recovered
13-
- Works with FASTA and FASTQ inputs
148

15-
## Install
9+
- **HMMER-based boundary detection** — locates conserved ribosomal flanks (SSU, 5.8S, LSU) using `nhmmer` profile-HMM searches to delimit ITS subregions
10+
- **Platform presets**: `--preset ont` (tolerant E-values, wider length constraints) and `--preset hifi` (strict thresholds); explicit flags override any preset value
11+
- **Partial-chain fallback** — when the full 4-anchor chain (SSU→5.8S_start→5.8S_end→LSU) is unavailable, recovers subregions from 2-anchor pairs (e.g., SSU+5.8S_start for ITS1)
12+
- **Confidence classification** — each extracted read is labelled `confident`, `ambiguous`, or `partial` based on per-anchor score/E-value thresholds; ambiguous reads can be diverted to a separate file with `--write-ambiguous`
13+
- **Exact dereplication**: `--derep` hashes identical sequences and searches only unique representatives, projecting results back to duplicates
14+
- **Structured QC output**: `--qc-json` emits a per-sample JSON summary (read counts, skip-reason breakdown, parameters) suitable for MultiQC ingestion; `--anchors-tsv` / `--anchors-jsonl` export per-read anchor coordinates and confidence labels
15+
- **Multi-region output** — extract ITS1, ITS2, full ITS, or all three simultaneously (`--region all`)
16+
- **FASTA and FASTQ support** — reads gzipped or uncompressed inputs; preserves quality scores when outputting FASTQ
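
The partial-chain fallback listed above can be pictured with a small sketch. This is not ITSxRust's implementation: the `Anchors` struct, its field names, the `Chain` enum, and the 0-based half-open coordinates are all assumptions made for illustration. It only shows the idea that ITS1 needs the SSU and 5.8S_start anchors, ITS2 needs 5.8S_end and LSU, and a read missing the other pair can still yield one subregion.

```rust
/// Anchor positions found on one read (0-based, half-open coordinates).
/// Hypothetical type: field names are illustrative, not ITSxRust internals.
#[derive(Debug, Clone, Copy, Default)]
struct Anchors {
    ssu_end: Option<usize>,   // where the SSU flank ends
    s58_start: Option<usize>, // where 5.8S begins
    s58_end: Option<usize>,   // where 5.8S ends
    lsu_start: Option<usize>, // where the LSU flank begins
}

/// Whether the bounds came from a full 4-anchor chain or a 2-anchor pair.
#[derive(Debug, PartialEq)]
enum Chain {
    Full,
    Partial,
}

fn chain_kind(a: &Anchors) -> Chain {
    let all_four = a.ssu_end.is_some()
        && a.s58_start.is_some()
        && a.s58_end.is_some()
        && a.lsu_start.is_some();
    if all_four { Chain::Full } else { Chain::Partial }
}

/// ITS1 lies between the end of SSU and the start of 5.8S.
fn its1_bounds(a: &Anchors) -> Option<(usize, usize, Chain)> {
    let (start, end) = (a.ssu_end?, a.s58_start?);
    if start >= end {
        return None; // anchors out of order: no valid region
    }
    Some((start, end, chain_kind(a)))
}

/// ITS2 lies between the end of 5.8S and the start of LSU.
fn its2_bounds(a: &Anchors) -> Option<(usize, usize, Chain)> {
    let (start, end) = (a.s58_end?, a.lsu_start?);
    if start >= end {
        return None;
    }
    Some((start, end, chain_kind(a)))
}
```

With only `ssu_end` and `s58_start` set, `its1_bounds` returns bounds tagged `Chain::Partial` while `its2_bounds` returns `None`. The real tool additionally applies score, E-value, and length constraints that this sketch leaves out.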
17+
18+
## Quick start
19+
20+
```bash
21+
itsxrust extract \
22+
--input reads.fastq.gz \
23+
--hmm F.hmm \
24+
--output its2_extracted.fasta \
25+
--region its2 \
26+
--preset ont \
27+
--hmmer-cpu 8
28+
```
29+
30+
Expected output:
31+
32+
```
33+
Using preset: ont
34+
Params: in=Fastq out=Fasta region=its2 max_per_anchor=10 derep=false inc_e=1.0e-3 ...
35+
tblout hits parsed: 142389 | anchor hits: 131052 | stored(topK): 98412 | reads w/anchor hits: 48231
36+
Reads with computed bounds: 45102 (full-chain: 40811, confident: 38500, ambiguous: 2311, partial: 4291)
37+
Wrote output: its2_extracted.fasta
38+
Kept: 45102 (partial: 4291) Ambiguous (separate): 0 Skipped: 3129
39+
```
40+
41+
To extract ITS1, ITS2, and full ITS simultaneously, use `--region all`. In this mode, `--output` is treated as a prefix:
42+
43+
```bash
44+
itsxrust extract \
45+
--input reads.fastq.gz \
46+
--hmm F.hmm \
47+
--output results/sample1 \
48+
--region all \
49+
--preset ont \
50+
--hmmer-cpu 8
51+
```
52+
53+
This produces `results/sample1.its1.fasta`, `results/sample1.its2.fasta`, and `results/sample1.full.fasta`.
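
For scripting around `--region all`, the prefix-to-filename mapping described above can be reproduced in a few lines. The function name `output_paths` and its signature are hypothetical; only the resulting names (`<prefix>.its1.fasta`, `<prefix>.its2.fasta`, `<prefix>.full.fasta`) come from the README text.

```rust
/// Expected output file names for a given `--output` value and `--region`.
/// Hypothetical helper: mirrors the naming shown in the README, not the
/// tool's own code. `ext` is the output extension, e.g. "fasta".
fn output_paths(output: &str, region: &str, ext: &str) -> Vec<String> {
    match region {
        // With `all`, `--output` acts as a prefix and one file is written per region.
        "all" => ["its1", "its2", "full"]
            .iter()
            .map(|r| format!("{output}.{r}.{ext}"))
            .collect(),
        // With a single region, `--output` is the file path itself.
        _ => vec![output.to_string()],
    }
}
```

Calling `output_paths("results/sample1", "all", "fasta")` yields the three file names listed above, which a wrapper script could use to check that a run completed.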
1654

17-
### Prebuilt binaries (recommended)
18-
Download the appropriate binary for your OS from GitHub Releases:
55+
## Install
1956

20-
- GitHub → Releases → `v0.1.0`
57+
### Prebuilt binaries
2158

22-
Then:
59+
Download the binary for your platform from [GitHub Releases](https://github.com/ayobi/ITSxRust/releases), then:
2360

2461
```bash
2562
chmod +x itsxrust
2663
./itsxrust --help
2764
```
2865

29-
### From source
30-
Requires Rust (stable) and Cargo.
66+
### Docker
67+
68+
The Docker image bundles HMMER, so `nhmmer` is available out of the box:
3169

3270
```bash
33-
cargo build --release
34-
./target/release/itsxrust --help
71+
docker run --rm -v "$(pwd)":/data ghcr.io/ayobi/itsxrust:latest \
72+
extract --input /data/reads.fastq.gz --hmm /data/F.hmm \
73+
--output /data/its2_extracted.fasta --region its2 --preset ont
3574
```
3675

37-
Or install into your Cargo bin dir:
76+
### From source
77+
78+
Requires Rust (stable, edition 2024) and Cargo:
3879

3980
```bash
4081
cargo install --path .
4182
itsxrust --help
4283
```
4384

4485
### Dependency: HMMER
45-
ITSxRust coordinates HMMER searches (e.g., `hmmscan`) to locate ribosomal flanks. Ensure HMMER is available in your environment for typical extraction workflows.
4686

47-
## Usage
87+
ITSxRust calls `nhmmer` to search profile-HMMs against input sequences. Install HMMER 3.x and ensure `nhmmer` is on your PATH:
88+
89+
```bash
90+
conda install -c bioconda hmmer
91+
```
92+
93+
The Docker image includes HMMER, so no separate installation is needed when using the container.
4894

49-
Help:
95+
### HMM profiles
96+
97+
ITSxRust uses the same fungal HMM profiles as [ITSx](https://microbiology.se/software/itsx/). The file is typically called `F.hmm` (for fungi) and is distributed with ITSx. After installing ITSx, find it with:
98+
99+
```bash
100+
find "$(dirname "$(which ITSx)")/../" -name "F.hmm" 2>/dev/null
101+
```
102+
103+
Or download the ITSx package and extract the HMM files from the `ITSx_db/HMMs/` directory.
104+
105+
## Usage
50106

51107
```bash
52108
itsxrust --help
53109
itsxrust extract --help
54110
```
55111

56-
Example extraction:
112+
### Key options
113+
114+
| Flag | Description | Default |
115+
|---|---|---|
116+
| `--input` | Input FASTA/FASTQ (`.gz` OK) | required |
117+
| `--hmm` | HMM profile file | required (unless `--tblout-existing`) |
118+
| `--output` | Output file (single region) or prefix (`--region all`) | required |
119+
| `--region` | `its1`, `its2`, `full`, or `all` | `full` |
120+
| `--preset` | `ont` or `hifi` — sets E-value, constraints, confidence thresholds ||
121+
| `--hmmer-cpu` | Threads for nhmmer | 8 |
122+
| `--inc-e` | E-value inclusion threshold | 1e-5 (ont: 1e-3, hifi: 1e-10) |
123+
| `--derep` | Exact dereplication before HMMER search | false |
124+
| `--input-format` | `auto`, `fasta`, or `fastq` | `auto` |
125+
| `--output-format` | `auto`, `fasta`, or `fastq` | `auto` |
126+
| `--tblout-existing` | Reuse a previous nhmmer `--tblout` file (skips nhmmer) ||
127+
| `--anchors-tsv` | Write per-read anchor coordinates as TSV ||
128+
| `--anchors-jsonl` | Write per-read anchor coordinates as JSONL ||
129+
| `--qc-json` | Write per-sample QC summary as JSON ||
130+
| `--write-ambiguous` | Divert ambiguous reads to a separate file ||
131+
| `--write-skipped` | Write skipped reads to a separate file ||
132+
| `--explain N` | Print skip reasons for the first N skipped reads | 0 |
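
The idea behind `--derep` (search each distinct sequence once, then copy the result to every duplicate) can be sketched as follows. This is a minimal illustration under assumptions: the function names `dereplicate` and `project` are hypothetical, and the real tool may hash and store sequences differently.

```rust
use std::collections::HashMap;

/// Group identical sequences. Returns the unique sequences in first-seen
/// order and, for each input read, the index of its representative.
fn dereplicate<'a>(seqs: &[&'a str]) -> (Vec<&'a str>, Vec<usize>) {
    let mut index_of: HashMap<&'a str, usize> = HashMap::new();
    let mut uniques: Vec<&'a str> = Vec::new();
    let mut rep = Vec::with_capacity(seqs.len());
    for &s in seqs {
        let next = uniques.len();
        // Reuse the existing index for a sequence seen before, else claim `next`.
        let idx = *index_of.entry(s).or_insert(next);
        if idx == next {
            uniques.push(s);
        }
        rep.push(idx);
    }
    (uniques, rep)
}

/// Project one result per unique sequence back onto every original read.
fn project<T: Clone>(per_unique: &[T], rep: &[usize]) -> Vec<T> {
    rep.iter().map(|&i| per_unique[i].clone()).collect()
}
```

Only `uniques` would be passed to the expensive search step; `project` then expands the per-unique results so that every input read, duplicate or not, receives an answer.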
133+
134+
### Platform presets
135+
136+
Presets bundle sensible defaults for each platform. Explicit flags always override preset values.
137+
138+
| Parameter | No preset | `--preset ont` | `--preset hifi` |
139+
|---|---|---|---|
140+
| `--inc-e` | 1e-5 | 1e-3 | 1e-10 |
141+
| `--max-per-anchor` | 8 | 10 | 6 |
142+
| `--min-its1` / `--max-its1` | 50 / 1500 | 30 / 1800 | 50 / 1500 |
143+
| `--min-its2` / `--max-its2` | 50 / 2000 | 30 / 2500 | 50 / 2000 |
144+
| `--min-full` / `--max-full` | 150 / 4000 | 100 / 5000 | 150 / 4000 |
145+
| `--min-anchor-score` | 20 | 15 | 30 |
146+
| `--max-anchor-evalue` | 1e-4 | 1e-3 | 1e-8 |
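
The override rule (preset first, explicit flags on top) can be sketched with `Option` values. The default numbers below are copied from the table above; the `Thresholds` struct, the `Preset` variant names, and the `resolve` function are assumptions for illustration and cover only three of the parameters.

```rust
/// Platform preset. Variant names are assumed, not taken from the source.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Preset {
    Ont,
    Hifi,
}

/// A subset of the tunable parameters from the preset table.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Thresholds {
    inc_e: f64,
    max_per_anchor: usize,
    min_anchor_score: f64,
}

/// Baseline values for no preset, `ont`, and `hifi` (from the table above).
fn preset_defaults(preset: Option<Preset>) -> Thresholds {
    match preset {
        None => Thresholds { inc_e: 1e-5, max_per_anchor: 8, min_anchor_score: 20.0 },
        Some(Preset::Ont) => Thresholds { inc_e: 1e-3, max_per_anchor: 10, min_anchor_score: 15.0 },
        Some(Preset::Hifi) => Thresholds { inc_e: 1e-10, max_per_anchor: 6, min_anchor_score: 30.0 },
    }
}

/// Explicit flags (`Some`) win; anything left as `None` falls back to the preset.
fn resolve(
    preset: Option<Preset>,
    inc_e: Option<f64>,
    max_per_anchor: Option<usize>,
    min_anchor_score: Option<f64>,
) -> Thresholds {
    let base = preset_defaults(preset);
    Thresholds {
        inc_e: inc_e.unwrap_or(base.inc_e),
        max_per_anchor: max_per_anchor.unwrap_or(base.max_per_anchor),
        min_anchor_score: min_anchor_score.unwrap_or(base.min_anchor_score),
    }
}
```

For example, `resolve(Some(Preset::Hifi), Some(1e-6), None, None)` keeps the HiFi values for `max_per_anchor` and `min_anchor_score` but uses `1e-6` for `inc_e`.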
147+
148+
### Diagnostics
149+
150+
Use `--explain` to see why reads are being skipped:
57151

58152
```bash
59-
itsxrust extract --input reads.fastq.gz --hmm path/to/F.hmm --region its2 --output out_dir/ --hmmer-cpu 8
153+
itsxrust extract --input reads.fq --hmm F.hmm --output out.fasta \
154+
--region full --explain 10
155+
```
156+
157+
```
158+
SKIP read_42: missing anchors: LSU_start
159+
SKIP read_87: anchors present but no valid chain under constraints
60160
```
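
When `--explain` is run over many reads, tallying the reasons is often more useful than reading individual lines. The sketch below assumes the `SKIP <read_id>: <reason>` layout shown in the sample output above and that read IDs contain no `": "` sequence; the function names are hypothetical and the exact format may differ between versions.

```rust
use std::collections::BTreeMap;

/// Parse one line of the form `SKIP <read_id>: <reason>`.
/// Returns `None` for any line that does not match.
fn parse_skip_line(line: &str) -> Option<(&str, &str)> {
    let rest = line.strip_prefix("SKIP ")?;
    // Split at the first ": " so reasons such as
    // "missing anchors: LSU_start" stay intact.
    let (read_id, reason) = rest.split_once(": ")?;
    Some((read_id, reason.trim()))
}

/// Count how often each skip reason occurs in captured output.
fn tally_skip_reasons(log: &str) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for line in log.lines() {
        if let Some((_, reason)) = parse_skip_line(line.trim()) {
            *counts.entry(reason.to_string()).or_insert(0) += 1;
        }
    }
    counts
}
```

For whole-run statistics, the `--qc-json` summary is the more robust source, since it is designed to be machine-readable.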
61161

62-
Presets (ONT / HiFi) are available via the CLI (see `itsxrust extract --help`).
162+
Use `--qc-json` for a machine-readable summary of the full run:
63163

64-
## Inputs / Outputs
164+
```bash
165+
itsxrust extract --input reads.fq --hmm F.hmm --output out.fasta \
166+
--region all --preset ont --qc-json qc_summary.json
167+
```
168+
169+
The JSON includes total reads, kept/skipped counts with reason-code breakdowns, confidence classification counts, dereplication stats (if `--derep`), and the effective parameters used.
170+
171+
Use `--anchors-tsv` or `--anchors-jsonl` to export per-read anchor hit coordinates, confidence labels, and ambiguous-reason annotations for reads that produced a valid chain.
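
One plausible reading of the confidence labels is sketched below. In the sample run shown under Quick start, the full-chain count equals confident plus ambiguous (38500 + 2311 = 40811), which suggests that `partial` marks reads recovered without a full chain and that the other two labels split the full-chain reads. The rule used here (every anchor must meet `--min-anchor-score` and `--max-anchor-evalue`) is an assumption, as are the type and function names.

```rust
#[derive(Debug, PartialEq)]
enum Confidence {
    Confident,
    Ambiguous,
    Partial,
}

/// Score and E-value of one anchor hit (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct AnchorHit {
    score: f64,
    evalue: f64,
}

/// Label a read from its anchor hits. Assumed rule: reads without a full
/// chain are `Partial`; full-chain reads are `Confident` only if every
/// anchor passes both thresholds, otherwise `Ambiguous`.
fn classify(hits: &[AnchorHit], full_chain: bool, min_score: f64, max_evalue: f64) -> Confidence {
    if !full_chain {
        return Confidence::Partial;
    }
    let all_pass = hits
        .iter()
        .all(|h| h.score >= min_score && h.evalue <= max_evalue);
    if all_pass {
        Confidence::Confident
    } else {
        Confidence::Ambiguous
    }
}
```

With the no-preset thresholds (score 20, E-value 1e-4), a full-chain read with one anchor scoring 12 would be labelled `Ambiguous` under this rule and could then be diverted with `--write-ambiguous`.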
172+
173+
## Inputs / outputs
65174

66175
**Inputs**
67-
- FASTA / FASTQ (optionally gzipped)
68-
- HMM model file (profile-HMMs for ribosomal flanks)
176+
177+
- FASTA or FASTQ, optionally gzipped
178+
- HMM profile file (e.g., `F.hmm` from ITSx)
69179

70180
**Outputs**
71-
- FASTA of extracted regions (ITS1 / ITS2 / full)
72-
- Optional anchor/boundary outputs (TSV/JSONL) and QC summaries (if enabled)
73181

74-
## Development
182+
- Extracted sequences as FASTA or FASTQ (one file per region, or one file for a single region)
183+
- Optional: per-read anchor coordinates (TSV or JSONL), with confidence and ambiguity annotations
184+
- Optional: QC summary JSON (`--qc-json`)
185+
- Optional: ambiguous reads (`--write-ambiguous`) and skipped reads (`--write-skipped`) in separate files
75186

76-
Run checks:
187+
## Development
77188

78189
```bash
79190
cargo fmt
80191
cargo clippy --all-targets --all-features -- -D warnings
81192
cargo test
82193
```
83194

84-
Benchmarks and simulation scripts live in `bench/`.
195+
Benchmarking and simulation scripts live in `bench/`. See `bench/sim/README.md` for the simulation-based evaluation pipeline.
85196

86197
## Project layout
87-
- `src/` Rust source
88-
- `tests/` integration tests
89-
- `bench/` benchmarking + simulation harness
90-
- `testdata/` small fixtures used for tests
91-
- `manuscript/figures/` final figure outputs
92198

93-
Large datasets and generated outputs should stay untracked.
199+
```
200+
src/ Rust source (main, select, tblout, trim, preset, derep, report, seq, fasta, fastq, hmmer)
201+
tests/ Integration tests
202+
bench/ Benchmarking + simulation harness
203+
```
94204

95205
## Roadmap
96-
- Container images (GHCR)
206+
97207
- Bioconda recipe
208+
- In-process HMM bindings (replace nhmmer subprocess)
98209

99210
## License
211+
100212
MIT (see `LICENSE`).
101213

102214
## Citation
103-
If you use ITSxRust, please cite the repository metadata via GitHub’s “Cite this repository” button (powered by `CITATION.cff`).
215+
216+
If you use ITSxRust, please cite the repository via GitHub's "Cite this repository" button (powered by `CITATION.cff`).

src/main.rs

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -42,7 +42,7 @@ pub enum Preset {
4242
}
4343

4444
#[derive(Parser, Debug)]
45-
#[command(name = "itsxrust", version, about = "ONT ITS region extractor (v1)")]
45+
#[command(name = "itsxrust", version, about = "ITS region extractor for long-read amplicon sequencing")]
4646
struct Cli {
4747
#[command(subcommand)]
4848
command: Commands,
