Snakemake pipeline for 16S/ITS metabarcoding which is extensible to other amplicons.
Reads may contain a mix of 16S and ITS amplicons pooled in the same library. Reads are primer-split with cutadapt, then each detected amplicon is processed independently through QIIME2 + DADA2 to count matrices with taxonomy.
Taxonomy uses three pretrained, prebuilt QIIME2 classifiers. 16S reads are classified twice — once with SILVA and once with GTDB:
| Amplicon | Classifier | QIIME2 / scikit-learn / archive | Classify runs on |
|---|---|---|---|
| 16S | SILVA 138.2 full-length SSU NR99, uniform (arb-silva) | 2025.7 / 1.4.2 / archive 7.0 | 2026.1 conda env |
| 16S | GTDB r226.0 full-length (Zenodo 19686584) | 2026.4 / 1.7.1 / archive 7.1 | 2026.4 conda env |
| ITS | UNITE v10.0 dynamic / all-eukaryotes (unite-train) | 2026.4 / 1.7.1 / archive 7.x | 2026.4 conda env |
Each pretrained classifier must be loaded by an env that can both run its
scikit-learn version and read its .qza archive format, so the pipeline uses
three environments (see workflow/amplicon-16s-ITS-QIIME2-DADA2.smk):
QIIME2/2024.10-ampliconmodule (archive 6) — shared 16S import/denoise. Its archive-6 outputs are readable by every newer env.- 2026.1 conda env (sklearn 1.4.2, archive 7.0) — SILVA classify only. The 2024.10 module cannot read the SILVA classifier's archive 7.0, and the 2026.4 env cannot load a 1.4.2 classifier; 2026.1 satisfies both.
- 2026.4 conda env (sklearn 1.7.1, archive 7.1) — GTDB classify, all 16S collapse/export, and the entire ITS chain.
Citation: arb-silva requires citing the SILVA classifier DOI. It is saved in
resources/16s/SILVA138.2_SSURef_NR99_uniform_classifier_full-length.qza.doi. Also cite GTDB (https://doi.org/10.1093/nar/gkab776) and UNITE.
# 1. Create the QIIME2 conda envs (Snakemake also builds its own copies under
# .snakemake/conda from the same files; these named envs are handy for manual
# `qiime tools peek` checks). The 2024.10 env is the existing module.
conda env create -n qiime2-amplicon-2026.4 \
--file workflow/envs/qiime2-amplicon-2026.4.yaml
conda env create -n qiime2-amplicon-2026.1 \
--file workflow/envs/qiime2-amplicon-2026.1.yaml # SILVA classify (sklearn 1.4.2)
# 2. Download the three pretrained classifiers into resources/16s and resources/ITS
bash resources/download_classifiers.sh
# 3. gzip-stage the NZGL00969 runs (uncompressed .fastq in a read-only archive)
# into a writable staging dir, one subdir per run.
bash resources/stage_fastq_runs.shClassifier paths and primers are configured in
config/pipeline_config.yaml. Primers were confirmed
with primer_check.sh (16S V1-V2 = 27F/338R; ITS2 = ITS3_KYO2/ITS4).
The pipeline processes one run per invocation. Run it once per staged run,
overriding AMPLICON_RUN_PATH on the CLI:
module load snakemake
STAGING=/mnt/gpfs/persist/projects/2026_ssif_advancing_soil_microbiome/staging
for runid in \
150325_M00933_0072_000000000-AE9PC \
150325_M02810_0078_000000000-AE9PL \
150330_M02810_0080_000000000-ADKJU \
150417_M02810_0085_000000000-AEBHB ; do
snakemake \
--profile config/slurm_profiles/eri \
--snakefile workflow/amplicon-16s-ITS-QIIME2-DADA2.smk \
--config AMPLICON_RUN_PATH="$STAGING/$runid"
donePer run, under results/<RUN>/:
16s/silva/,16s/gtdb/,ITS/— each withfeature_table.tsv,feature_table_with_taxonomy.tsv,rep_seqs.fasta,taxonomy.tsv,taxonomy.formatted.tsv,denoising_stats.tsv,collapsed_species.tsv,collapsed_species.formatted.tsv.16s/02_denoise/— the DADA2 table/rep-seqs/stats shared by both 16S databases.multiqc/<RUN>.multiqc.html— combined FastQC report. Sample names are prefixed with their source directory (raw,16s,ITS) so the raw, 16S-trimmed and ITS-trimmed view of each sample stay distinct; report title/intro come fromconfig/multiqc_config.yaml. The report header records the 16S/ITS forward and reverse primers this run was configured with (injected from the run's config).
Amplicons below AMP_MIN_READS_THRESHOLD total reads are skipped automatically.