Skip to content

AgResearch/smk-metabarcoding

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

smk-metabarcoding

Snakemake pipeline for 16S/ITS metabarcoding which is extensible to other amplicons.

Reads may contain a mix of 16S and ITS amplicons pooled in the same library. Reads are primer-split with cutadapt, then each detected amplicon is processed independently through QIIME2 + DADA2 to count matrices with taxonomy.

Reference databases (pretrained classifiers)

Taxonomy uses three pretrained, prebuilt QIIME2 classifiers. 16S reads are classified twice — once with SILVA and once with GTDB:

Amplicon Classifier QIIME2 / scikit-learn / archive Classify runs on
16S SILVA 138.2 full-length SSU NR99, uniform (arb-silva) 2025.7 / 1.4.2 / archive 7.0 2026.1 conda env
16S GTDB r226.0 full-length (Zenodo 19686584) 2026.4 / 1.7.1 / archive 7.1 2026.4 conda env
ITS UNITE v10.0 dynamic / all-eukaryotes (unite-train) 2026.4 / 1.7.1 / archive 7.x 2026.4 conda env

Each pretrained classifier must be loaded by an env that can both run its scikit-learn version and read its .qza archive format, so the pipeline uses three environments (see workflow/amplicon-16s-ITS-QIIME2-DADA2.smk):

  • QIIME2/2024.10-amplicon module (archive 6) — shared 16S import/denoise. Its archive-6 outputs are readable by every newer env.
  • 2026.1 conda env (sklearn 1.4.2, archive 7.0) — SILVA classify only. The 2024.10 module cannot read the SILVA classifier's archive 7.0, and the 2026.4 env cannot load a 1.4.2 classifier; 2026.1 satisfies both.
  • 2026.4 conda env (sklearn 1.7.1, archive 7.1) — GTDB classify, all 16S collapse/export, and the entire ITS chain.

Citation: arb-silva requires citing the SILVA classifier DOI. It is saved in resources/16s/SILVA138.2_SSURef_NR99_uniform_classifier_full-length.qza.doi. Also cite GTDB (https://doi.org/10.1093/nar/gkab776) and UNITE.

One-time setup

# 1. Create the QIIME2 conda envs (Snakemake also builds its own copies under
#    .snakemake/conda from the same files; these named envs are handy for manual
#    `qiime tools peek` checks). The 2024.10 env is the existing module.
conda env create -n qiime2-amplicon-2026.4 \
  --file workflow/envs/qiime2-amplicon-2026.4.yaml
conda env create -n qiime2-amplicon-2026.1 \
  --file workflow/envs/qiime2-amplicon-2026.1.yaml   # SILVA classify (sklearn 1.4.2)

# 2. Download the three pretrained classifiers into resources/16s and resources/ITS
bash resources/download_classifiers.sh

# 3. gzip-stage the NZGL00969 runs (uncompressed .fastq in a read-only archive)
#    into a writable staging dir, one subdir per run.
bash resources/stage_fastq_runs.sh

Classifier paths and primers are configured in config/pipeline_config.yaml. Primers were confirmed with primer_check.sh (16S V1-V2 = 27F/338R; ITS2 = ITS3_KYO2/ITS4).

Running

The pipeline processes one run per invocation. Run it once per staged run, overriding AMPLICON_RUN_PATH on the CLI:

module load snakemake
STAGING=/mnt/gpfs/persist/projects/2026_ssif_advancing_soil_microbiome/staging
for runid in \
  150325_M00933_0072_000000000-AE9PC \
  150325_M02810_0078_000000000-AE9PL \
  150330_M02810_0080_000000000-ADKJU \
  150417_M02810_0085_000000000-AEBHB ; do
  snakemake \
    --profile config/slurm_profiles/eri \
    --snakefile workflow/amplicon-16s-ITS-QIIME2-DADA2.smk \
    --config AMPLICON_RUN_PATH="$STAGING/$runid"
done

Outputs

Per run, under results/<RUN>/:

  • 16s/silva/, 16s/gtdb/, ITS/ — each with feature_table.tsv, feature_table_with_taxonomy.tsv, rep_seqs.fasta, taxonomy.tsv, taxonomy.formatted.tsv, denoising_stats.tsv, collapsed_species.tsv, collapsed_species.formatted.tsv.
  • 16s/02_denoise/ — the DADA2 table/rep-seqs/stats shared by both 16S databases.
  • multiqc/<RUN>.multiqc.html — combined FastQC report. Sample names are prefixed with their source directory (raw, 16s, ITS) so the raw, 16S-trimmed and ITS-trimmed view of each sample stay distinct; report title/intro come from config/multiqc_config.yaml. The report header records the 16S/ITS forward and reverse primers this run was configured with (injected from the run's config).

Amplicons below AMP_MIN_READS_THRESHOLD total reads are skipped automatically.

About

Snakemake pipeline for 16s/ITS metabarcoding which is extensible to other amplicons

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors