
Commit cb6d61b
Update README
1 parent 753ce89 commit cb6d61b

1 file changed: README.md (50 additions & 53 deletions)

# eDNA_metabarcoding
This repo has been developed at the Molecular Genetics Laboratory of the Pacific Biological Station (Fisheries and Oceans Canada) as part of Kristi Miller's working group. The pipeline was developed to analyze eDNA and other metabarcoding datasets for projects within this lab, and carries no guarantee of functionality or usefulness for other applications.

Dependencies:
`OBITools` http://metabarcoding.org/obitools/doc/welcome.html
`MEGAN 6 (Community Edition)` https://ab.inf.uni-tuebingen.de/software/megan6
`blastn` https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
`cutadapt` http://cutadapt.readthedocs.io/en/stable/
`R` https://www.r-project.org/

To make OBITools available everywhere, add both the obitools binary and the obitools `/export/bin` folder to your `$PATH`.
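For example, assuming OBITools was unpacked under `/path/to/OBITools` (a placeholder path; substitute your actual install location), the line to add to your shell profile might look like:

```
# Append the OBITools export/bin folder to PATH (placeholder path; adjust to your install)
export PATH="$PATH:/path/to/OBITools/export/bin"
```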

Before starting, launch OBITools from the terminal:
`obitools`
This pipeline can handle the following datatypes. To find the appropriate pipeline, see Figure 1:
* single-end (SE) or paired-end (PE) data
* demultiplexed or multiplexed samples in fastq.gz format
* multiple amplicons within a single sample file

![](00_archive/eDNA_metabarcoding_workflow.png)
**Figure 1.** eDNA_metabarcoding workflow, showing the different options for the first part of the pipeline, used to analyze the datatypes listed above. The grey box pipelines are variants derived from the standard multiplexed workflow, which is currently more stable.
### Prepare Raw Data
Copy raw data into `02_raw_data`, decompress, then run fastqc to view quality.

```
fastqc -o 02_raw_data/fastqc_output 02_raw_data/*.fastq
multiqc -o 02_raw_data/fastqc_output/ 02_raw_data/fastqc_output
```

If you want to account for the number of reads in the input fastq files, producing basic statistics on the reads (e.g. mean, sd), use the following:
`./01_scripts/account_reads.sh`
...followed by the Rscript, run interactively:
`./01_scripts/account_reads.R`
If you plan on read merging or have multiple amplicons in each fastq file, wait to do read accounting until later in the pipeline.

### Prepare Interpretation File
The interpretation (interp) file must be made for each sequencing lane or chip separately.
Use `00_archive/interp_example.txt` as a template to prepare the interp file.
**Importantly**, name the interp file after the input fastq name, replacing the section `R[1/2]_001.fastq` with `interp.txt`
e.g. `Lib1_S1_L001_R1_001.fastq`, `Lib1_S1_L001_R2_001.fastq`, `Lib1_S1_L001_interp.txt`
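The renaming convention above can be expressed as a shell substitution; a minimal sketch using the example file name:

```
# Build the interp file name from the R1 fastq name shown above
fq="Lib1_S1_L001_R1_001.fastq"
interp="${fq/R1_001.fastq/interp.txt}"
echo "$interp"   # Lib1_S1_L001_interp.txt
```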


## Part 1A. Enter Pipeline - Multiplexed Samples
This section treats fastq files that contain more than one sample.

### 1A.1.a. PE Data: Merge Reads
Paired-end data first undergo read merging, using illuminapairedend with a minimum alignment score of 40. Output is written per input file to the `03_merged` folder:
`01_scripts/01a_read_merging.sh`
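In brief, per an earlier version of this README, the script wraps an illuminapairedend call along these lines (file names are illustrative; the script loops over the real files):

```
# Merge each R1/R2 pair; overlapping reads are assembled into a single sequence
illuminapairedend --score-min=40 -r 02_raw_data/sample_R2.fq 02_raw_data/sample_R1.fq > 03_merged/sample_merged.fq
```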
Retain only the merged (also termed 'aligned') reads using obigrep:
`01_scripts/01b_retain_aligned.sh`
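In brief, per an earlier version of this README, the obigrep call keeps only reads whose merge annotation is not 'joined', i.e. the successfully aligned reads (file names are illustrative):

```
obigrep -p 'mode!="joined"' 03_merged/sample_merged.fq > 03_merged/sample_ali.fq
```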
#### Post-Merge Read Accounting
The following script will account for reads and merges:
`01_scripts/check_merging.sh`
...followed by the Rscript, run interactively:
`01_scripts/read_and_alignment_summary.R`
### 1A.1.b. SE Data: Rename Files
If you have SE data, rename the files to match the PE file naming used by the rest of the pipeline. Run the following for each fastq file:
`cp -l 02_raw_data/your_file_R1_001.fastq 03_merged/your_file_ali.fq`
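To avoid typing this for every file, a small loop can do the linking (a sketch; the demo fixture below is a placeholder so the loop has something to act on):

```
# Demo fixture (in practice your raw reads are already in 02_raw_data)
mkdir -p 02_raw_data 03_merged
touch 02_raw_data/demo_S1_L001_R1_001.fastq

# Hard-link each SE file into 03_merged under the *_ali.fq name the pipeline expects
for fq in 02_raw_data/*_R1_001.fastq; do
  base=$(basename "$fq" _R1_001.fastq)
  cp -l "$fq" "03_merged/${base}_ali.fq"
done
```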


### 1A.2. Separate Individuals
Use ngsfilter with the interp file(s) to demultiplex samples out of the `*.ali.fq` file(s). Results will be separated by sample and placed in `04_samples`.
`./01_scripts/02_ngsfilter.sh`
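In brief, per an earlier version of this README, the script wraps an ngsfilter call like the following (file names are illustrative; `-u` collects reads that could not be assigned to a sample):

```
ngsfilter -t 00_archive/lib1_interp.txt -u 04_samples/lib1_unidentified.fq 03_merged/lib1_ali.fq > 04_samples/lib1_ali_assi.fq
```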
Audit: how many reads were assigned to a sample?
`for i in $(ls 04_samples/*_ali_assi.fq) ; do echo $i ; grep -cE '^\+$' $i ; done`
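The count works because each fastq record is four lines whose third line is a bare `+` separator, so counting those lines counts reads. A toy demonstration on placeholder data:

```
# Build a two-read fastq and count its records
mkdir -p 04_samples
printf '@r1 sample=A\nACGT\n+\nIIII\n@r2 sample=A\nTTTT\n+\nIIII\n' > 04_samples/demo_ali_assi.fq
grep -cE '^\+$' 04_samples/demo_ali_assi.fq   # prints: 2
```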

Each output file should now be annotated with a sample ID in the read header; one can then concatenate all files together:
```
mkdir 04_samples/sep_indiv
mv 04_samples/*.fq 04_samples/sep_indiv
cat 04_samples/sep_indiv/*_ali_assi.fq > 04_samples/all_files_ali_assi.fq
```

Move on to [Part 2](#part-2-main-analysis).
## Part 1B. Enter Pipeline - De-Multiplexed Data Variants
This section treats fastq files that are already de-multiplexed.
Depending on the data type (see Figure 1), the steps taken here will vary; see Variant A and Variant B below.
### Variant A. De-multiplexed single-amplicon (SE and PE)
### 1B.0. Cutadapt
Barcodes are not used to de-multiplex here, but the primer sequence still must be removed. Set the primer variables in the following script, then run it to produce fastq files with primers removed, renamed as 'yourfile_noprime.fastq':
`01_scripts/00_cutadapt_primer_seq.sh`
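An earlier version of this README summarized the underlying call roughly as below; primer sequences and file names are placeholders (`-g`/`-G` trim the 5' primer from R1 and R2 respectively):

```
cutadapt -g FWDPRIMER -G REVPRIMER \
  -o 02_raw_data/sample_R1_001_noprime.fastq \
  -p 02_raw_data/sample_R2_001_noprime.fastq \
  02_raw_data/sample_R1_001.fastq.gz 02_raw_data/sample_R2_001.fastq.gz
```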
### 1B.1.a. PE Reads: Merge
Similar to above, merge PE reads on the primer-removed fastq files:
`01_scripts/01a_read_merging_noprime.sh`
`01_scripts/01b_retain_aligned.sh`

As ngsfilter was not used here, each read must be annotated with a sample ID before combining samples. The following script uses obiannotate to name records, then combines all into `04_samples/*merged_data_assi.fq`:
`01_scripts/obiannotate_ident.sh`
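In brief, per an earlier version of this README, the script annotates each file and then concatenates (sample names are illustrative):

```
# Tag every read in a file with its sample ID, then pool all annotated files
obiannotate -S sample:sample01 04_samples/sample01_ali.fq > 04_samples/sample01_sannot.fq
cat 04_samples/*_sannot.fq > 04_samples/merged_data_assi.fq
```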
Now move on to [Part 2](#part-2-main-analysis).
### 1B.1.b. SE Reads: Prepare Data for Obitools
To enter the pipeline with reads formatted for obitools, run a pass of ngsfilter that is designed to fail all reads, then take all of the 'unidentified' reads per sample and move forward:
`01_scripts/02_ngsfilter_SE_exp_unident.sh`
All the data should be in 'unidentified' files per sample, and all 'assigned' files should be empty here.
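In brief, per an earlier version of this README, the pass looks like the following (file names are illustrative):

```
# With no matching barcodes, every read lands in the 'unidentified' output
ngsfilter -t "$interp" -u 04_samples/sample_unidentified.fq 02_raw_data/sample_001.fastq > 04_samples/sample_assi.fq
```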
Each read will then be annotated with a sample ID, then all files combined together:
`01_scripts/obiannotate_unident.sh`

To improve identification of identical amplicons, cut the SE data down to a uniform size (230 bp) with cutadapt:
`cutadapt --length 230 -o 04b_annotated_samples/merged_data_assi_230.fq 04b_annotated_samples/merged_data_assi.fq`
Move on to [Part 2](#part-2-main-analysis).
### Variant B. De-multiplexed multiple-amplicon (SE option only)
SE data enter the obitools pipeline via ngsfilter, with each sample's primer sequence split into its first six basepairs as a fake 'barcode' and the remainder of the primer given as the primer sequence. This way you can, per sample, de-multiplex your data by amplicon type:
`01_scripts/02_ngsfilter.sh`
Per sample, the data can be split into the two amplicon types (it is currently just named in the accession):
`01_scripts/00b_split_by_type.sh`
Note that this script will need to be edited for your use. It currently uses grep to take the three lines following the identifier of interest.
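As a sketch of that grep approach (identifiers and file names are hypothetical; `-A3` keeps the three lines after each matching header, and the `sed` drops grep's `--` group separators):

```
# Demo fastq with two amplicon types tagged in the headers (placeholder data)
mkdir -p 04_samples
printf '@r1 amp=valA\nACGT\n+\nIIII\n@r2 amp=valB\nTTTT\n+\nIIII\n' > 04_samples/demo_two_amps.fq

# Keep only records whose header names amplicon valA
grep -A3 'amp=valA' 04_samples/demo_two_amps.fq | sed '/^--$/d' > 04_samples/demo_valA.fq
```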
