# eDNA_metabarcoding
This repo has been developed at the Molecular Genetics Laboratory of Pacific Biological Station (Fisheries and Oceans Canada) as part of the working group of Kristi Miller. This pipeline was developed for the purposes of analyzing eDNA and other metabarcoding datasets for the projects within this lab, and carries no guarantees of functionality or usefulness for other applications.
To make obitools available everywhere, add both the obitools binary and the obitools `/export/bin` folder to your path.
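
For example, assuming obitools was unpacked under `/opt/obitools` (a hypothetical path; substitute your actual install location), the following could be appended to your `~/.bash_profile`:

```shell
# Hypothetical install path: replace /opt/obitools with your actual obitools location
export PATH="$PATH:/opt/obitools:/opt/obitools/export/bin"
```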
Before starting, launch OBITools from the terminal:

`obitools`
This pipeline can handle the following datatypes. To find the appropriate pipeline, see Figure 1:
* single-end (SE) or paired-end (PE) data
* demultiplexed or multiplexed samples in fastq.gz format
* multiple amplicons within a single sample file

**Figure 1.** eDNA_metabarcoding workflow, showing the different options for the first part of the pipeline to analyze the datatypes listed above. The grey box pipelines are variants derived from the standard multiplexed workflow, which is currently more stable.
### Prepare Raw Data
Copy raw data into `02_raw_data`, decompress, then run fastqc to view quality.
If you want to account for the number of reads in the input fastq files, producing some basic statistics on the reads (e.g. mean, sd, etc.), use the following:
`./01_scripts/account_reads.sh`
...followed by the Rscript run interactively:
`./01_scripts/account_reads.R`
If you plan to do read merging, or have multiple amplicons in each fastq file, wait until later in the pipeline to do read accounting.
### Prepare Interpretation File
The interpretation (interp) file must be made for each sequencing lane or chip separately.
Use `00_archive/interp_example.txt` as a template to prepare the interp.
**Importantly**, name the interp file after the input fastq name, replacing the section `R[1/2]_001.fastq` with `interp.txt`
e.g. `Lib1_S1_L001_R1_001.fastq`, `Lib1_S1_L001_R2_001.fastq`, `Lib1_S1_L001_interp.txt`
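
The naming can be scripted; a minimal sketch using the example filenames above:

```shell
# Derive the interp filename from an input fastq filename (example names from above)
fq="Lib1_S1_L001_R1_001.fastq"
interp=$(echo "$fq" | sed -E 's/R[12]_001\.fastq$/interp.txt/')
echo "$interp"
```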
## Part 1A. Enter Pipeline - Multiplexed Samples
This section treats fastq files that contain more than one sample.
### 1A.1.a. PE Data: Merge Reads
Paired-end data will undergo read merging first using illuminapairedend with a minimum score of 40. Output will be per file in the `03_merged` folder.
`01_scripts/01a_read_merging.sh`
(in brief: `illuminapairedend --score-min=40 -r 02_raw_data/*R2.fq 02_raw_data/*R1.fq > 03_merged/*merged.fq`)

Retain only the merged (also termed 'aligned') reads using obigrep:

`01_scripts/01b_retain_aligned.sh`

(in brief: `obigrep -p 'mode!="joined"' 03_merged/*merged.fq > 03_merged/*ali.fq`)
#### Post-Merge Read Accounting
The following script will account for reads and merges:
`01_scripts/check_merging.sh`
...followed by the Rscript run interactively:
`01_scripts/read_and_alignment_summary.R`
### 1A.1.b. SE Data: Rename Files
If you have SE data, the file names must be made to match those of PE data for the pipeline; run for each fq file:
Use ngsfilter with the interp file(s) to demultiplex samples out of the `*.ali.fq` file(s). Results will be separated by sample and placed in `04_samples`.
`./01_scripts/02_ngsfilter.sh`
(in brief: `ngsfilter -t 00_archive/*_interp.txt -u unidentified.fq 03_merged/*ali.fq > 04_samples/*_ali_assi.fq`)
Audit: how many reads were assigned to a sample?
`for i in $(ls 04_samples/*_ali_assi.fq) ; do echo $i ; grep -cE '^\+$' $i ; done`
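
Note that counting `+` separator lines can over-count if a quality string happens to begin with `+`; since each fastq record is four lines, dividing the total line count by four is a safer sketch:

```shell
# Count reads by dividing total lines by four (standard four-line fastq records)
count_reads() { echo $(( $(wc -l < "$1") / 4 )); }

# Tiny two-read example file to demonstrate (hypothetical data)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > /tmp/example.fq
count_reads /tmp/example.fq
```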

Since all files should now be annotated with a sample ID in the read header, one can concatenate all files together:
## Part 1B. Enter Pipeline - De-Multiplexed Data Variants
This section treats fastq files that are already de-multiplexed.
Depending on the data type (see Figure 1), the steps taken here will vary (also see Variant A and Variant B).
### Variant A. De-multiplexed single-amplicon (SE and PE)
### 1B.0. Cutadapt
Barcodes are not used to de-multiplex here, but the primer sequence still must be removed. Set the primer variables in the following script and run it to produce fastq files without the primer, renamed as 'yourfile_noprime.fastq'.
### 1B.1.a. PE Reads: Merge
Similar to above, merge PE reads on the primer-removed fastq files:
`01_scripts/01a_read_merging_noprime.sh`
`01_scripts/01b_retain_aligned.sh`
As ngsfilter was not used here, before combining multiple samples together, we need to annotate each read with a sample ID. This is conducted with the following script, which uses obiannotate to name records, then combines all into `04_samples/*merged_data_assi.fq`:
`01_scripts/obiannotate_ident.sh`
(in brief: `obiannotate -S sample:$i 04_samples/*.fq > 04_samples/*_sannot.fq`)
(in brief: `cat *datatype_sannot.fq > 04_samples/datatype_merged_data_assi.fq`)
Now move on to [Part 2](#part-2-main-analysis).
### 1B.1.b. SE Reads: Prepare Data for Obitools
To get into the pipeline with reads formatted for obitools, use a pass through ngsfilter that is designed to fail all reads, then take all of the unidentified reads for your sample and move forward:
`01_scripts/02_ngsfilter_SE_exp_unident.sh`
(in brief: `ngsfilter -t $interp -u 04_samples/*_unidentified.fq 02_raw_data/*_001.fastq > 04_samples/*_assi.fq`)
All the data should be in 'unidentified' files per sample, and all 'assigned' files should be empty here.
Each read will then be annotated with a sample ID, then all files combined together:
`01_scripts/obiannotate_unident.sh`
(in brief: `obiannotate -S sample:$i 04_samples/*L001_R1_unidentified.fq > 04_samples/*_sannot.fq`)
(in brief: `cat 04_samples/*_sannot.fq > 04_samples/merged_data_assi.fq`)
To identify identical amplicons, cut SE data down to a uniform size (230 bp) with cutadapt:
### Variant B. De-multiplexed multiple-amplicon (SE option only)
SE data will enter the obitools pipeline via ngsfilter, with each sample's primer sequence split into the first six basepairs as a fake 'barcode' and the remainder as the primer sequence. This way you can, per sample, de-multiplex your data by amplicon type.
`01_scripts/02_ngsfilter.sh`
Per sample, the data can be split into the two amplicon types (it is currently just named in the accession):
`01_scripts/00b_split_by_type.sh`
Note that this script will need to be edited for your use; it currently uses grep to take the three lines following the identifier of interest.
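
As a sketch of that grep approach (the `amplicon=A` tag here is hypothetical; the real identifier comes from the ngsfilter annotation), `grep -A3` keeps each matching header plus the following three record lines:

```shell
# Tiny mixed-amplicon example fastq (hypothetical tags in the headers)
printf '@r1 amplicon=A\nACGT\n+\nIIII\n@r2 amplicon=B\nTTTT\n+\nIIII\n' > /tmp/mixed.fq

# Keep type-A records: the matching header plus the next three lines,
# dropping grep's '--' group separators if any appear between matches
grep -A3 'amplicon=A' /tmp/mixed.fq | grep -v '^--$' > /tmp/typeA.fq
```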