# eDNA_intra_pipeline_comparison

**Bastien Macé, 2021**

_________________________________


# Table of contents

  * [I - Introduction](#intro)
  * [II - Installation](#install)
    + [Preliminary steps for OBITools](#preliminary-steps-for-obitools)
    + [Preliminary steps for DADA2](#preliminary-steps-for-dada2)
    + [Preliminary steps for SWARM](#preliminary-steps-for-swarm)
    + [Preliminary steps for LULU](#preliminary-steps-for-lulu)
    + [Preliminary steps for VSEARCH](#preliminary-steps-for-vsearch)
  * [III - Pre-processing steps](#step1)
  * [IV - Key processing steps](#step2)
    + [IV - 1 - OBITOOLS processing step (Pipelines A)](#step21)
    + [IV - 2 - DADA2 processing step (Pipelines B)](#step22)
    + [IV - 3 - SWARM processing step (Pipelines C)](#step23)
    + [IV - 4 - SWARM + LULU processing step (Pipelines D)](#step24)
  * [V - Post-processing steps](#step3)
    + [V - 1 - No post-processing step (Pipelines A1/B1/C1/D1)](#step31)
    + [V - 2 - Bimeric sequences removal (Pipelines A2/B2/C2/D2)](#step32)
    + [V - 3 - Chimeric sequences removal (Pipelines A3/B3/C3/D3)](#step33)
  * [VI - Analyse your results](#step4)

_________________________________

<a name="intro"></a>
## I - Introduction

This project aims to compare twelve bioinformatics pipelines based on five existing metabarcoding programs to make recommendations for data management in intraspecific variability studies using environmental DNA.

Data processing is necessary in metabarcoding studies to eliminate the false sequences generated during amplification and sequencing. This is particularly true for intraspecific studies based on eDNA samples, where false sequences can lead to overestimating the intraspecific genetic variability. Sequences therefore need to be filtered with bioinformatics pipelines. Several bioinformatics tools have been developed for metabarcoding studies; here, we compare some of them by building twelve distinct pipelines.

To do so, we use the following programs:

- [OBITOOLS](https://git.metabarcoding.org/obitools/obitools/wikis/home): a set of commands written in Python
- [DADA2](https://benjjneb.github.io/dada2/index.html): an R package
- [SWARM](https://github.com/torognes/swarm): a clustering program written in C++
- [LULU](https://github.com/tobiasgf/lulu): an R package
- [VSEARCH](https://github.com/torognes/vsearch): a set of commands written in C++

In our study, we analyze the results of paired-end sequencing, after extraction and amplification of filtered eDNA from aquarium seawater, to detect intraspecific haplotypic variability in *Mullus surmuletus*.

<a name="install"></a>
## II - Installation

### Preliminary steps for OBITools

You need Anaconda. If it is not installed yet, download it from this [link](https://www.anaconda.com/products/individual/get-started), run the installer in your shell, then close and reopen your shell.

Verify that conda is correctly installed. It should be located here:
```
~/anaconda3/bin/conda
```
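
You can also check that the command responds:
```
conda --version
```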

Then run the following line, so that conda does not auto-activate its base environment at every shell start:
```
conda config --set auto_activate_base false
```

Then create your new obitools environment from the corresponding YAML file. For example:
```
ENVYAML=./dada2_and_obitools/obitools_env_conda.yaml
conda env create -f $ENVYAML
```

Now you can activate the environment before running OBITOOLS commands:
```
conda activate obitools
```

And deactivate it when you are done:
```
conda deactivate
```
### Preliminary steps for DADA2

You need a recent version of R (3.6.2 minimum). If needed, download it from this [link](https://cran.r-project.org/).

Then, open your IDE (RStudio for example) and install the package. Note that dada2 is distributed through Bioconductor, not CRAN:
```
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("dada2")
```
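
You can check the installed version from the R console:
```
packageVersion("dada2")
```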

### Preliminary steps for SWARM

Clone the repository from the [author's GitHub](https://github.com/torognes/swarm) and compile it:
```
git clone https://github.com/torognes/swarm.git
cd swarm/
make
```
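
Depending on the version, the compiled binary ends up in the repository root or in its `bin/` subfolder; a quick check, assuming a recent default build:
```
./bin/swarm --version
# adjust the path if your version places the binary elsewhere
```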

### Preliminary steps for LULU

Open your IDE for R (RStudio for example) and install the package. Note that LULU is distributed through GitHub, not CRAN, so it is installed with devtools:
```
install.packages("devtools")
devtools::install_github("tobiasgf/lulu")
```

### Preliminary steps for VSEARCH

Clone the repository from the [author's GitHub](https://github.com/torognes/vsearch), then compile and install it:
```
git clone https://github.com/torognes/vsearch.git
cd vsearch
./autogen.sh
./configure
make
sudo make install
```
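
If the installation succeeded, VSEARCH should now respond from anywhere:
```
vsearch --version
```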

<a name="step1"></a>
## III - Pre-processing steps

### Merging paired-end sequenced reads (OBITOOLS)

Activate your OBITOOLS environment in your shell:
```
conda activate obitools
```

Use the _illuminapairedend_ command to merge the forward and reverse strands of the paired-end reads in your data. The command aligns the two complementary strands to obtain a longer sequence: since the last bases of a read are rarely sequenced correctly, the beginning of the reverse strand, which is usually sequenced correctly, is used to complete the forward strand.
```
illuminapairedend --score-min=40 -r mullus_surmuletus_data/Aquarium_2_F.fastq mullus_surmuletus_data/Aquarium_2_R.fastq > Aquarium_2.fastq
# a new .fastq file is created, containing the sequences after merging the forward and reverse strands
# alignments with a quality score higher than 40 (--score-min=40) are merged and annotated "aligned", while alignments with a lower score are concatenated and annotated "joined"
```

To keep only the sequences that were successfully merged, use _obigrep_:
```
obigrep -p 'mode!="joined"' Aquarium_2.fastq > Aquarium_2.ali.fastq
# -p takes a Python expression
# a new dataset (.ali.fastq) is created, containing only the sequences annotated "aligned"
```

### Demultiplexing (OBITOOLS)

A _.txt_ file assigns each sequence to its sample via its tag: each tag corresponds to the forward or reverse reads of one sample.
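
For reference, a minimal tag file might look like this (the sample name, tags and primers below are purely hypothetical, shown only to illustrate the column layout expected by _ngsfilter_):
```
#exp	sample	tags	forward_primer	reverse_primer
med	Aquarium_2	aacaagtc:aacaagtc	TTAGATACCCCACTATGC	TAGAACAGGCTCCTCTAG
```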

To compare the sequences in the next steps, you need to remove the tags and the primers with the _ngsfilter_ command:
```
ngsfilter -t mullus_surmuletus_data/Med_corr_tags.txt -u Aquarium_2.unidentified.fastq Aquarium_2.ali.fastq > Aquarium_2.ali.assigned.fastq
# the command creates new files:
# the ".unidentified.fastq" file contains the sequences that could not be assigned to a correct tag
# the ".ali.assigned.fastq" file contains the sequences that were assigned to a correct tag, so they now contain only the barcode sequences
```

Then split your _.ali.assigned.fastq_ files by sample, placing them in a dedicated folder (useful for the next steps):
```
mkdir samples
# creates the folder
mv -t samples Aquarium_2.ali.assigned.fastq
# moves the latest ".fastq" file into the folder
cd samples
obisplit -t sample --fastq Aquarium_2.ali.assigned.fastq
# splits the file into one file per sample, based on the "sample" attribute added by ngsfilter
mv -t ../dada2_and_obitools Aquarium_2.ali.assigned.fastq
# moves the original file out of the folder
```

You now have one file per sample, each containing demultiplexed sequences.

### Be prepared for DADA2

Quit your shell and open your IDE for R.

First, load the dada2 package:
```
library("dada2")
```

Select the files you want to analyze from the path containing your demultiplexed data:
```
path <- "./samples"
# hypothetical location: adjust to the folder containing your demultiplexed files
fns <- sort(list.files(path, pattern = ".fastq", full.names = TRUE))
# the function only extracts the files ending with the chosen pattern, with their full path
```

Then select the part of the file names you want to keep:
```
sample.names <- sapply(strsplit(basename(fns), ".fastq"), '[', 1)
# "basename" removes the path up to the file name
# "strsplit" splits each name at the given pattern, and '[', 1 keeps the part before it
```
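
For example, a file named `Aquarium_2.fastq` yields the sample name `Aquarium_2`.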

### Filtering & Trimming (DADA2)

Prepare the paths of a new folder that will store the filtered sequences:
```
filts <- file.path(path, "filtered", paste0(sample.names, ".filt.fastq.gz"))
# builds the path to the new folder, named "filtered", inside the path already used
# the files are named after sample.names, with the suffix ".filt.fastq.gz" appended
```

These files are created by trimming and filtering the reads according to several criteria:
```
out <- filterAndTrim(fns, filts,
                     truncLen = 235,
                     maxN = 0,
                     maxEE = 1,
                     compress = T,
                     verbose = T)
# "truncLen" value is chosen considering the marker length and define were the reads will be trimmed (after 235 bp here), and reads which are shortened than this value are filtered
# "maxN" is the number of N tolerated in the sequences after filtering (0 here)
# "maxEE" define the maximal number of expected errors tolerated in a read (1 here), based on the quality score (EE = sum(10^(-Q/10)))
# "compress = T" means that the files will be gzipped
# "verbose = T" means that information concerning the number of sequences after filtering will be given
```
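
The returned object summarizes the filtering; you can check how many reads survived for each file:
```
head(out)
# a matrix with one row per file, giving the read counts before ("reads.in") and after ("reads.out") filtering
```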

Filtering cleans the data by eliminating a large number of sequences that would be unusable for our study, and trimming makes the sequence comparisons of the next steps easier.

### Dereplication (DADA2)

Now you can eliminate all the replicates of each sequence in the new _.fastq.gz_ files:
```
derep <- derepFastq(filts)
# the function annotates each unique sequence with its abundance
```

This dereplication considerably reduces the processing time of the next steps, and no information is lost, since the abundance (or read count) of each unique sequence is stored with it.
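
As a quick sanity check (a minimal sketch, assuming `filts` lists several files, so that `derep` is a list with one element per sample), you can display the read counts of the most abundant unique sequences in the first sample:
```
# each derep object stores a named integer vector "uniques":
# names are the unique sequences, values are their read counts
head(sort(derep[[1]]$uniques, decreasing = TRUE), 3)
```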

<a name="step2"></a>
## IV - Key processing steps

<a name="step21"></a>
### IV - 1 - OBITOOLS processing step (Pipelines A)

<a name="step22"></a>
### IV - 2 - DADA2 processing step (Pipelines B)

<a name="step23"></a>
### IV - 3 - SWARM processing step (Pipelines C)

<a name="step24"></a>
### IV - 4 - SWARM + LULU processing step (Pipelines D)