# eDNA_intra_pipeline_comparison

**Bastien Macé, 2021**

This project corresponds to the bioinformatics steps of [this published article](https://onlinelibrary.wiley.com/doi/10.1002/edn3.269), based on the datasets available [here](https://zenodo.org/record/4570303).
_________________________________


# Table of contents

  * [I - Introduction](#intro)
  * [II - Installation](#install)
    + [Preliminary steps for OBITools](#preliminary-steps-for-obitools)
    + [Preliminary steps for DADA2](#preliminary-steps-for-dada2)
    + [Preliminary steps for SWARM](#preliminary-steps-for-swarm)
    + [Preliminary steps for LULU](#preliminary-steps-for-lulu)
    + [Preliminary steps for VSEARCH](#preliminary-steps-for-vsearch)
  * [III - Pre-processing steps](#step1)
  * [IV - Key processing steps](#step2)
    + [IV - 1 - OBITools processing step (Pipelines A)](#step21)
    + [IV - 2 - DADA2 processing step (Pipelines B)](#step22)
    + [IV - 3 - SWARM processing step (Pipelines C)](#step23)
    + [IV - 4 - SWARM + LULU processing step (Pipelines D)](#step24)
  * [V - Abundance filtering step](#step3)  
  * [VI - Post-processing steps](#step4)
    + [VI - 1 - No post-processing step (Pipelines A1/B1/C1/D1)](#step41)
    + [VI - 2 - Bimeric sequences removal (Pipelines A2/B2/C2/D2)](#step42)
    + [VI - 3 - Chimeric sequences removal (Pipelines A3/B3/C3/D3)](#step43)
  * [VII - Analyse your results](#step5)

_________________________________

<a name="intro"></a>
## I - Introduction

This project aims to compare twelve bioinformatics pipelines based on five existing metabarcoding programs to make recommendations about population-level inference from environmental DNA sequence data.

Data processing is necessary in metabarcoding studies to eliminate false sequences generated during amplification and sequencing. This is particularly true for intraspecific studies from eDNA samples, where the presence of false sequences in the data can lead to over-estimating intraspecific genetic variability.
This is why sequences need to be filtered with bioinformatics pipelines. Different bioinformatics tools have been developed to handle metabarcoding data. Here, we propose to compare some of them by building twelve distinct pipelines.

For that, we use the following programs:

- [OBITools](https://git.metabarcoding.org/obitools/obitools/wikis/home): a set of commands written in Python
- [dada2](https://benjjneb.github.io/dada2/index.html): an R package
- [swarm](https://github.com/torognes/swarm): a command written in C++
- [lulu](https://github.com/tobiasgf/lulu): an R package
- [vsearch](https://github.com/torognes/vsearch): a set of commands written in C++

The following figure summarizes the twelve pipelines compared in our study:

![Figure](Figure.jpg)

In this study, we analyze the results of paired-end sequencing, after extraction and amplification of filtered eDNA from aquarium seawater, to detect intraspecific haplotypic variability in *Mullus surmuletus*. Only one aquarium is given as an example in the scripts.

<a name="install"></a>
## II - Installation

### Preliminary steps for OBITools

After installing the [OBITools](https://git.metabarcoding.org/obitools/obitools), create a dedicated conda environment from the corresponding YAML file, in your bash terminal, as follows:
```
ENVYAML=./dada2_and_obitools/obitools_env_conda.yaml
conda env create -f $ENVYAML
```

Now you can activate your environment before starting **OBITools** commands:
```
conda activate obitools
```

And deactivate it:
```
conda deactivate
```
### Preliminary steps for dada2

You need a recent R version (at least 3.6.2) to install the package, which is distributed through Bioconductor:
```
install.packages("BiocManager")
BiocManager::install("dada2")
```

### Preliminary steps for swarm

Clone the [swarm](https://github.com/torognes/swarm) repository and compile it from your bash terminal:
```
git clone https://github.com/torognes/swarm.git
cd swarm/
make
```

### Preliminary steps for lulu

Install the package from R (**lulu** is distributed through GitHub):
```
install.packages("devtools")
devtools::install_github("tobiasgf/lulu")
```

### Preliminary steps for VSEARCH

Clone the [vsearch](https://github.com/torognes/vsearch) repository and build it from your bash terminal:
```
git clone https://github.com/torognes/vsearch.git
cd vsearch
./autogen.sh
./configure
make
sudo make install
```

<a name="step1"></a>
## III - Pre-processing steps

### Merging paired-end sequenced reads (OBITools)

Activate your environment for **OBITools** in your bash terminal:
```
conda activate obitools
```

Use the **illuminapairedend** command to merge the forward and reverse reads of your paired-end data. The command aligns the complementary strands in order to obtain a longer sequence: the last bases of a read are rarely correctly sequenced after PCR, so having both the forward and the reverse strands makes it possible to lengthen the sequence thanks to the beginning of the reverse strand, which is usually correctly sequenced.
```
illuminapairedend --score-min=40 -r mullus_surmuletus_data/Aquarium_2_F.fastq mullus_surmuletus_data/Aquarium_2_R.fastq > Aquarium_2.fastq
# a new ".fastq" file is created, containing the sequences resulting from the merging of forward and reverse strands
# alignments with a quality score higher than 40 (--score-min=40) are merged and annotated "aligned", while alignments with a lower quality score are concatenated and annotated "joined"
```
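The principle of this merging can be sketched in a few lines of Python. This is a toy illustration using exact overlap matching only; the real **illuminapairedend** scores alignments with the base quality values:

```python
# Toy sketch of paired-end merging: align the forward read with the reverse
# complement of the reverse read, and join them at the best exact overlap.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(forward, reverse, min_overlap=4):
    """Return the merged sequence, or None if no sufficient overlap is found."""
    rev = revcomp(reverse)
    # try the longest possible overlap first
    for k in range(min(len(forward), len(rev)), min_overlap - 1, -1):
        if forward[-k:] == rev[:k]:
            return forward + rev[k:]
    return None  # such a pair would be "joined" (concatenated) rather than "aligned"
```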

To keep only the sequences that were actually merged, use **obigrep**:
```
obigrep -p 'mode!="joined"' Aquarium_2.fastq > Aquarium_2.ali.fastq
# the "-p" option requires a Python expression
# a new dataset (".ali.fastq") is created, containing only the sequences annotated "aligned"
```

### Demultiplexing (OBITools)

A .txt file assigns each sequence to its sample thanks to its tag: each tag corresponds to a forward or a reverse read from one sample.

To make the sequences comparable in the next steps, you need to remove the tags and the primers, using the **ngsfilter** command:
```
ngsfilter -t mullus_surmuletus_data/Med_corr_tags.txt -u Aquarium_2.unidentified.fastq Aquarium_2.ali.fastq > Aquarium_2.ali.assigned.fastq
# the command creates two new files:
# the ".unidentified.fastq" file contains the sequences that were not assigned with a correct tag
# the ".ali.assigned.fastq" file contains the sequences that were assigned with a correct tag, so they only contain the barcode sequences
```
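What **ngsfilter** does for each read can be sketched as follows. The tags and primer below are made up for illustration, and real demultiplexing also handles the reverse orientation and sequencing errors:

```python
# Toy demultiplexing sketch: assign a read to a sample by its tag, then strip
# the tag and the primer, keeping only the barcode sequence.

TAGS = {"AATT": "sample_A", "GGCC": "sample_B"}  # tag -> sample name (hypothetical)
PRIMER = "ACCT"                                  # forward primer (hypothetical)

def demultiplex(read):
    """Return (sample, barcode), or (None, read) if tag or primer is not found."""
    tag, rest = read[:4], read[4:]
    sample = TAGS.get(tag)
    if sample is None or not rest.startswith(PRIMER):
        return None, read                 # would go to the ".unidentified" file
    return sample, rest[len(PRIMER):]     # only the barcode sequence is kept
```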

Then, separate your .ali.assigned.fastq files depending on their samples by placing them in a dedicated folder (useful for the next steps):
```
mkdir samples
# creates the folder
mv -t samples Aquarium_2.ali.assigned.fastq
# places the latest ".fastq" file in the folder
cd samples
obisplit -t sample --fastq Aquarium_2.ali.assigned.fastq
# separates the file into one file per sample
mv -t ../dada2_and_obitools Aquarium_2.ali.assigned.fastq
# removes the original file from the folder
```

Now you have as many files as samples, containing demultiplexed sequences.

### Be prepared for dada2

Quit your bash terminal and open your IDE for R.

First you have to load the **dada2** package:
```
library("dada2")
```

Select the files you want to analyze in your path containing your demultiplexed data:
```
fns <- sort(list.files(path, pattern = ".fastq", full.names = T))
# "path" must contain the path to your demultiplexed files
# the function only extracts the files ending with the chosen pattern, and returns them with their full path
```

And select the part of the file names you want to keep:
```
sample.names <- sapply(strsplit(basename(fns), ".fastq"), '[', 1)
# the "basename" function removes the path up to the file name
# the "strsplit" function splits each name at the given pattern, and '[', 1 keeps the part before it
```

### Filtering & Trimming (dada2)

Initiate the creation of a new folder to store the filtered sequences generated:
```
filts <- file.path(path, "filtered", paste0(sample.names, ".filt.fastq.gz"))
# builds the path to the new folder, which will be located in the path already used and whose name will be "filtered"
# the files are named as described before by sample.names, with the suffix ".filt.fastq.gz" added
```

These files are created after trimming and filtering with different criteria:
```
out <- filterAndTrim(fns, filts,
                     truncLen = 235,
                     maxN = 0,
                     maxEE = 1,
                     compress = T,
                     verbose = T)
# the "truncLen" value is chosen considering the marker length and defines where the reads are trimmed (after 235 bp here); reads shorter than this value are discarded
# "maxN" is the number of ambiguous bases (N) tolerated in a sequence after filtering (0 here)
# "maxEE" defines the maximum number of expected errors tolerated in a read (1 here), based on the quality scores (EE = sum(10^(-Q/10)))
# "compress = T" means that the output files will be gzipped
# "verbose = T" prints the number of sequences kept after filtering
```

Filtering cleans the data by eliminating a large number of unexploitable sequences, and trimming makes the sequences easier to compare in the next steps.
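The maxEE criterion can be computed by hand; a minimal sketch of the formula given above:

```python
# Sketch of the "maxEE" filter: the expected number of errors of a read is
# computed from its per-base Phred quality scores as EE = sum(10^(-Q/10)).

def expected_errors(qualities):
    """Expected number of errors of a read, given its Phred scores."""
    return sum(10 ** (-q / 10) for q in qualities)

def passes_filter(qualities, max_ee=1.0):
    """True if the read would be kept with maxEE = max_ee."""
    return expected_errors(qualities) <= max_ee
```

For example, a 200 bp read entirely at Q30 has EE = 200 × 0.001 = 0.2 and passes maxEE = 1, while a short run of very low-quality bases is enough to reject a read.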

### Dereplication (dada2)

Now you can collapse all the replicates of each sequence in the new .fastq.gz files:
```
derep <- derepFastq(filts)
# the function annotates each unique sequence with its abundance
```

This dereplication will considerably reduce the processing time of the next steps, and no information is lost as the abundance (or read count) of each sequence is now annotated in its header.
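Dereplication itself is conceptually simple; a minimal Python equivalent:

```python
# Minimal dereplication sketch: collapse identical reads into unique sequences
# annotated with their abundance, most abundant first.

from collections import Counter

def dereplicate(reads):
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda item: -item[1])
```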

<a name="step2"></a>
## IV - Key processing steps

For Pipelines A, C, and D, you will need to run the following commands in R to be able to use the dereplicated files with **OBITools** and **swarm**:
```
seqtab <- makeSequenceTable(derep)
uniqueSeqs <- getUniques(seqtab)
uniquesToFasta(uniqueSeqs, "./path/Aquarium2.derep.fasta")
```

The header of the sequences contains the number of reads of each sequence after "size=". This file can be directly used with **swarm**, but for **OBITools** it is necessary to transform "size=" into "count=".
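This header transformation is a simple text substitution; a sketch in Python (the exact OBITools header layout may differ slightly, so treat this as illustrative):

```python
# Sketch of the header conversion: dada2/swarm annotate abundance as ";size=N;"
# while OBITools expects "count=N;". Applied to a whole fasta text at once.

import re

def size_to_count(fasta_text):
    return re.sub(r";size=(\d+);?", r" count=\1;", fasta_text)
```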

<a name="step21"></a>
### IV - 1 - OBITools processing step (Pipelines A)

The **OBITools** command used in pipelines A is **obiclean**. This command eliminates point errors introduced during PCR. The algorithm makes pairwise alignments of all the amplicons, counts the number of dissimilarities between them, and calculates the ratio between the abundances of the two aligned amplicons. If there is only 1 dissimilarity (default parameter, which can be modified by the user) and if the ratio is lower than a threshold set by the user, the less abundant amplicon is considered a variant of the more abundant one.

Sequences that give rise to variants without being considered variants themselves are tagged "head". The variants are tagged "internal". The remaining sequences are tagged "singleton".

By conserving only the sequences tagged "head", most erroneous sequences are eliminated.
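The head/internal/singleton logic can be sketched as follows. This is a toy version with substitution-only comparisons; the real **obiclean** aligns amplicons and therefore also handles indels:

```python
# Toy sketch of the obiclean classification: an amplicon is a variant of a more
# abundant one if they differ by at most 1 position and their abundance ratio
# is below the threshold r.

def n_differences(a, b):
    if len(a) != len(b):
        return max(len(a), len(b))  # simplification: no real alignment
    return sum(x != y for x, y in zip(a, b))

def obiclean_status(amplicons, r=0.05):
    """amplicons: dict {sequence: abundance}. Returns {sequence: status}."""
    variants, heads = set(), set()
    for s1, n1 in amplicons.items():
        for s2, n2 in amplicons.items():
            if s1 != s2 and n_differences(s1, s2) <= 1 and n1 / n2 < r:
                variants.add(s1)   # s1 looks like a PCR error of s2
                heads.add(s2)      # s2 is at the origin of a variant
    return {s: "internal" if s in variants
            else "head" if s in heads
            else "singleton"
            for s in amplicons}
```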

The following line is run in a bash terminal, after the R pre-processing steps:
```
obiclean -r 0.05 -H Aquarium_2.fasta > Aquarium_2.clean.fasta
# here, the command returns only the sequences tagged "head" by the algorithm, and the chosen ratio is 0.05
```

For more details on this **OBITools** processing step, see the original publication [here](https://doi.org/10.1111/1755-0998.12428).

<a name="step22"></a>
### IV - 2 - dada2 processing step (Pipelines B)

The **dada2** function used in pipelines B is **learnErrors**. This function is able to distinguish the incorrect sequences from the correct sequences generated during amplification and sequencing, by estimating the sequencing error rate.
To build the error model, the function alternates estimation of the error rate and inference of sample composition until they converge on a jointly consistent solution.

The algorithm calculates an abundance p-value for each sequence. This p-value is defined by a Poisson distribution, with a parameter corresponding to the expected number of amplicons of a sequence i generated by error from a sequence j.

Before that, a partition is built with the most abundant sequence as its core, and all the other sequences are compared to this core. The sequence with the smallest p-value is examined: if this p-value is lower than a parameter of the algorithm (OMEGA_A), this sequence becomes the core of a new partition, and the other sequences join the partition most likely to have produced them. This operation is repeated until no p-value falls below OMEGA_A.

Then, all the sequences of a partition are replaced by their core, so each partition corresponds to a unique sequence: the ASV (Amplicon Sequence Variant).
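The abundance p-value can be sketched with a Poisson tail probability. This is a simplification of the dada2 statistic, which also conditions on the sequence being observed at least once:

```python
# Sketch of the abundance p-value: probability of observing at least `a` reads
# of a sequence under a Poisson error model with rate `lam`, conditioned on
# the sequence being observed at least once.

import math

def abundance_pvalue(a, lam):
    pmf = lambda k: math.exp(-lam) * lam ** k / math.factorial(k)
    p_at_least_a = 1.0 - sum(pmf(k) for k in range(a))
    p_at_least_1 = 1.0 - pmf(0)
    return p_at_least_a / p_at_least_1
```

A sequence observed far more often than the error model explains (a large abundance `a` for a small rate `lam`) gets a tiny p-value, and founds a new partition when that p-value falls below OMEGA_A.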

The following lines are run in R, directly following the pre-processing steps:
```
err <- learnErrors(derep[k], randomize=T)
# builds the error model; k is the index of the sample being processed
dadas <- dada(derep[k], err)
# eliminates the false sequences identified by the model, to only conserve ASVs
seqtab <- makeSequenceTable(dadas)
# constructs a sequence table with the filtered sequences
uniqueSeqs <- getUniques(seqtab)
uniquesToFasta(uniqueSeqs, paste0("PipelineB_", sample.names[k], ".fasta"))
# creates a new ".fasta" file containing the ASVs
```

For more details on this **dada2** processing step, see the original publication [here](https://doi.org/10.1038/nmeth.3869).

<a name="step23"></a>
### IV - 3 - swarm processing step (Pipelines C)

In pipelines C, **swarm** gathers the sequences into OTUs (Operational Taxonomic Units). First, sequences are pairwise aligned to count the number of dissimilarities between them. A threshold d is chosen by the user, and when the number of dissimilarities is less than or equal to d, both sequences are gathered into the same OTU. This process is repeated to add each sequence iteratively to an OTU, and the most abundant sequence of each OTU is chosen as its representative. The abundance of the OTU is obtained by adding the abundances of all the sequences it includes.
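The clustering idea can be sketched as follows. This is a toy, substitution-only version; the real **swarm** also handles indels and uses an efficient d = 1 neighbor search instead of all-pairs comparisons:

```python
# Toy sketch of swarm clustering: starting from the most abundant amplicon,
# an OTU iteratively absorbs any sequence within d differences of one of its
# members; the most abundant member is the OTU representative.

def within_d(a, b, d=1):
    if len(a) != len(b):
        return False  # simplification: substitutions only
    return sum(x != y for x, y in zip(a, b)) <= d

def swarm_otus(amplicons, d=1):
    """amplicons: list of (sequence, abundance). Returns a list of OTU dicts."""
    otus = []
    for seq, ab in sorted(amplicons, key=lambda x: -x[1]):
        for otu in otus:
            if any(within_d(seq, member, d) for member in otu["members"]):
                otu["members"].append(seq)
                otu["abundance"] += ab
                break
        else:
            otus.append({"representative": seq, "members": [seq], "abundance": ab})
    return otus
```

Note how a chain of single differences (AAAA → AAAT → AATT) ends up in one OTU even though its extremes differ by more than d.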

The following line runs the algorithm in a bash terminal:
```
swarm -z -d 1 -o stats_Aquarium_2.txt -w Aquarium_2.clustered.fasta < Aquarium_2.fasta
# the "-z" option accepts the abundance in the header, provided that there is no space in the header and that the value is preceded by "size="
# "-d" is the maximal number of differences tolerated between 2 sequences gathered in the same OTU (1 here)
# the "-o" option returns a ".txt" file in which each line corresponds to an OTU, listing all the amplicons belonging to it
# the "-w" option gives a ".fasta" file with the representative sequence of each OTU
```
An option called **fastidious** can be added with **-f**, in order to integrate small OTUs into larger related OTUs. We don't use it here because it doesn't change the output at all in our study.

For more details on this **swarm** processing step, see the original publication [here](https://doi.org/10.7717/peerj.1420).

<a name="step24"></a>
### IV - 4 - swarm + lulu processing step (Pipelines D)

For pipelines D, the same **swarm** algorithm as in pipelines C is used, with an additional post-clustering step run with the **lulu** algorithm.

**lulu** eliminates some OTUs by merging them into closely related, more abundant OTUs. The algorithm requires the OTU table produced by **swarm** and an OTU match list providing the pairwise similarity scores of the OTUs, with a minimum sequence similarity threshold set at 84% as recommended by the authors. Only in OTU pairs with a sequence similarity above 84% can the more abundant OTU be interpreted as the “parent” and the other as its “daughter”.

As recommended by the authors, the following line, run in a bash terminal with the **vsearch** program, generates the OTU match list:
```
vsearch --usearch_global Aquarium_2.fasta --db Aquarium_2.fasta --self --id .84 --iddef 1 --userout match_list_Aquarium_2.txt --userfields query+target+id --maxaccepts 0 --query_cov .9 --maxhits 10
```

The two OTUs are then merged, provided that the co-occurrence pattern of the pair among samples is higher than 95% and that the abundance ratio between the “potential parent” and the “potential daughter” is higher than a minimum ratio, set by default to the minimum observed ratio.
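The decision for one candidate pair can be sketched as follows, with hypothetical per-sample counts; the real **lulu** implementation works on full OTU tables and ranks candidate parents:

```python
# Sketch of lulu's merge criterion for one OTU pair: the "daughter" is merged
# into the "parent" if their similarity is at least 84%, the parent is present
# in at least 95% of the samples where the daughter occurs, and the
# parent/daughter abundance ratio stays above a minimum ratio.

def should_merge(parent_counts, daughter_counts, similarity,
                 min_similarity=84.0, min_cooccurrence=0.95, min_ratio=1.0):
    """parent_counts/daughter_counts: per-sample read counts, same sample order."""
    if similarity < min_similarity:
        return False
    daughter_samples = [i for i, c in enumerate(daughter_counts) if c > 0]
    shared = [i for i in daughter_samples if parent_counts[i] > 0]
    if not daughter_samples or len(shared) / len(daughter_samples) < min_cooccurrence:
        return False
    return min(parent_counts[i] / daughter_counts[i] for i in shared) > min_ratio
```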

The following lines, run in R, process the post-clustering curation:
```
library("lulu")
library("seqinr")
# "seqinr" provides the read.fasta function

OTUtable <- read.fasta("Aquarium_2.clustered.fasta")
matchlist <- read.table("match_list_Aquarium_2.txt")
# prepares the files needed for lulu processing
# note: the lulu function expects the OTU table as a data frame with OTUs as rows and samples as columns

curated_results <- lulu(OTUtable, matchlist)
# lulu processing with the lulu R function

curated_results
# shows the OTU names and their abundances after the curation
```

For more details on **lulu**, see the original publication [here](https://doi.org/10.1038/s41467-017-02012-2).

<a name="step3"></a>
## V - Abundance filtering step

In order to manipulate less data and to eliminate a large number of erroneous sequences, an abundance filtering is applied at this point. We use the **obigrep** command from the **OBITools** to eliminate sequences with an abundance lower than 10, with the following line:
```
obigrep -p 'count>=10' Aquarium_2.fasta > Aquarium_2.grep.fasta
# the "-p 'count>=10'" option eliminates sequences with an abundance lower than 10
```

In our study, this step eliminated a considerable number of sequences, without eliminating any true haplotype in the aquarium experiment.

<a name="step4"></a>
## VI - Post-processing steps

<a name="step41"></a>
### VI - 1 - No post-processing step (Pipelines A1/B1/C1/D1)

After the key processing step, you can decide to stop your pipeline here, run no further program, and directly analyze your results.

<a name="step42"></a>
### VI - 2 - Bimeric sequences removal (Pipelines A2/B2/C2/D2)

Before this step, it is necessary to transform the .fasta file returned by **obigrep**, which has this form:

```
>sq1;size=7180;
GTATTAAAACCATTTTAATGATTTAAACCAATCAAGTCCGAAATCCATTGAAACCCCAGAAAACAGGACAGATAAAAAAGAAGACTCAAATAAGTACGAAATAGAGAGAGGTACAGAAATAGAACTGATGACCGCTAGCGATTTATTAATCAGATTAAACGTGTCCGCCCGCATGCAACCAGGCATCCCCATCCCTAGTCCCTAAACAGAAACACGCAGTAAGAACCTACCATC
>sq2;size=6514;
GTATTAAAACCATTTTAATGATTTAAACCAATCAAGTCCGAAATCCATTGAAGTCCCAGAAAACAGGATAGATAAAAAAGAAGAACTCAAATAAGTACGAAATAGGGAGAAGTACAGAAATAGAACTGATGACCGCTAGCGATTTATTAATCAGATTAAACGTGTCCGCCCGCATGCAACCAGGCATCCCCATCCCTAGTCCCTAAACAGAAACACGCAGTAAGAACCTACCATC
>sq3;size=6290;
GTATTAAAACCATTTTAATGATTTAAACCAATCAAGTCCGAAATCCATTGAAACCCCAGAAAACAGGACAGATAAAAAAGAAGACTCAAATAAGTACGAAATAGAGAGAGGTACAGAAATAGAACTGATGACCGCTAGCGATTTATTAATCAGATTAAACGTGTCCACCCGCATGCAACCAGGCATCCCCATCCCTAGTCCTTAAACAGAAACACGCAGTAAGAACCTACCATC
```

into a .txt file of this form:

```
sequence	abundance
GTATTAAAACCATTTTAATGATTTAAACCAATCAAGTCCGAAATCCATTGAAACCCCAGAAAACAGGACAGATAAAAAAGAAGACTCAAATAAGTACGAAATAGAGAGAGGTACAGAAATAGAACTGATGACCGCTAGCGATTTATTAATCAGATTAAACGTGTCCGCCCGCATGCAACCAGGCATCCCCATCCCTAGTCCCTAAACAGAAACACGCAGTAAGAACCTACCATC	7180
GTATTAAAACCATTTTAATGATTTAAACCAATCAAGTCCGAAATCCATTGAAGTCCCAGAAAACAGGATAGATAAAAAAGAAGAACTCAAATAAGTACGAAATAGGGAGAAGTACAGAAATAGAACTGATGACCGCTAGCGATTTATTAATCAGATTAAACGTGTCCGCCCGCATGCAACCAGGCATCCCCATCCCTAGTCCCTAAACAGAAACACGCAGTAAGAACCTACCATC	6514
GTATTAAAACCATTTTAATGATTTAAACCAATCAAGTCCGAAATCCATTGAAACCCCAGAAAACAGGACAGATAAAAAAGAAGACTCAAATAAGTACGAAATAGAGAGAGGTACAGAAATAGAACTGATGACCGCTAGCGATTTATTAATCAGATTAAACGTGTCCACCCGCATGCAACCAGGCATCCCCATCCCTAGTCCTTAAACAGAAACACGCAGTAAGAACCTACCATC	6290
```
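This conversion can be scripted; a sketch assuming each record's sequence sits on a single line:

```python
# Sketch of the fasta-to-table conversion: read ">id;size=N;" headers and emit
# a tab-separated "sequence<TAB>abundance" table.

import re

def fasta_to_table(fasta_text):
    rows = ["sequence\tabundance"]
    abundance = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            abundance = re.search(r"size=(\d+)", line).group(1)
        elif line.strip():
            rows.append(line.strip() + "\t" + abundance)
    return "\n".join(rows) + "\n"
```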

Definition: _We call chimeric sequences, or PCR-mediated recombinants, sequences built by the merging of different closely related DNA templates during PCR. By extension, we call bimeras the two-parent chimeric sequences._

For pipelines A2, B2, C2 and D2, sequences considered as bimeras, or two-parent chimeras, are removed using the **removeBimeraDenovo** function from DADA2. This function mostly points out bimeras by aligning each sequence with all more abundant sequences and detecting a combination of an exact “right parent” and an exact “left parent” of this sequence.
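The core of the bimera test can be sketched as follows. This is a strict toy version; **removeBimeraDenovo** tolerates small disagreements with the parents and is far more efficient:

```python
# Toy bimera check: a query is flagged if it can be split so that its left part
# is a prefix of one more abundant sequence and its right part is a suffix of
# another.

def is_bimera(query, parents):
    for left in parents:
        for right in parents:
            if left == right:
                continue
            for i in range(1, len(query)):
                if left.startswith(query[:i]) and right.endswith(query[i:]):
                    return True
    return False
```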

You can remove the sequences considered as bimeras by directly creating a new table, and repeating the same functions as before to create a new .fasta file:
```
tab <- read.table("Aquarium_2.txt", header=T)
seqtab_1 <- makeSequenceTable(tab)
seqtab_2 <- removeBimeraDenovo(seqtab_1, verbose=T)
# processes the bimera removal

uniqueSeqs <- getUniques(seqtab_2)
uniquesToFasta(uniqueSeqs, paste0(sample.names, ".fasta"))
# creates the new file without bimeras
```

For more details on this **dada2** bimera removal step, see the original publication [here](https://doi.org/10.1038/nmeth.3869).
<a name="step43"></a>
### VI - 3 - Chimeric sequences removal (Pipelines A3/B3/C3/D3)

For pipelines A3, B3, C3 and D3, chimeras are removed using the **uchime3_denovo** command from the **vsearch** program. This command is based on the **uchime2** algorithm. Each sequence is divided into four segments, and the command searches for similarity between each segment and all the other sequences using a heuristic method. The best potential parent sequences are then selected, and the query sequence is considered a chimera or not according to a set of default parameters.

The single following line runs this algorithm and returns the data without chimeras:
```
vsearch --uchime3_denovo Aquarium_2.fasta --nonchimeras Aquarium2_uchime3.fasta
```

For more details on this **vsearch** chimera removal step, see the original UCHIME2 publication [here](https://doi.org/10.1101/074252).

<a name="step5"></a>
## VII - Analyse your results

Now you can run a statistical analysis to evaluate the quality of your filtering, by comparing the amplicons returned by each pipeline with your reference dataset.