README.md 9.82 KB
Newer Older
Bastien Macé's avatar
Bastien Macé committed
1
2
# swarm_and_obitools

Bastien MACE's avatar
Bastien MACE committed
3
4
**Bastien Macé, 2020**

Bastien MACE's avatar
Bastien MACE committed
5
6
7
8
9
10
11
12
13
_________________________________


# Table of contents

  * [Introduction](#introduction)
  * [Installation](#installation)
    + [Preliminary steps for OBITools](#preliminary-steps-for-obitools)
    + [Preliminary steps for swarm](#preliminary-steps-for-swarm)
Bastien Macé's avatar
Bastien Macé committed
14
  * [STEP 1 : Pair-ended merging (OBITools)](#step1)
Bastien MACE's avatar
Bastien MACE committed
15
16
17
18
19
  * [STEP 2 : Demultiplexing (OBITools)](#step2)
  * [STEP 3 : Dereplication (OBITools)](#step3)
  * [STEP 4 : Filtering (OBITools)](#step4)
  * [STEP 5 : Elimination of PCR errors (OBITools)](#step5)
  * [STEP 6 : Taxonomic assignment (OBITools)](#step6)
Bastien MACE's avatar
Bastien MACE committed
20
  * [STEP 7 : Gathering in OTUs (swarm)](#step7)
Bastien MACE's avatar
Bastien MACE committed
21
22
23
24
  * [STEP 8 : Analyse your results](#step8)

_________________________________

Bastien MACE's avatar
Bastien MACE committed
25
26
## Introduction

Bastien MACE's avatar
Bastien MACE committed
27
28
29
30
31
32
33
This project is based on the idea that gathering similar sequences allows to faithfully study them by eliminating sequences generated from PCR or NGS errors.

For that, we will use the OBITools commands and swarm.

- [OBITools](https://git.metabarcoding.org/obitools/obitools/wikis/home) are commands written in python
- [swarm](https://github.com/torognes/swarm) is a command written in C++ and which can be used with a Unix shell

Bastien Macé's avatar
Bastien Macé committed
34
In this example, two datasets are used because the study analyzes the result of a pair-end sequencing (Example of filtrated eDNA from aquarium seawater).
Bastien MACE's avatar
Bastien MACE committed
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89

## Installation

### Preliminary steps for OBITools

- First you need to have Anaconda installed

If it's not the case, click on this [link](https://www.anaconda.com/products/individual/get-started) and download it.

Install the download in your shell :
```
bash Anaconda3-2020.07-Linux-x86_64.sh
```

Then, close your shell and reopen it.
Verify conda is correctly installed. It should be here :
```
~/anaconda3/bin/conda
```

Write the following line :
```
conda config --set auto_activate_base false
```

- Create your new environment obitools from your root in your corresponding path. For example :
```
ENVYAML=./dada2_and_obitools/obitools_env_conda.yaml
conda env create -f $ENVYAML
```

Now you can activate your environment :
```
conda activate obitools
```
And deactivate it :
```
conda deactivate
```

### Preliminary steps for swarm

- Get the compressed packaged on the [creator GitHub](https://github.com/torognes/swarm) in your downloads folder and install it :
```
git clone https://github.com/torognes/swarm.git
cd swarm/
make
```

- Copy the binary to make the command accessible for all users :
```
cp -r ./bin/swarm /usr/local/bin
```

<a name="step1"></a>
Bastien Macé's avatar
Bastien Macé committed
90
## STEP 1 : Pair-ended merging
Bastien MACE's avatar
Bastien MACE committed
91
92
93
94
95
96
97
98
99
100
101

First, unzip your data in your shell if you need :
```
unzip mullus_surmuletus_data.zip
```

Activate your environment in your shell :
```
conda activate obitools
```

Bastien Macé's avatar
Bastien Macé committed
102
Use the command _illuminapairedend_ to make the pair-ended merging from the forward and reverse strands of the sequences you have in your data. The command aligns the complementary strands in order to get a longer sequence. In fact, after PCR, the last bases are rarely correctly sequenced. So having the forward and the reverse strands allows to lenghten the sequence, thanks to the beginning of the reverse strand, which is usually correctly sequenced.
Bastien MACE's avatar
Bastien MACE committed
103
```
Bastien Macé's avatar
Bastien Macé committed
104
105
106
illuminapairedend --score-min=40 -r mullus_surmuletus_data/Aquarium_2_R1.fastq mullus_surmuletus_data/Aquarium_2_R2.fastq > Aquarium_2.fastq
# a new .fastq file is created, it contains the sequences after the merging of forward and reverse strands
# alignments which have a quality score higher than 40 (-- score-min=40) are merged and annotated "aligned", while alignemnts with a lower quality score are concatenated and annotated "joined"
Bastien MACE's avatar
Bastien MACE committed
107
108
```

Bastien Macé's avatar
Bastien Macé committed
109
To only conserve the sequences which have been merged, use _obigrep_ :
Bastien MACE's avatar
Bastien MACE committed
110
```
Bastien Macé's avatar
Bastien Macé committed
111
obigrep -p 'mode!="joined"' Aquarium_2.fastq > Aquarium_2.ali.fastq
Bastien MACE's avatar
Bastien MACE committed
112
# -p requires a python expression
Bastien Macé's avatar
Bastien Macé committed
113
# python creates a new dataset (.ali.fastq) which only contains the sequences annotated "aligned"
Bastien MACE's avatar
Bastien MACE committed
114
115
116
```

<a name="step2"></a>
Bastien Macé's avatar
Bastien Macé committed
117
## STEP 2 : Demultiplexing
Bastien MACE's avatar
Bastien MACE committed
118

Bastien Macé's avatar
Bastien Macé committed
119
A _.txt_ file assigns each sequence to its sample thanks to its tag, because each tag corresponds to a reverse or a forward sequence from a sample.
Bastien MACE's avatar
Bastien MACE committed
120

Bastien Macé's avatar
Bastien Macé committed
121
To compare the sequences next, you need to remove the tags and the primers, by using the _ngsfilter_ command :
Bastien MACE's avatar
Bastien MACE committed
122
```
Bastien Macé's avatar
Bastien Macé committed
123
124
125
126
ngsfilter -t mullus_surmuletus_data/Med_corr_tags.txt -u Aquarium_2.unidentified.fastq Aquarium_2.ali.fastq > Aquarium_2.ali.assigned.fastq
# the command creates new files :
# ".unidentified.fastq" file contains the sequences that were not assigned whith a correct tag
# ".ali.assigned.fastq" file contains the sequences that were assigned with a correct tag, so they contain only the barcode sequences
Bastien MACE's avatar
Bastien MACE committed
127
128
129
130
131
132
```

Then, separate your _.ali.assigned.fastq_ files depending on their samples in placing them in a dedicated folder (useful for next steps) :
```
mkdir samples
# create the folder
Bastien Macé's avatar
Bastien Macé committed
133
134
mv -t samples Aquarium_2.ali.assigned.fastq
# place the latests ".fastq" files in the folder
Bastien MACE's avatar
Bastien MACE committed
135
cd samples
Bastien Macé's avatar
Bastien Macé committed
136
137
138
139
obisplit -t samples --fastq sample/Aquarium_2.ali.assigned.fastq
# separate the files depending on their samples
mv -t ./swarm_and_obitools Aquarium_2.ali.assigned.fastq
# remove the original files from the folder
Bastien MACE's avatar
Bastien MACE committed
140
141
```

Bastien Macé's avatar
Bastien Macé committed
142
Now you have as many files as samples, containing merged pair-ended and demultiplexed sequences.
Bastien MACE's avatar
Bastien MACE committed
143
144

<a name="step3"></a>
Bastien Macé's avatar
Bastien Macé committed
145
## STEP 3 : Dereplication
Bastien MACE's avatar
Bastien MACE committed
146

Bastien Macé's avatar
Bastien Macé committed
147
Now that you have the sequences corresponding to the barcode you want to study, dereplicate them to only conserve the amplicons with their abundance stored in the header with _obiuniq_ :
Bastien MACE's avatar
Bastien MACE committed
148
149
150
151
```
obiuniq Aquarium_2.fastq > Aquarium_2.uniq.fasta
```

Bastien Macé's avatar
Bastien Macé committed
152
153
This command also transforms _fastq_ files into fasta format.

Bastien MACE's avatar
Bastien MACE committed
154
<a name="step4"></a>
Bastien Macé's avatar
Bastien Macé committed
155
## STEP 4 : Filtering
Bastien MACE's avatar
Bastien MACE committed
156
157
158
159

The _obigrep_ command filters the sequences according to different criteria which you can chose, such as the sequence length, or the abundance of the amplicons :
```
obigrep -l 20 -p 'count>=10' Aquarium_2.uniq.fasta > Aquarium_2.grep.fasta
Bastien Macé's avatar
Bastien Macé committed
160
161
# "-l 20" option eliminates sequences with a length shorter than 20 bp
# "-p 'count>=10'" option eliminates sequences with an abundance inferior to 10
Bastien MACE's avatar
Bastien MACE committed
162
163
164
```

<a name="step5"></a>
Bastien Macé's avatar
Bastien Macé committed
165
## STEP 5 : Elimination of PCR errors
Bastien MACE's avatar
Bastien MACE committed
166

Bastien Macé's avatar
Bastien Macé committed
167
_obiclean_ is a command which eliminates punctual errors caused during PCR. The algorithm makes parwise alignments for all the amplicons. It counts the number of dissimilarities between the  amplicons, and calculates the ratio between the abundance of the two amplicons aligned. If there is only one dissimilarity (parameter by default, can be modified) and if the ratio is lower than a chosen threshold, the less abundant amplicon is considered as a variant of the most abundant one.
Bastien MACE's avatar
Bastien MACE committed
168
169
170
171
172

Sequences which are at the origin of variants without being considered as one are tagged "head". The variants are tagged "internal". The other sequences are tagged "singleton".

```
obiclean -r 0.05 -H Aquarium_2.grep.fasta > Aquarium_2.clean.fasta
Bastien Macé's avatar
Bastien Macé committed
173
# here, the command returns only the sequences tagged "head" by the algorithm, and the chosen ratio is 0.05
Bastien MACE's avatar
Bastien MACE committed
174
175
176
```

<a name="step6"></a>
Bastien Macé's avatar
Bastien Macé committed
177
178
179
## STEP 6 : Taxonomic assignment

_ecotag_ is a command which permits to assign each head amplicon to its corresponding taxon. It requires to first having used [ecoPCR](https://git.metabarcoding.org/obitools/ecopcr/wikis/home) with your primers used to amplify your sequences. This command have given a file containing the taxons which can potentially be amplified by the selected primers. _ecotag_ permits to assign your sequences to one of these taxons, with a minimum similarity score fixed at a chosen value, and so to be sure that your sequences come from the correct taxon. However, this step is optionnal is your primers are specific enough.
Bastien MACE's avatar
Bastien MACE committed
180
181
182

```
ecotag -m 0.5 -d ./only_obitools/embl_std -R ./only_obitools/base_ref_finale_formated.fasta Aquarium_2.clean.fasta > Aquarium_2.tag.fasta
Bastien Macé's avatar
Bastien Macé committed
183
# only the sequences with a similarity score higher than 0.95 are annotated to their corresponding taxon
Bastien MACE's avatar
Bastien MACE committed
184
185
186
187
188
```

Then, after a selection of the amplicons corresponding to your studied taxon, you can eliminate the non-interesting attributes. Here, we only conserve the amplicons abundance :
```
obiannotate -k count Aquarium_2.tag.fasta > Aquarium_2.tag_1.fasta
Bastien Macé's avatar
Bastien Macé committed
189
# only the attribute "count" is conserved
Bastien MACE's avatar
Bastien MACE committed
190
191
192
```

<a name="step7"></a>
Bastien MACE's avatar
Bastien MACE committed
193
## STEP 7 : Gathering in OTUs (swarm)
Bastien MACE's avatar
Bastien MACE committed
194

Bastien MACE's avatar
Bastien MACE committed
195
196
197
198
199
200
201
_swarm_ gathers the sequences in OTU thanks to this algorithm :
- First, sequences are pairwise aligned to count the number of dissimilarities between them
- A threshold _d_ is chosen, when the number of dissimilarities is inferior or equal to _d_, both sequences are gathered in a same OTU
- This process is repeated to add iteratively the sequences to an OTU
- The most abundant sequence of each OTU is chosen to represent the OTU
- The abundance of the OTU is constituted by adding the abundances of each sequence included in the OTU

Bastien MACE's avatar
Bastien MACE committed
202
```
Bastien Macé's avatar
Bastien Macé committed
203
swarm -z -d 1 -o stats_Aquarium_2.txt -w Aquarium_2.end.fasta < Aquarium_2.tag_1.fasta
Bastien MACE's avatar
Bastien MACE committed
204
205
206
207
# "-z" option permits to accept the abundance in the header, provided that there is no space in the header and that the value is preceded by "size="
# "-d" is the maximal number of differences tolerated between 2 sequences to be gathered in the same OTU
# "-o" option returns a ".txt" file in which each line corresponds to an OTU with all the amplicons belonging to this OTU
# "-w" option gives a "fasta" file with the representative sequence of each OTU
Bastien MACE's avatar
Bastien MACE committed
208
```
Bastien MACE's avatar
Bastien MACE committed
209
An option called _fastidious_ can be added, with  _-f_, in order to integrate small OTUs in larger related OTUs. We don't use it here because it doesn't change the output at all for this example. 
Bastien MACE's avatar
Bastien MACE committed
210
211
212
213

<a name="step8"></a>
## STEP 8 : Analyse your results

Bastien MACE's avatar
Bastien MACE committed
214
215
216
217
218
219
220
Then, you can filtrate your OTUs and the amplicons present in it.

For example, _Elbrecht et al._ recommend in their [publication](https://peerj.com/articles/4644/) to eliminate :
- OTUs with an abundance inferior to 0.01%
- Amplicons with an abundance inferior to 0.003%
- Amplicons with a relative abundance inferior to 5% in their OTU

Bastien MACE's avatar
Bastien MACE committed
221
Now you can make a statistical analysis to evaluate your filtering quality, after comparing the OTUs returned by the pipeline with your reference dataset.