This project is based on the idea that grouping similar sequences makes it possible to study them faithfully, by eliminating sequences generated by PCR or NGS errors.
For that, we will use the OBITools commands and swarm.
- [OBITools](https://git.metabarcoding.org/obitools/obitools/wikis/home) is a set of commands written in Python
- [swarm](https://github.com/torognes/swarm) is a command written in C++ that can be used from a Unix shell
In this example, 2 datasets are used, because the study analyzes the sequencing of 2 tiles.
## Installation
### Preliminary steps for OBITools
- First, you need to have Anaconda installed.
If it is not the case, download it from this [link](https://www.anaconda.com/products/individual/get-started).
Run the downloaded installer in your shell :
```
bash Anaconda3-2020.07-Linux-x86_64.sh
```
Then, close your shell and reopen it.
Verify conda is correctly installed. It should be here :
```
~/anaconda3/bin/conda
```
Then disable the automatic activation of the base environment :
```
conda config --set auto_activate_base false
```
- Create your new environment obitools from your root in your corresponding path.
- Get the _swarm_ sources from the [creator's GitHub](https://github.com/torognes/swarm) in your downloads folder and install them :
```
git clone https://github.com/torognes/swarm.git
cd swarm/
make
```
- Copy the binary to make the command accessible for all users :
```
cp -r ./bin/swarm /usr/local/bin
```
<a name="step1"></a>
## STEP 1 : Paired-end sequencing
First, unzip your data in your shell if needed :
```
unzip mullus_surmuletus_data.zip
```
Activate your environment in your shell :
```
conda activate obitools
```
Use the _illuminapairedend_ command to merge the forward and reverse reads of your data. In other words, the command aligns the two complementary strands in order to obtain a longer sequence. During sequencing, the last bases of a read are rarely sequenced correctly, so having both the forward and the reverse strands makes it possible to lengthen the sequence thanks to the beginning of the reverse strand, which is usually sequenced correctly.
A new .fastq file is created. It contains the sequences obtained after pairing the forward and reverse reads with an alignment score higher than 40 (`--score-min=40`).
To keep only the sequences which have been aligned, use _obigrep_.
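As a minimal sketch of these two commands, assuming one pair of forward and reverse files: every file name below is hypothetical and must be replaced with your own. The `mode!="joined"` test relies on _illuminapairedend_ tagging the reads it could not align as "joined".

```shell
# Hypothetical file names: replace them with your own reads.
FORWARD="sample_R1.fastq"
REVERSE="sample_R2.fastq"
MERGED="sample.merged.fastq"
ALIGNED="sample.aligned.fastq"

if command -v illuminapairedend >/dev/null 2>&1; then
    # Merge the forward and reverse reads with a minimum alignment score of 40
    illuminapairedend --score-min=40 -r "$REVERSE" "$FORWARD" > "$MERGED"
    # Keep only the reads that were really aligned (not simply joined)
    obigrep -p 'mode!="joined"' "$MERGED" > "$ALIGNED"
else
    echo "illuminapairedend not found: activate the obitools environment first" >&2
fi
```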
Now you have as many files as samples, containing paired-end and demultiplexed sequences.
<a name="step3"></a>
## STEP 3 : Dereplication
Now that you have the sequences corresponding to the barcode you want to study, dereplicate them so that each distinct amplicon is kept only once, with its abundance stored in the header :
```
obiuniq Aquarium_2.fastq > Aquarium_2.uniq.fasta
```
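To see what dereplication means, here is a toy sketch using only standard Unix tools. It is not how _obiuniq_ is implemented: the `derep` function is hypothetical, the reads are made up, and it assumes each sequence is written on a single line.

```shell
derep() {
    # Reads single-line FASTA on stdin and prints "count<TAB>sequence",
    # with the most abundant sequence first.
    awk '!/^>/ && NF { n[$0]++ } END { for (s in n) printf "%d\t%s\n", n[s], s }' |
        sort -k1,1nr -k2,2
}

# Made-up reads: ACGT appears three times, TTGA once.
printf '>r1\nACGT\n>r2\nACGT\n>r3\nTTGA\n>r4\nACGT\n' | derep
```

The four reads collapse into two amplicons, ACGT with an abundance of 3 and TTGA with an abundance of 1.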
<a name="step4"></a>
## STEP 4 : Filtering
The _obigrep_ command filters the sequences according to different criteria which you can choose, such as the sequence length or the abundance of the amplicons. The options used here are :
- `-l 20` removes the sequences shorter than 20 bp
- `-p 'count>=10'` removes the sequences with an abundance lower than 10
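The effect of these two criteria can be sketched with standard tools. This is only an illustration, not the _obigrep_ implementation: the `filter_amplicons` function is hypothetical, and it assumes single-line sequences whose header stores the abundance as `count=N`.

```shell
filter_amplicons() {
    # $1 = minimum length, $2 = minimum abundance; single-line FASTA on stdin.
    awk -v minlen="$1" -v mincount="$2" '
        /^>/ {
            header = $0
            count = 0
            # Extract the number following "count=" in the header, if any
            if (match($0, /count=[0-9]+/))
                count = substr($0, RSTART + 6, RLENGTH - 6) + 0
            next
        }
        # Keep the record only if both criteria are met
        length($0) >= minlen && count >= mincount { print header; print $0 }
    '
}

# Made-up amplicons: only "a" is both long enough and abundant enough.
printf '>a count=12;\nACGTACGTACGTACGTACGTACGT\n>b count=3;\nACGTACGTACGTACGTACGTACGT\n>c count=50;\nACGT\n' |
    filter_amplicons 20 10
```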
<a name="step5"></a>
## STEP 5 : Elimination of PCR errors
_obiclean_ is a command which eliminates point errors introduced during PCR. The algorithm makes pairwise alignments of all the amplicons. It counts the number of differences between two amplicons and calculates the ratio between their abundances. If there is only one difference (default parameter, which can be modified) and if the ratio is lower than a chosen threshold, the less abundant amplicon is considered as a variant of the most abundant one.
Sequences which are at the origin of variants without being variants themselves are tagged "head". The variants are tagged "internal". The other sequences are tagged "singleton".
Here, the command returns only the sequences tagged "head" by the algorithm, and the ratio threshold is set to 0.05.
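The decision rule described above can be sketched for a single pair of amplicons. This is a simplified illustration, not the _obiclean_ code: the `is_variant` function and its arguments are hypothetical, and the number of differences is supposed to be already known.

```shell
is_variant() {
    # $1 = abundance of the less abundant amplicon
    # $2 = abundance of the most abundant amplicon
    # $3 = number of differences between the two amplicons
    # $4 = ratio threshold
    # $5 = maximal number of differences (1 if omitted)
    awk -v low="$1" -v high="$2" -v diff="$3" -v ratio="$4" -v maxdiff="${5:-1}" 'BEGIN {
        if (diff <= maxdiff && low / high < ratio)
            print "variant"
        else
            print "not a variant"
    }'
}

is_variant 4 200 1 0.05    # 4/200 = 0.02 is below 0.05: prints "variant"
is_variant 40 200 1 0.05   # 40/200 = 0.2 is above 0.05: prints "not a variant"
is_variant 4 200 3 0.05    # 3 differences is too many: prints "not a variant"
```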
<a name="step6"></a>
## STEP 6 : Taxonomic assignment
_ecotag_ is the command which assigns each head amplicon to its corresponding taxon. The algorithm compares the amplicons with the sequences of the reference database. If the similarity score is higher than the chosen threshold, the amplicon is assigned to its "taxid" thanks to the taxonomy database.
Here, only the sequences with a similarity score higher than 0.5 are annotated.
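A possible call is sketched below. All the paths and file names are hypothetical, and it assumes that the taxonomy database is given with `-d`, the reference sequences with `-R` and the similarity threshold with `-m`; check `ecotag --help` in your own installation before relying on it.

```shell
# Hypothetical paths: replace them with your own databases and files.
TAXONOMY_DB="taxonomy_db"            # taxonomy database
REFERENCE_DB="reference_db.fasta"    # reference sequences
INPUT="Aquarium_2.clean.fasta"
OUTPUT="Aquarium_2.tag.fasta"

if command -v ecotag >/dev/null 2>&1; then
    # Annotate only the amplicons with a similarity score of at least 0.5
    ecotag -d "$TAXONOMY_DB" -R "$REFERENCE_DB" -m 0.5 "$INPUT" > "$OUTPUT"
else
    echo "ecotag not found: activate the obitools environment first" >&2
fi
```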
Then, after selecting the amplicons corresponding to your studied taxon, you can remove the attributes you do not need. Here, we only keep the abundance of the amplicons.
The amplicons are then gathered into OTUs with _swarm_, using the following options :
- `-z` accepts the abundance in the header, provided that there is no space in the header and that the value is preceded by "size="
- `-d` is the maximal number of differences tolerated between 2 sequences to be gathered in the same OTU
- `-o` returns a ".txt" file in which each line corresponds to an OTU with all the amplicons belonging to this OTU
- `-w` gives a "fasta" file with the representative sequence of each OTU
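Because the header must contain no space and must give the abundance after "size=", the headers produced by the previous steps may have to be rewritten first. The sketch below is one way to do it with standard tools: the `to_swarm_header` function is hypothetical, and it assumes single-line sequences whose header stores the abundance as `count=N`.

```shell
to_swarm_header() {
    # Turns ">id count=12; other=..." into ">id;size=12" (FASTA on stdin).
    awk '
        /^>/ {
            id = substr($1, 2)       # identifier without the leading ">"
            count = 1                # abundance of 1 if no count is found
            if (match($0, /count=[0-9]+/))
                count = substr($0, RSTART + 6, RLENGTH - 6) + 0
            print ">" id ";size=" count
            next
        }
        { print }                    # sequence lines are left unchanged
    '
}

# Made-up record: prints ">seq1;size=12" followed by the sequence.
printf '>seq1 count=12; merged_sample={}\nACGT\n' | to_swarm_header
```

The converted file could then be clustered with a call such as `swarm -z -d 1 -o otus.txt -w representatives.fasta input.fasta`, where the file names and the value of `-d` are only examples.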
<a name="step8"></a>
## STEP 8 : Analyse your results
You can now run a statistical analysis to evaluate the quality of your filtering, by comparing the OTUs returned by the pipeline with your reference dataset.