Commit 237fc13a authored by Bastien MACE's avatar Bastien MACE
Browse files

script redaction

parent aaad5f9b
......@@ -2,6 +2,205 @@
**Bastien Macé, 2020**
_________________________________
# Table of contents
* [Introduction](#introduction)
* [Installation](#installation)
+ [Preliminary steps for OBITools](#preliminary-steps-for-obitools)
+ [Preliminary steps for swarm](#preliminary-steps-for-swarm)
* [STEP 1 : Pair-end sequencing (OBITools)](#step1)
* [STEP 2 : Demultiplexing (OBITools)](#step2)
* [STEP 3 : Dereplication (OBITools)](#step3)
* [STEP 4 : Filtering (OBITools)](#step4)
* [STEP 5 : Elimination of PCR errors (OBITools)](#step5)
* [STEP 6 : Taxonomic assignment (OBITools)](#step6)
* [STEP 7 : Gathering in OTU (swarm)](#step7)
* [STEP 8 : Analyse your results](#step8)
_________________________________
## Introduction
This project is based on the idea that gathering similar sequences allows to faithfully study them by elminating sequences generated from PCR or NGS errors.
\ No newline at end of file
This project is based on the idea that gathering similar sequences allows to faithfully study them by eliminating sequences generated from PCR or NGS errors.
For that, we will use the OBITools commands and swarm.
- [OBITools](https://git.metabarcoding.org/obitools/obitools/wikis/home) are commands written in python
- [swarm](https://github.com/torognes/swarm) is a command written in C++ and which can be used with a Unix shell
In this example, 2 datasets are used, because the study analyzes the sequencing of 2 tiles.
## Installation
### Preliminary steps for OBITools
- First you need to have Anaconda installed
If it's not the case, click on this [link](https://www.anaconda.com/products/individual/get-started) and download it.
Install the download in your shell :
```
bash Anaconda3-2020.07-Linux-x86_64.sh
```
Then, close your shell and reopen it.
Verify conda is correctly installed. It should be here :
```
~/anaconda3/bin/conda
```
Write the following line :
```
conda config --set auto_activate_base false
```
- Create your new environment obitools from your root in your corresponding path. For example :
```
ENVYAML=./dada2_and_obitools/obitools_env_conda.yaml
conda env create -f $ENVYAML
```
Now you can activate your environment :
```
conda activate obitools
```
And deactivate it :
```
conda deactivate
```
### Preliminary steps for swarm
- Get the compressed packaged on the [creator GitHub](https://github.com/torognes/swarm) in your downloads folder and install it :
```
git clone https://github.com/torognes/swarm.git
cd swarm/
make
```
- Copy the binary to make the command accessible for all users :
```
cp -r ./bin/swarm /usr/local/bin
```
<a name="step1"></a>
## STEP 1 : Pair-end sequencing
First, unzip your data in your shell if you need :
```
unzip mullus_surmuletus_data.zip
```
Activate your environment in your shell :
```
conda activate obitools
```
Use the function _illuminapairedend_ to make the pair-end sequencing from the forward and reverse strands of the sequences you have in your data. In other words, the function aligns the complementary strands in order to get a longer sequence. In fact, during PCR, the last bases are rarely correctly sequenced. So having the forward and the reverse strands allows to lenghten the sequence, thanks to the beginning of the reverse strand, which is usually correctly sequenced.
```
illuminapairedend --score-min=40 -r mullus_surmuletus_data/Med_R1.fastq mullus_surmuletus_data/Med_R2.fastq > Med.fastq
illuminapairedend --score-min=40 -r mullus_surmuletus_data/Atl_R1.fastq mullus_surmuletus_data/Atl_R2.fastq > Atl.fastq
# a new .fastq file is created, it contains the sequences after the pair-end of forward and reverse sequences which have a quality score higher than 40 (-- score-min=40)
```
To only conserve the sequences which have been aligned, use _obigrep_ :
```
obigrep -p 'mode!="joined"' Med.fastq > Med.ali.fastq
obigrep -p 'mode!="joined"' Atl.fastq > Atl.ali.fastq
# -p requires a python expression
# the unaligned sequences are notified with mode="joined" whereas the aligned sequences are notified with mode="aligned"
# so here python creates new datasets (.ali.fastq) which only contain the sequences notified "aligned"
```
<a name="step2"></a>
## STEP 2 : Demultiplexing
The _.txt_ files assign each sequence to its sample thanks to its tag because each tag corresponds to a reverse or a forward sequence from a sample.
To be able to compare the sequences next, you need to remove tags and primers, and to use the function _ngsfilter_ :
```
ngsfilter -t mullus_surmuletus_data/Med_corr_tags.txt -u Med.unidentified.fastq Med.ali.fastq > Med.ali.assigned.fastq
ngsfilter -t mullus_surmuletus_data/Atl_corr_tags.txt -u Atl.unidentified.fastq Atl.ali.fastq > Atl.ali.assigned.fastq
# the function creates new files :
# ".unidentified.fastq" files contain the sequences that were not assigned whith a correct tag
# ".ali.assigned.fastq" files contain the sequences that were assigned with a correct tag, so they contain only the barcode sequences
```
Then, separate your _.ali.assigned.fastq_ files depending on their samples in placing them in a dedicated folder (useful for next steps) :
```
mkdir samples
# create the folder
mv -t samples Med.ali.assigned.fastq Atl.ali.assigned.fastq
# place the latests .fastq files in the folder
cd samples
obisplit -t samples --fastq sample/Med.ali.assigned.fastq
obisplit -t samples --fastq sample/Atl.ali.assigned.fastq
# separation of the files depending on their samples
mv -t ./dada2_and_obitools Med.ali.assigned.fastq Atl.ali.assigned.fastq
# removing the original files from the folder
```
Now you have as many files as samples, containing pair-ended and demultiplexed sequences.
<a name="step3"></a>
## STEP 3 : Dereplication
Now that you have the sequences corresponding to the barcode you want to study, dereplicate them to only conserve the amplicons with their abundance stored in the header :
```
obiuniq Aquarium_2.fastq > Aquarium_2.uniq.fasta
```
<a name="step4"></a>
## STEP 4 : Filtering
The _obigrep_ command filters the sequences according to different criteria which you can chose, such as the sequence length, or the abundance of the amplicons :
```
obigrep -l 20 -p 'count>=10' Aquarium_2.uniq.fasta > Aquarium_2.grep.fasta
# "-l 20" option filters sequences with a length shorter than 20 bp
# "-p "'count>=10'" option filters sequences with an abundance inferior to 10
```
<a name="step5"></a>
## STEP 5 : Elimination of PCR errors
_obiclean_ is a command which eliminates punctual errors caused during PCR. The algorithm makes parwise alignments for all the amplicons. It counts the number of dissimilarities between the amplicons, and calculates the ratio between the abundance of the 2 amplicons. If there is only one dissimilarity (parameter by default, but can be modified) and if the ratio is lower than a chosen threshold, the less abundant amplicon is considered as a variant of the most abundant one.
Sequences which are at the origin of variants without being considered as one are tagged "head". The variants are tagged "internal". The other sequences are tagged "singleton".
```
obiclean -r 0.05 -H Aquarium_2.grep.fasta > Aquarium_2.clean.fasta
# here, the command returns only the sequences tagged "head" by the algorithm, and the ratio retained is 0.05
```
<a name="step6"></a>
## STEP 6 : Taxonomic assignment
_ecotag_ is the command which permits to assign each head amplicon to its corresponding taxon. The algorithm compares the amplicons with the sequences from the reference database. If the similarity score is higher than the threshold chosen, the amplicon is assigned to its "taxid" thanks to the taxonomy database.
```
ecotag -m 0.5 -d ./only_obitools/embl_std -R ./only_obitools/base_ref_finale_formated.fasta Aquarium_2.clean.fasta > Aquarium_2.tag.fasta
# only the sequences with a similarity score higher than 0.5 are annotated
```
Then, after a selection of the amplicons corresponding to your studied taxon, you can eliminate the non-interesting attributes. Here, we only conserve the amplicons abundance :
```
obiannotate -k count Aquarium_2.tag.fasta > Aquarium_2.tag_1.fasta
# the only attribute "count" is conserved
```
<a name="step7"></a>
## STEP 7 : Gathering in OTU
swarm -z -d 1 -o stats_Aq2.txt -w pipeline3_Aq2.fasta < pipeline1_Aq2.tag_1.fasta
# "-z" option permits to accept the abundance in the header, provided that there is no space in the header and that the value is preceded by "size="
# "-d" is the maximal number of differences tolerated between 2 sequences to be gathered in the same OTU
# "-o" option returns a ".txt" file in which each line corresponds to an OTU with all the amplicons belonging to this OTU
# "-w" option gives a "fasta" file with the representative sequence of each OTU
<a name="step8"></a>
## STEP 8 : Analyse your results
Now you can make a statistical analysis to evaluate your filtering quality, after comparing the OTUs returned by the pipeline with your reference dataset.
cd ## once you have conda installed, close the shell, reopen it and paste this
## once you have conda installed, close the shell, reopen it and paste this
## following line :
conda config --set auto_activate_base false
......@@ -17,80 +17,117 @@ unzip mullus_surmuletus_data.zip
## activate your environment :
conda activate obitools
## use the function illuminapairedend to make the pair-end sequencing from
## use the function "illuminapairedend" to make the pair-end sequencing from
## the forward and reverse sequences you have in your data :
illuminapairedend --score-min=40 -r mullus_surmuletus_data/200221_SN234_A_L001_AIMI-199_R1.fastq mullus_surmuletus_data/200221_SN234_A_L001_AIMI-199_R2.fastq > AIMI-199.fastq
illuminapairedend --score-min=40 -r mullus_surmuletus_data/200221_SN234_A_L001_AIMI-200_R1.fastq mullus_surmuletus_data/200221_SN234_A_L001_AIMI-200_R2.fastq > AIMI-200.fastq
illuminapairedend --score-min=40 -r mullus_surmuletus_data/Med_R1.fastq mullus_surmuletus_data/Med_R2.fastq > Med.fastq
illuminapairedend --score-min=40 -r mullus_surmuletus_data/Atl_R1.fastq mullus_surmuletus_data/Atl_R2.fastq > Atl.fastq
## this function will create a new .fastq file which will contain the sequences
## this function creates a new ".fastq file" which will contain the sequences
## after the pair-end of forward and reverse sequences which have a quality
## score higher than 40 (-- score-min=40)
## score higher than 40 ("-- score-min=40")
## to only conserve the sequences which have been aligned, use obigrep :
obigrep -p 'mode!="joined"' AIMI-199.fastq > AIMI-199.ali.fastq
obigrep -p 'mode!="joined"' AIMI-200.fastq > AIMI-200.ali.fastq
## to only conserve the sequences which have been aligned, use "obigrep" :
obigrep -p 'mode!="joined"' Med.fastq > Med.ali.fastq
obigrep -p 'mode!="joined"' Atl.fastq > Atl.ali.fastq
## -p requires a python expression
## "-p" requires a python expression
## the unaligned sequences are notified with mode="joined" by illuminapairedend
## whereas the aligned sequences are notified with mode="aligned"
## the unaligned sequences are notified with "mode='joined'" by
## "illuminapairedend" whereas the aligned sequences are notified with
## "mode='aligned'"
## so here python creates new datasets (.ali.fastq) which only contain the
## so here python creates new datasets (".ali.fastq") which only contain the
## sequences notified "aligned"
########################################################################
#STEP 3 : Demultiplexing
## to be able to compare the sequences next, you need to remove tags and primers,
## and to use the function ngsfilter :
ngsfilter -t mullus_surmuletus_data/AIMI_199_corr_tags.txt -u AIMI-199.unidentified.fastq AIMI-199.ali.fastq > AIMI-199.ali.assigned.fastq
ngsfilter -t mullus_surmuletus_data/AIMI_200_corr_tags.txt -u AIMI-200.unidentified.fastq AIMI-200.ali.fastq > AIMI-200.ali.assigned.fastq
## and to use the function "ngsfilter":
ngsfilter -t mullus_surmuletus_data/Med_corr_tags.txt -u Med.unidentified.fastq Med.ali.fastq > Med.ali.assigned.fastq
ngsfilter -t mullus_surmuletus_data/Atl_corr_tags.txt -u Atl.unidentified.fastq Atl.ali.fastq > Atl.ali.assigned.fastq
## new files are created :
## .unidentified.fastq files contain the sequences that were not assigned
## ".unidentified.fastq" files contain the sequences that were not assigned
## whith a correct tag
## .ali.assigned.fastq files contain the sequences that were assigned with
## ".ali.assigned.fastq" files contain the sequences that were assigned with
## a correct tag, so it contains only the barcode sequences
## separate your .ali.assigned.fastq files depending on their samples,
## separate your ".ali.assigned.fastq" files depending on their samples,
## in placing them in a dedicated folder (useful for next steps) :
mkdir samples
## create the folder
mv -t samples AIMI-199.ali.assigned.fastq AIMI-200.ali.assigned.fastq
mv -t samples Med.ali.assigned.fastq Atl.ali.assigned.fastq
## place the latests .fastq files in the folder
## place the latests ".fastq" files in the folder
cd samples
obisplit -t experiment --fastq AIMI-199.ali.assigned.fastq
obisplit -t experiment --fastq AIMI-200.ali.assigned.fastq
obisplit -t sample --fastq Med.ali.assigned.fastq
obisplit -t sample --fastq Atl.ali.assigned.fastq
## separation of the files depending on their sample
mv -t ./dada2_and_obitools AIMI-199.ali.assigned.fastq AIMI-200.ali.assigned.fastq
mv -t ./dada2_and_obitools Med.ali.assigned.fastq Atl.ali.assigned.fastq
## removing the original files from the folder
obiuniq -m sample Aquarium_2.fastq > Aquarium_2.uniq.fasta
obiannotate -k count -k merged_sample Aquarium_2.uniq.fasta > Aquarium_2.1.uniq.fasta
obigrep -l 20 -p 'count>=10' Aquarium_2.1.uniq.fasta > Aquarium_2.grep.fasta
obiclean -s merged_sample -r 0.05 -H Aquarium_2.grep.fasta > Aquarium_2.clean.fasta
obigrep -p 'obiclean_internalcount==0' Aquarium_2.clean.fasta > Aquarium_2.clean.grep.fasta
obiannotate -k count Aquarium_2.clean.grep.fasta > Aquarium_2.1.clean.grep.fasta
########################################################################
#STEP 4 : Dereplication
## "obiniq" permits to eliminate the replications of each sequence, returning
## only the amplicons and their abundance :
obiuniq Aquarium_2.fastq > Aquarium_2.uniq.fasta
########################################################################
#STEP 5 : Filtering
## "obigrep" is a command useful for filtering, considering the length of the
## sequence and his abundance for example :
obigrep -l 20 -p 'count>=10' Aquarium_2.uniq.fasta > Aquarium_2.grep.fasta
## "-l 20" option filters sequences with a length shorter than 20 bp
## "-p "'count>=10'" option filters sequences with an abundance inferior to 10
########################################################################
#STEP 6 : Elimination of PCR errors
## eliminate the PCR errors with "obiclean" :
obiclean -r 0.05 -H Aquarium_2.grep.fasta > Aquarium_2.clean.fasta
## here, the command returns only the sequences tagged "head" by the algorithm,
## and the ratio retained is 0.05
########################################################################
#STEP 7 : Taxonomic assignment
## assign sequences with their corresponding taxon according to the reference
## database :
ecotag -m 0.5 -d ./only_obitools/embl_std -R ./only_obitools/base_ref_finale_formated.fasta Aquarium_2.clean.fasta > Aquarium_2.tag.fasta
## only the sequences with a similarity score higher than 0.5 are annotated
## to modify the attributes, use "obiannotate" :
obiannotate -k count Aquarium_2.tag.fasta > Aquarium_2.tag_1.fasta
## the only attribute "count" is conserved
########################################################################
#STEP 8 : Gathering in OTU
## use the swarm command to make the gathering :
swarm -z -d 1 -o stats_Aq2.txt -w pipeline3_Aq2.fasta < pipeline1_Aq2.tag_1.fasta
## la base de reference des individus rougets, et une base de données taxonomique du laboratoire rassemblant les especes de l’infra-classe Teleoste
ecotag -m 0.5 -d /media/bmace/LaCie/projets/only_obitools/embl_std -R /media/bmace/LaCie/projets/only_obitools/base_ref_finale_formated.fasta Aquarium_2.1.clean.grep.fasta > Aquarium_2.tag.fasta
obiannotate -k count Aquarium_2.tag.fasta > Aquarium_2.1.tag.fasta
#bioedit
## "-z" option permits to accept the abundance in the header, provided that
## there is no space in the header and that the value is preceded by "size="
obitab -o --output-field-separator=";" pipeline1_Aq2.fasta > pipeline1_Aq2.csv
## "-d" is the maximal number of differences tolerated between 2 sequences to be
## gathered in the same OTU
sumatra -t 1 pipeline1_Aq2.fasta /media/bmace/LaCie/projets/only_obitools/base_ref_finale_formated.fasta > pipeline1_Aq2.txt
## "-o" option returns a ".txt" file in which each line corresponds to an OTU
## with all the amplicons belonging to this OTU
#modifier pipeline1_Aq2.fasta pour avoir le bon format pour swarm en pipeline1_Aq2.1.fasta
swarm -z -d 0 -w pipeline1_Aq2.2.fasta -o /dev/null pipeline1_Aq2.1.fasta #si besoin de dérépliquer
swarm -z -d 1 -f -t 10 -o stats_Aq2.txt -w pipeline3_Aq2.fasta < pipeline1_Aq2.2.fasta
## "-w" option gives a "fasta" file with the representative sequence of each OTU
#modifier avec le filtrage d'Elbrecht
sumatra -t 1 pipeline3_Aq2.fasta /media/bmace/LaCie/projets/swarm_and_obitools/base_ref_finale_formated.fasta > pipeline3_Aq2.txt
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment