Commit 380c0efc authored by Bastien Macé's avatar Bastien Macé
Browse files

script adjust

parent aeed260a
...@@ -11,7 +11,7 @@ _________________________________ ...@@ -11,7 +11,7 @@ _________________________________
* [Installation](#installation) * [Installation](#installation)
+ [Preliminary steps for OBITools](#preliminary-steps-for-obitools) + [Preliminary steps for OBITools](#preliminary-steps-for-obitools)
+ [Preliminary steps for swarm](#preliminary-steps-for-swarm) + [Preliminary steps for swarm](#preliminary-steps-for-swarm)
* [STEP 1 : Pair-end sequencing (OBITools)](#step1) * [STEP 1 : Pair-ended merging (OBITools)](#step1)
* [STEP 2 : Demultiplexing (OBITools)](#step2) * [STEP 2 : Demultiplexing (OBITools)](#step2)
* [STEP 3 : Dereplication (OBITools)](#step3) * [STEP 3 : Dereplication (OBITools)](#step3)
* [STEP 4 : Filtering (OBITools)](#step4) * [STEP 4 : Filtering (OBITools)](#step4)
...@@ -31,7 +31,7 @@ For that, we will use the OBITools commands and swarm. ...@@ -31,7 +31,7 @@ For that, we will use the OBITools commands and swarm.
- [OBITools](https://git.metabarcoding.org/obitools/obitools/wikis/home) are commands written in python - [OBITools](https://git.metabarcoding.org/obitools/obitools/wikis/home) are commands written in python
- [swarm](https://github.com/torognes/swarm) is a command written in C++ and which can be used with a Unix shell - [swarm](https://github.com/torognes/swarm) is a command written in C++ and which can be used with a Unix shell
In this example, 2 datasets are used, because the study analyzes the sequencing of 2 tiles. In this example, two datasets are used because the study analyzes the result of a pair-end sequencing (Example of filtrated eDNA from aquarium seawater).
## Installation ## Installation
...@@ -87,7 +87,7 @@ cp -r ./bin/swarm /usr/local/bin ...@@ -87,7 +87,7 @@ cp -r ./bin/swarm /usr/local/bin
``` ```
<a name="step1"></a> <a name="step1"></a>
## STEP 1 : Pair-end sequencing (OBITools) ## STEP 1 : Pair-ended merging
First, unzip your data in your shell if you need : First, unzip your data in your shell if you need :
``` ```
...@@ -99,95 +99,94 @@ Activate your environment in your shell : ...@@ -99,95 +99,94 @@ Activate your environment in your shell :
conda activate obitools conda activate obitools
``` ```
Use the function _illuminapairedend_ to make the pair-end sequencing from the forward and reverse strands of the sequences you have in your data. In other words, the function aligns the complementary strands in order to get a longer sequence. In fact, during PCR, the last bases are rarely correctly sequenced. So having the forward and the reverse strands allows to lenghten the sequence, thanks to the beginning of the reverse strand, which is usually correctly sequenced. Use the command _illuminapairedend_ to make the pair-ended merging from the forward and reverse strands of the sequences you have in your data. The command aligns the complementary strands in order to get a longer sequence. In fact, after PCR, the last bases are rarely correctly sequenced. So having the forward and the reverse strands allows to lenghten the sequence, thanks to the beginning of the reverse strand, which is usually correctly sequenced.
``` ```
illuminapairedend --score-min=40 -r mullus_surmuletus_data/Med_R1.fastq mullus_surmuletus_data/Med_R2.fastq > Med.fastq illuminapairedend --score-min=40 -r mullus_surmuletus_data/Aquarium_2_R1.fastq mullus_surmuletus_data/Aquarium_2_R2.fastq > Aquarium_2.fastq
illuminapairedend --score-min=40 -r mullus_surmuletus_data/Atl_R1.fastq mullus_surmuletus_data/Atl_R2.fastq > Atl.fastq # a new .fastq file is created, it contains the sequences after the merging of forward and reverse strands
# a new .fastq file is created, it contains the sequences after the pair-end of forward and reverse sequences which have a quality score higher than 40 (-- score-min=40) # alignments which have a quality score higher than 40 (-- score-min=40) are merged and annotated "aligned", while alignemnts with a lower quality score are concatenated and annotated "joined"
``` ```
To only conserve the sequences which have been aligned, use _obigrep_ : To only conserve the sequences which have been merged, use _obigrep_ :
``` ```
obigrep -p 'mode!="joined"' Med.fastq > Med.ali.fastq obigrep -p 'mode!="joined"' Aquarium_2.fastq > Aquarium_2.ali.fastq
obigrep -p 'mode!="joined"' Atl.fastq > Atl.ali.fastq
# -p requires a python expression # -p requires a python expression
# the unaligned sequences are notified with mode="joined" whereas the aligned sequences are notified with mode="aligned" # python creates a new dataset (.ali.fastq) which only contains the sequences annotated "aligned"
# so here python creates new datasets (.ali.fastq) which only contain the sequences notified "aligned"
``` ```
<a name="step2"></a> <a name="step2"></a>
## STEP 2 : Demultiplexing (OBITools) ## STEP 2 : Demultiplexing
The _.txt_ files assign each sequence to its sample thanks to its tag because each tag corresponds to a reverse or a forward sequence from a sample. A _.txt_ file assigns each sequence to its sample thanks to its tag, because each tag corresponds to a reverse or a forward sequence from a sample.
To be able to compare the sequences next, you need to remove tags and primers, and to use the function _ngsfilter_ : To compare the sequences next, you need to remove the tags and the primers, by using the _ngsfilter_ command :
``` ```
ngsfilter -t mullus_surmuletus_data/Med_corr_tags.txt -u Med.unidentified.fastq Med.ali.fastq > Med.ali.assigned.fastq ngsfilter -t mullus_surmuletus_data/Med_corr_tags.txt -u Aquarium_2.unidentified.fastq Aquarium_2.ali.fastq > Aquarium_2.ali.assigned.fastq
ngsfilter -t mullus_surmuletus_data/Atl_corr_tags.txt -u Atl.unidentified.fastq Atl.ali.fastq > Atl.ali.assigned.fastq # the command creates new files :
# the function creates new files : # ".unidentified.fastq" file contains the sequences that were not assigned whith a correct tag
# ".unidentified.fastq" files contain the sequences that were not assigned whith a correct tag # ".ali.assigned.fastq" file contains the sequences that were assigned with a correct tag, so they contain only the barcode sequences
# ".ali.assigned.fastq" files contain the sequences that were assigned with a correct tag, so they contain only the barcode sequences
``` ```
Then, separate your _.ali.assigned.fastq_ files depending on their samples in placing them in a dedicated folder (useful for next steps) : Then, separate your _.ali.assigned.fastq_ files depending on their samples in placing them in a dedicated folder (useful for next steps) :
``` ```
mkdir samples mkdir samples
# create the folder # create the folder
mv -t samples Med.ali.assigned.fastq Atl.ali.assigned.fastq mv -t samples Aquarium_2.ali.assigned.fastq
# place the latests .fastq files in the folder # place the latests ".fastq" files in the folder
cd samples cd samples
obisplit -t samples --fastq sample/Med.ali.assigned.fastq obisplit -t samples --fastq sample/Aquarium_2.ali.assigned.fastq
obisplit -t samples --fastq sample/Atl.ali.assigned.fastq # separate the files depending on their samples
# separation of the files depending on their samples mv -t ./swarm_and_obitools Aquarium_2.ali.assigned.fastq
mv -t ./dada2_and_obitools Med.ali.assigned.fastq Atl.ali.assigned.fastq # remove the original files from the folder
# removing the original files from the folder
``` ```
Now you have as many files as samples, containing pair-ended and demultiplexed sequences. Now you have as many files as samples, containing merged pair-ended and demultiplexed sequences.
<a name="step3"></a> <a name="step3"></a>
## STEP 3 : Dereplication (OBITools) ## STEP 3 : Dereplication
Now that you have the sequences corresponding to the barcode you want to study, dereplicate them to only conserve the amplicons with their abundance stored in the header : Now that you have the sequences corresponding to the barcode you want to study, dereplicate them to only conserve the amplicons with their abundance stored in the header with _obiuniq_ :
``` ```
obiuniq Aquarium_2.fastq > Aquarium_2.uniq.fasta obiuniq Aquarium_2.fastq > Aquarium_2.uniq.fasta
``` ```
This command also transforms _fastq_ files into fasta format.
<a name="step4"></a> <a name="step4"></a>
## STEP 4 : Filtering (OBITools) ## STEP 4 : Filtering
The _obigrep_ command filters the sequences according to different criteria which you can chose, such as the sequence length, or the abundance of the amplicons : The _obigrep_ command filters the sequences according to different criteria which you can chose, such as the sequence length, or the abundance of the amplicons :
``` ```
obigrep -l 20 -p 'count>=10' Aquarium_2.uniq.fasta > Aquarium_2.grep.fasta obigrep -l 20 -p 'count>=10' Aquarium_2.uniq.fasta > Aquarium_2.grep.fasta
# "-l 20" option filters sequences with a length shorter than 20 bp # "-l 20" option eliminates sequences with a length shorter than 20 bp
# "-p "'count>=10'" option filters sequences with an abundance inferior to 10 # "-p 'count>=10'" option eliminates sequences with an abundance inferior to 10
``` ```
<a name="step5"></a> <a name="step5"></a>
## STEP 5 : Elimination of PCR errors (OBITools) ## STEP 5 : Elimination of PCR errors
_obiclean_ is a command which eliminates punctual errors caused during PCR. The algorithm makes parwise alignments for all the amplicons. It counts the number of dissimilarities between the amplicons, and calculates the ratio between the abundance of the 2 amplicons. If there is only one dissimilarity (parameter by default, but can be modified) and if the ratio is lower than a chosen threshold, the less abundant amplicon is considered as a variant of the most abundant one. _obiclean_ is a command which eliminates punctual errors caused during PCR. The algorithm makes parwise alignments for all the amplicons. It counts the number of dissimilarities between the amplicons, and calculates the ratio between the abundance of the two amplicons aligned. If there is only one dissimilarity (parameter by default, can be modified) and if the ratio is lower than a chosen threshold, the less abundant amplicon is considered as a variant of the most abundant one.
Sequences which are at the origin of variants without being considered as one are tagged "head". The variants are tagged "internal". The other sequences are tagged "singleton". Sequences which are at the origin of variants without being considered as one are tagged "head". The variants are tagged "internal". The other sequences are tagged "singleton".
``` ```
obiclean -r 0.05 -H Aquarium_2.grep.fasta > Aquarium_2.clean.fasta obiclean -r 0.05 -H Aquarium_2.grep.fasta > Aquarium_2.clean.fasta
# here, the command returns only the sequences tagged "head" by the algorithm, and the ratio retained is 0.05 # here, the command returns only the sequences tagged "head" by the algorithm, and the chosen ratio is 0.05
``` ```
<a name="step6"></a> <a name="step6"></a>
## STEP 6 : Taxonomic assignment (OBITools) ## STEP 6 : Taxonomic assignment
_ecotag_ is a command which permits to assign each head amplicon to its corresponding taxon. It requires to first having used [ecoPCR](https://git.metabarcoding.org/obitools/ecopcr/wikis/home) with your primers used to amplify your sequences. This command have given a file containing the taxons which can potentially be amplified by the selected primers. _ecotag_ permits to assign your sequences to one of these taxons, with a minimum similarity score fixed at a chosen value, and so to be sure that your sequences come from the correct taxon. However, this step is optionnal is your primers are specific enough.
_ecotag_ is the command which permits to assign each head amplicon to its corresponding taxon. The algorithm compares the amplicons with the sequences from the reference database. If the similarity score is higher than the threshold chosen, the amplicon is assigned to its "taxid" thanks to the taxonomy database.
``` ```
ecotag -m 0.5 -d ./only_obitools/embl_std -R ./only_obitools/base_ref_finale_formated.fasta Aquarium_2.clean.fasta > Aquarium_2.tag.fasta ecotag -m 0.5 -d ./only_obitools/embl_std -R ./only_obitools/base_ref_finale_formated.fasta Aquarium_2.clean.fasta > Aquarium_2.tag.fasta
# only the sequences with a similarity score higher than 0.5 are annotated # only the sequences with a similarity score higher than 0.95 are annotated to their corresponding taxon
``` ```
Then, after a selection of the amplicons corresponding to your studied taxon, you can eliminate the non-interesting attributes. Here, we only conserve the amplicons abundance : Then, after a selection of the amplicons corresponding to your studied taxon, you can eliminate the non-interesting attributes. Here, we only conserve the amplicons abundance :
``` ```
obiannotate -k count Aquarium_2.tag.fasta > Aquarium_2.tag_1.fasta obiannotate -k count Aquarium_2.tag.fasta > Aquarium_2.tag_1.fasta
# the only attribute "count" is conserved # only the attribute "count" is conserved
``` ```
<a name="step7"></a> <a name="step7"></a>
...@@ -201,7 +200,7 @@ _swarm_ gathers the sequences in OTU thanks to this algorithm : ...@@ -201,7 +200,7 @@ _swarm_ gathers the sequences in OTU thanks to this algorithm :
- The abundance of the OTU is constituted by adding the abundances of each sequence included in the OTU - The abundance of the OTU is constituted by adding the abundances of each sequence included in the OTU
``` ```
swarm -z -d 1 -o stats_Aq2.txt -w pipeline3_Aq2.fasta < pipeline1_Aq2.tag_1.fasta swarm -z -d 1 -o stats_Aquarium_2.txt -w Aquarium_2.end.fasta < Aquarium_2.tag_1.fasta
# "-z" option permits to accept the abundance in the header, provided that there is no space in the header and that the value is preceded by "size=" # "-z" option permits to accept the abundance in the header, provided that there is no space in the header and that the value is preceded by "size="
# "-d" is the maximal number of differences tolerated between 2 sequences to be gathered in the same OTU # "-d" is the maximal number of differences tolerated between 2 sequences to be gathered in the same OTU
# "-o" option returns a ".txt" file in which each line corresponds to an OTU with all the amplicons belonging to this OTU # "-o" option returns a ".txt" file in which each line corresponds to an OTU with all the amplicons belonging to this OTU
......
...@@ -9,7 +9,7 @@ ENVYAML=./dada2_and_obitools/obitools_env_conda.yaml ...@@ -9,7 +9,7 @@ ENVYAML=./dada2_and_obitools/obitools_env_conda.yaml
conda env create -f $ENVYAML conda env create -f $ENVYAML
######################################################################## ########################################################################
#STEP 2 : Pair-end sequencing #STEP 2 : Pair-ended merging
## unzip your data if you need : ## unzip your data if you need :
unzip mullus_surmuletus_data.zip unzip mullus_surmuletus_data.zip
...@@ -17,41 +17,38 @@ unzip mullus_surmuletus_data.zip ...@@ -17,41 +17,38 @@ unzip mullus_surmuletus_data.zip
## activate your environment : ## activate your environment :
conda activate obitools conda activate obitools
## use the function "illuminapairedend" to make the pair-end sequencing from ## use the command "illuminapairedend" to make the pair-ended merging
## the forward and reverse sequences you have in your data : ## from the forward and reverse sequences you have in your data :
illuminapairedend --score-min=40 -r mullus_surmuletus_data/Med_R1.fastq mullus_surmuletus_data/Med_R2.fastq > Med.fastq illuminapairedend --score-min=40 -r mullus_surmuletus_data/Aquarium_2_R1.fastq mullus_surmuletus_data/Aquarium_2_R2.fastq > Aquarium_2.fastq
illuminapairedend --score-min=40 -r mullus_surmuletus_data/Atl_R1.fastq mullus_surmuletus_data/Atl_R2.fastq > Atl.fastq
## this function creates a new ".fastq file" which will contain the sequences ## this command creates a new ".fastq" file which will contain the
## after the pair-end of forward and reverse sequences which have a quality ## sequences after the merging of forward and reverse strands
## score higher than 40 ("-- score-min=40")
## alignments which have a quality score higher than 40
## (-- score-min=40) are merged and annotated "aligned", while
## alignemnts with a lower quality score are concatenated and
## annotated "joined"
## to only conserve the sequences which have been aligned, use "obigrep" : ## to only conserve the sequences which have been aligned, use "obigrep" :
obigrep -p 'mode!="joined"' Med.fastq > Med.ali.fastq obigrep -p 'mode!="joined"' Aquarium_2.fastq > Aquarium_2.ali.fastq
obigrep -p 'mode!="joined"' Atl.fastq > Atl.ali.fastq
## "-p" requires a python expression ## "-p" requires a python expression
## the unaligned sequences are notified with "mode='joined'" by ## python creates a new dataset (".ali.fastq") which only contains the
## "illuminapairedend" whereas the aligned sequences are notified with ## sequences annotated "aligned"
## "mode='aligned'"
## so here python creates new datasets (".ali.fastq") which only contain the
## sequences notified "aligned"
######################################################################## ########################################################################
#STEP 3 : Demultiplexing #STEP 3 : Demultiplexing
## to be able to compare the sequences next, you need to remove tags and primers, ## to compare the sequences next, you need to remove the tags and the
## and to use the function "ngsfilter": ## primers, by using the "ngsfilter" command
ngsfilter -t mullus_surmuletus_data/Med_corr_tags.txt -u Med.unidentified.fastq Med.ali.fastq > Med.ali.assigned.fastq ngsfilter -t mullus_surmuletus_data/Med_corr_tags.txt -u Aquarium_2.unidentified.fastq Aquarium_2.ali.fastq > Aquarium_2.ali.assigned.fastq
ngsfilter -t mullus_surmuletus_data/Atl_corr_tags.txt -u Atl.unidentified.fastq Atl.ali.fastq > Atl.ali.assigned.fastq
## new files are created : ## new files are created :
## ".unidentified.fastq" files contain the sequences that were not assigned ## ".unidentified.fastq" file contains the sequences that were not
## whith a correct tag ## assigned whith a correct tag
## ".ali.assigned.fastq" files contain the sequences that were assigned with ## ".ali.assigned.fastq" file contains the sequences that were assigned
## a correct tag, so it contains only the barcode sequences ## with a correct tag, so it contains only the barcode sequences
## separate your ".ali.assigned.fastq" files depending on their samples, ## separate your ".ali.assigned.fastq" files depending on their samples,
## in placing them in a dedicated folder (useful for next steps) : ## in placing them in a dedicated folder (useful for next steps) :
...@@ -59,36 +56,36 @@ mkdir samples ...@@ -59,36 +56,36 @@ mkdir samples
## create the folder ## create the folder
mv -t samples Med.ali.assigned.fastq Atl.ali.assigned.fastq mv -t samples Aquarium_2.ali.assigned.fastq
## place the latests ".fastq" files in the folder ## place the latests ".fastq" files in the folder
cd samples cd samples
obisplit -t sample --fastq Med.ali.assigned.fastq obisplit -t sample --fastq Aquarium_2.ali.assigned.fastq
obisplit -t sample --fastq Atl.ali.assigned.fastq
## separation of the files depending on their sample ## separate the files depending on their sample
mv -t ./dada2_and_obitools Med.ali.assigned.fastq Atl.ali.assigned.fastq mv -t ./only_obitools Aquarium_2.ali.assigned.fastq
## removing the original files from the folder ## remove the original files from the folder
######################################################################## ########################################################################
#STEP 4 : Dereplication #STEP 4 : Dereplication
## "obiniq" permits to eliminate the replications of each sequence, returning ## "obiuniq" permits to eliminate the replications of each sequence,
## only the amplicons and their abundance : ## returning only the amplicons and their abundance :
obiuniq Aquarium_2.fastq > Aquarium_2.uniq.fasta obiuniq Aquarium_2.ali.assigned.fastq > Aquarium_2.uniq.fasta
######################################################################## ########################################################################
#STEP 5 : Filtering #STEP 5 : Filtering
## "obigrep" is a command useful for filtering, considering the length of the ## "obigrep" is a command useful for filtering, considering the length
## sequence and his abundance for example : ## of the sequence and his abundance for example :
obigrep -l 20 -p 'count>=10' Aquarium_2.uniq.fasta > Aquarium_2.grep.fasta obigrep -l 20 -p 'count>=10' Aquarium_2.uniq.fasta > Aquarium_2.grep.fasta
## "-l 20" option filters sequences with a length shorter than 20 bp ## "-l 20" option eliminates sequences with a length shorter than 20 bp
## "-p "'count>=10'" option filters sequences with an abundance inferior to 10 ## "-p 'count>=10'" option eliminates sequences with an abundance
## inferior to 10
######################################################################## ########################################################################
#STEP 6 : Elimination of PCR errors #STEP 6 : Elimination of PCR errors
...@@ -96,38 +93,39 @@ obigrep -l 20 -p 'count>=10' Aquarium_2.uniq.fasta > Aquarium_2.grep.fasta ...@@ -96,38 +93,39 @@ obigrep -l 20 -p 'count>=10' Aquarium_2.uniq.fasta > Aquarium_2.grep.fasta
## eliminate the PCR errors with "obiclean" : ## eliminate the PCR errors with "obiclean" :
obiclean -r 0.05 -H Aquarium_2.grep.fasta > Aquarium_2.clean.fasta obiclean -r 0.05 -H Aquarium_2.grep.fasta > Aquarium_2.clean.fasta
## here, the command returns only the sequences tagged "head" by the algorithm, ## here, the command only returns the sequences tagged "head" by the
## and the ratio retained is 0.05 ## algorithm, and the chosen ratio is 0.05
######################################################################## ########################################################################
#STEP 7 : Taxonomic assignment #STEP 7 : Taxonomic assignment
## assign sequences with their corresponding taxon according to the reference ## assign sequences with their corresponding taxon according to the
## database : ## "ecoPCR" output :
ecotag -m 0.5 -d ./only_obitools/embl_std -R ./only_obitools/base_ref_finale_formated.fasta Aquarium_2.clean.fasta > Aquarium_2.tag.fasta ecotag -m 0.5 -d ./only_obitools/embl_std -R ./only_obitools/base_ref_finale_formated.fasta Aquarium_2.clean.fasta > Aquarium_2.tag.fasta
## only the sequences with a similarity score higher than 0.5 are annotated ## only the sequences with a similarity score higher than 0.95 are
## annotated to their corresponding taxon
## to modify the attributes, use "obiannotate" : ## to modify the attributes, use "obiannotate" :
obiannotate -k count Aquarium_2.tag.fasta > Aquarium_2.tag_1.fasta obiannotate -k count Aquarium_2.tag.fasta > Aquarium_2.tag_1.fasta
## the only attribute "count" is conserved ## only the attribute "count" is conserved
######################################################################## ########################################################################
#STEP 8 : Gathering in OTU #STEP 8 : Gathering in OTU
## use the swarm command to make the gathering : ## use the swarm command to make the gathering :
swarm -z -d 1 -o stats_Aq2.txt -w pipeline3_Aq2.fasta < pipeline1_Aq2.tag_1.fasta swarm -z -d 1 -o stats_Aquarium_2.txt -w Aquarium_2.end.fasta < Aquarium_2.tag_1.fasta
## "-z" option permits to accept the abundance in the header, provided that
## there is no space in the header and that the value is preceded by "size="
## "-d" is the maximal number of differences tolerated between 2 sequences to be ## "-z" option permits to accept the abundance in the header, provided
## gathered in the same OTU ## thet there is no space in the header and that the value is preceded
## by "size="
## "-o" option returns a ".txt" file in which each line corresponds to an OTU ## "-d" is the maximal number of differences tolerated between two
## with all the amplicons belonging to this OTU ## sequences to be gathered in the same OTU
## "-w" option gives a "fasta" file with the representative sequence of each OTU ## "-o" option returns a ".txt" file in which each line corresponds to
## an OTU with all the amplicons belonging to this OTU
#modifier avec le filtrage d'Elbrecht ## "-w" option gives a "fasta" file with the representative sequence of
## each OTU
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment