Commit 7d8bdb32 authored by peguerin's avatar peguerin

main pipeline update

parent 4cb5467c
......@@ -5,8 +5,8 @@ Only_obitools pipeline
1. [Introduction](#1-introduction)
2. [Installation](#2-installation)
1. [Requirements](#21-requirements)
2. [Optional components](#22-optional-components)
3. [Reporting bugs](#3-reporting-bugs)
4. [Running the pipeline](#4-running-the-pipeline)
......@@ -29,6 +29,7 @@ programs and libraries you absolutely need are:
- [OBItools](https://pythonhosted.org/OBITools/welcome.html#installing-the-obitools)
- [GNU Parallel](https://www.gnu.org/software/parallel/)
## 2.2. Optional components
......@@ -48,15 +49,30 @@ verbose as possible — e.g. include the command line, etc
# 4. Running the pipeline
Quickstart

1. open a shell
2. create a new folder for nextflow to work in (name it yourself; I named it workdir) and switch to it:
```
mkdir workdir
cd workdir
```
3. clone the project and switch to the main folder, it's your working directory:
```
git clone http://gitlab.mbb.univ-montp2.fr/edna/snakemake_only_obitools.git
cd only_obitools
```
4. type `curl -fsSL get.nextflow.io | bash` to download nextflow into this folder
5. make sure that the programs listed in the Requirements section above are installed on your machine. After nextflow is downloaded, replace all the "YOUR_***" parts in the following command with your own paths
6. run your command: `./nextflow run main.nf`

That's it! The pipeline is running and crunching your data. Look for `overview.txt` or `overview_new.txt` in your output folder after the pipeline is finished.
* define 2 folders:
- a folder which contains the reference database files. You can build a reference database by following the instructions [here](projet_builtdatabase).
- a folder which contains the paired-end raw reads `.fastq.gz` files and the sample description `.dat` files. Raw reads files from the same pair must be named `*_R1.fastq.gz` and `*_R2.fastq.gz`, where the wildcard `*` is the name of the sequencing run. The alphanumeric order of the names of the sample description `.dat` files must be the same as the order of the names of the paired-end raw reads `.fastq.gz` files. The sample description file is a text file where each line describes one sample. Columns are separated by space or tab characters. The sample description file format is described [here](https://pythonhosted.org/OBITools/scripts/ngsfilter.html).
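For illustration, a minimal sample description file might look like the sketch below. The experiment name, sample names, tags and primer sequences are all hypothetical placeholders; the linked ngsfilter documentation is the authoritative reference for the column layout:
```
# experiment  sample      tags      forward_primer       reverse_primer
run42_teleo   sample_01   aattaac   ACACCGCCCGTCACTCT    CTTCCGGTACACTTACCATG   F
run42_teleo   sample_02   gaagtag   ACACCGCCCGTCACTCT    CTTCCGGTACACTTACCATG   F
```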
* run the pipeline:
```
bash pipeline.sh /path/to/data /path/to/baseofreference
```
The order of the arguments is important:
1. absolute path to the folder which contains the paired-end raw reads files and the sample description files
2. absolute path to the folder which contains the reference database files
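As a sketch, the two positional arguments map onto variables inside `pipeline.sh` roughly like this (the paths are hypothetical placeholders):

```shell
# simulate the command line: bash pipeline.sh /path/to/data /path/to/baseofreference
set -- /path/to/data /path/to/baseofreference

DATA_PATH=$1   # 1st argument: folder with paired-end reads and .dat files
BDR_PATH=$2    # 2nd argument: folder with the reference database files

echo "data:      $DATA_PATH"
echo "reference: $BDR_PATH"
```

Swapping the two arguments would silently point the pipeline at the wrong folders, so double-check the order before launching a long run.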
params.str = 'Hello world!'

// toy process: echo a string into a file
process echostr {
    output:
    file "coucou.txt" into record

    //when:
    // < condition >

    script:
    """
    echo '${params.str}' > coucou.txt
    """
}

// read the file back and emit its content on stdout
process readfile {
    input:
    file tt from record

    output:
    stdout result

    script:
    """
    cat $tt
    """
}

result.subscribe { println it }
params.workingfolder="/media/superdisk/edna/training/peg/gitlab_test/only_obitools"
params.datafolder="/media/superdisk/edna/donnees/rhone_test"

sequences = Channel.fromFilePairs(params.datafolder+"/*_R{1,2}.fastq.gz", flat: true)
barcodes = Channel.fromPath(params.datafolder+"/*.dat")

// [t=2h] paired-end alignment, then keep reads with quality > 40
process illuminapairedend {
    input:
    set val(id), file(R1_fastq), file(R2_fastq) from sequences

    output:
    file 'fastqMerged' into fastqMergeds

    script:
    """
    illuminapairedend -r $R2_fastq $R1_fastq --score-min=40 > fastqMerged
    """
}

// [t=1h] remove unaligned sequence records
process remove_unaligned {
    input:
    file fastqMerged from fastqMergeds

    output:
    file 'mergedAligned' into mergedAligneds

    script:
    """
    obigrep -p 'mode!="joined"' $fastqMerged > mergedAligned
    """
}

// [t=6h] assign each sequence record to the corresponding sample/marker combination
process assign_sequences {
    input:
    file mergedAligned from mergedAligneds
    file barcode from barcodes

    output:
    file 'assignedMerged' into assignedMergeds
    file 'unassignedMerged' into unassignedMergeds

    script:
    """
    ngsfilter -t $barcode -u unassignedMerged $mergedAligned --fasta-output > assignedMerged
    """
}

// split the input sequence file into a set of subfiles according to the values of attribute "sample"
process split_sequences {
    input:
    file assignedMerged from assignedMergeds

    output:
    file 'sample_*.fasta' into demultiplexed mode flatten

    script:
    """
    obisplit -p "sample_" -t sample --fasta $assignedMerged
    """
}

// dereplicate reads into unique sequences
process dereplicate {
    input:
    file sampleSplit from demultiplexed

    output:
    file 'dereplicated' into dereplicateds

    script:
    """
    obiuniq -m sample $sampleSplit > dereplicated
    """
}
###############################################################################
##define global variables
### absolute path to the folder which contains paired-end raw reads files and sample description file
DATA_PATH=$1
if [[ -z "$DATA_PATH" ]]
then
EDNA_PATH=/media/superdisk/edna
DATA_PATH="$EDNA_PATH"/donnees/rhone_all
fi
### absolute path to the folder which contains reference database files
BDR_PATH=$2
if [[ -z "$BDR_PATH" ]]
then
EDNA_PATH=/media/superdisk/edna
BDR_PATH="$EDNA_PATH"/donnees/rhone_all
fi
###############################################################################
## list fastq files
for i in `ls "$DATA_PATH"/*_R1.fastq.gz`;
do
basename $i | cut -d "." -f 1 | sed 's/_R1//g'
......@@ -11,11 +25,14 @@ for i in `ls "$DATA_PATH"/*dat`;
do
echo $i
done > liste_dat
## list of fastq files and corresponding sample description files
paste liste_fq liste_dat > liste_fq_dat
rm liste_fq liste_dat
## writing bash script with all commands for each pair of fastq and corresponding .dat files
while IFS= read -r var
do
echo "bash pipeline_single.sh "$var" "$DATA_PATH" "$BDR_PATH
done < liste_fq_dat > fq_dat_cmd.sh
###############################################################################
## run in parallel each command
parallel < fq_dat_cmd.sh
###############################################################################
## define global variables
### absolute path to the folder which contains paired-end raw reads files and the sample description file
DATA_PATH=$3
### prefix of all the files generated by this run
pref=$1
pref_bdr="std"
### absolute paths of the paired-end fastq.gz files
R1_fastq="$DATA_PATH"/"$pref"_R1.fastq.gz
R2_fastq="$DATA_PATH"/"$pref"_R2.fastq.gz
### absolute path to the corresponding sample description file of this run
sample_description_file=$2
### absolute path to the folder which contains reference database files
base_dir=$4
### path to the folder which stores intermediate and temporary results
main_dir=$(pwd)/main
### path to the folder which contains final results tables for this run
fin_dir=$(pwd)/final
###############################################################################
## running pipeline
##[t=2h]paired end alignment then keep reads with quality > 40
illuminapairedend -r $R2_fastq $R1_fastq --score-min=40 > $main_dir/"$pref".fastq
##[t=1h]remove unaligned sequence records
......@@ -22,9 +25,8 @@ obigrep -p 'mode!="joined"' $main_dir/"$pref".fastq > $main_dir/"$pref".ali.fast
ngsfilter -t $sample_description_file -u $main_dir/"$pref"_unidentified.fastq $main_dir/"$pref".ali.fastq --fasta-output > $main_dir/"$pref".ali.assigned.fasta
##split the input sequence file in a set of subfiles according to the values of attribute `sample`
obisplit -p $main_dir/"$pref"_sample_ -t sample --fasta $main_dir/"$pref".ali.assigned.fasta
## write bash script to run in PARALLEL
all_samples_parallel_cmd_sh=$main_dir/"$pref"_sample_parallel_cmd.sh
echo "" > $all_samples_parallel_cmd_sh
for sample in `ls $main_dir/"$pref"_sample_*.fasta`;
do
......@@ -51,7 +53,7 @@ all_sample_sequences_uniq="${all_sample_sequences_clean/.fasta/.uniq.fasta}"
obiuniq -m sample $all_sample_sequences_clean > $all_sample_sequences_uniq
##Assign each sequence to a taxon
all_sample_sequences_tag="${all_sample_sequences_uniq/.fasta/.tag.fasta}"
ecotag -d "$base_dir"/embl_* -R $base_dir/db_*.fasta $all_sample_sequences_uniq > $all_sample_sequences_tag
##Some unneeded attributes can be removed at this stage
all_sample_sequences_ann="${all_sample_sequences_tag/.fasta/.ann.fasta}"
obiannotate --delete-tag=scientific_name_by_db --delete-tag=obiclean_samplecount \
......@@ -71,4 +73,3 @@ obitab -o $all_sample_sequences_sort > $fin_dir/"$pref".csv
## define global variables
EDNA_PATH=/media/superdisk/edna
#pref_fastq="161124_SND393_A_L005_GWM-849"
pref_fastq=$1
#pref="all_rhone"
pref=$1
pref_bdr="std"
#R1_fastq="$EDNA_PATH"/donnees/rhone_all/"$pref_fastq"_R1.fastq.gz
#R2_fastq="$EDNA_PATH"/donnees/rhone_all/"$pref_fastq"_R2.fastq.gz
R1_fastq="$EDNA_PATH"/donnees/rhone_all/"$pref_fastq"_R1.fastq.gz
R2_fastq="$EDNA_PATH"/donnees/rhone_all/"$pref_fastq"_R2.fastq.gz
#sample_description_file=$EDNA_PATH/donnees/rhone_all/MB1016K_Teleo.dat
sample_description_file=$2
base_dir=$EDNA_PATH/donnees/basedereference
main_dir=$EDNA_PATH/working/only_obitools/rhone_all/main
fin_dir=$EDNA_PATH/working/only_obitools/rhone_all/final
all_samples_parallel_cmd_sh=$main_dir/"$pref"_sample_parallel_cmd.sh
##PARALLEL
echo "" > $all_samples_parallel_cmd_sh
for sample in `ls $main_dir/"$pref"_sample_*.fasta`;
do
sample_sh="${sample/.fasta/_cmd.sh}"
echo "bash "$sample_sh >> $all_samples_parallel_cmd_sh
dereplicated_sample="${sample/.fasta/.uniq.fasta}"
###keep only sequences longer than 20 bp, with no IUPAC ambiguity, and with total coverage greater than 10 reads
good_sequence_sample="${dereplicated_sample/.fasta/.l20.fasta}"
###Clean the sequences for PCR/sequencing errors (sequence variants)
r_sequence_sample="${good_sequence_sample/.fasta/.r005.fasta}"
###Remove sequence which are classified as 'internal' by obiclean
clean_sequence_sample="${r_sequence_sample/.fasta/.clean.fasta}"
done
parallel < $all_samples_parallel_cmd_sh
all_sample_sequences_clean=$main_dir/"$pref"_all_sample_clean.fasta
cat $main_dir/"$pref"_sample_*.uniq.l20.r005.clean.fasta > $all_sample_sequences_clean
##dereplicate and merge samples together
all_sample_sequences_uniq="${all_sample_sequences_clean/.fasta/.uniq.fasta}"
obiuniq -m sample $all_sample_sequences_clean > $all_sample_sequences_uniq
##Assign each sequence to a taxon
all_sample_sequences_tag="${all_sample_sequences_uniq/.fasta/.tag.fasta}"
ecotag -d "$base_dir"/embl_"$pref_bdr" -R $base_dir/db_"$pref_bdr".fasta $all_sample_sequences_uniq > $all_sample_sequences_tag
##Some unneeded attributes can be removed at this stage
all_sample_sequences_ann="${all_sample_sequences_tag/.fasta/.ann.fasta}"
obiannotate --delete-tag=scientific_name_by_db --delete-tag=obiclean_samplecount \
--delete-tag=obiclean_count --delete-tag=obiclean_singletoncount \
--delete-tag=obiclean_cluster --delete-tag=obiclean_internalcount \
--delete-tag=obiclean_head --delete-tag=obiclean_headcount \
--delete-tag=id_status --delete-tag=rank_by_db --delete-tag=obiclean_status \
--delete-tag=seq_length_ori --delete-tag=sminL --delete-tag=sminR \
--delete-tag=reverse_score --delete-tag=reverse_primer --delete-tag=reverse_match --delete-tag=reverse_tag \
--delete-tag=forward_tag --delete-tag=forward_score --delete-tag=forward_primer --delete-tag=forward_match \
--delete-tag=tail_quality --with-taxon-at-rank=class --delete-tag=order $all_sample_sequences_tag > $all_sample_sequences_ann
##The sequences can be sorted by decreasing order of count
all_sample_sequences_sort="${all_sample_sequences_ann/.fasta/.sort.fasta}"
obisort -k count -r $all_sample_sequences_ann > $all_sample_sequences_sort
##generate a final results table
obitab -o $all_sample_sequences_sort > $fin_dir/"$pref".csv
## define global variables
EDNA_PATH=/media/superdisk/edna
#pref_fastq="161124_SND393_A_L005_GWM-849"
pref_fastq=$1
#pref="all_rhone"
pref=$1
pref_bdr="std"
#R1_fastq="$EDNA_PATH"/donnees/rhone_all/"$pref_fastq"_R1.fastq.gz
#R2_fastq="$EDNA_PATH"/donnees/rhone_all/"$pref_fastq"_R2.fastq.gz
R1_fastq="$EDNA_PATH"/donnees/rhone_all/"$pref_fastq"_R1.fastq.gz
R2_fastq="$EDNA_PATH"/donnees/rhone_all/"$pref_fastq"_R2.fastq.gz
#sample_description_file=$EDNA_PATH/donnees/rhone_all/MB1016K_Teleo.dat
sample_description_file=$2
base_dir=$EDNA_PATH/donnees/basedereference
main_dir=$EDNA_PATH/working/only_obitools/rhone_all/main
main_dir2=$EDNA_PATH/working/only_obitools/rhone_all/main_spygen
fin_dir=$EDNA_PATH/working/only_obitools/rhone_all/final_spygen
all_samples_parallel_cmd_sh=$main_dir/"$pref"_sample_parallel_cmd.sh
##PARALLEL
#echo "" > $all_samples_parallel_cmd_sh
#for sample in `ls $main_dir/"$pref"_sample_*.fasta`;
#do
#sample_sh="${sample/.fasta/_cmd.sh}"
#echo "bash "$sample_sh >> $all_samples_parallel_cmd_sh
#dereplicated_sample="${sample/.fasta/.uniq.fasta}"
###keep only sequences longer than 20 bp, with no IUPAC ambiguity, and with total coverage greater than 10 reads
#good_sequence_sample="${dereplicated_sample/.fasta/.l20.fasta}"
###Clean the sequences for PCR/sequencing errors (sequence variants)
#r_sequence_sample="${good_sequence_sample/.fasta/.r005.fasta}"
###Remove sequence which are classified as 'internal' by obiclean
#clean_sequence_sample="${r_sequence_sample/.fasta/.clean.fasta}"
#done
#parallel < $all_samples_parallel_cmd_sh
all_sample_sequences_clean=$main_dir2/"$pref"_all_sample_clean.fasta
cat $main_dir/"$pref"_sample_*.uniq.l20.r005.clean.fasta > $all_sample_sequences_clean
##dereplicate and merge samples together
all_sample_sequences_uniq="${all_sample_sequences_clean/.fasta/.uniq.fasta}"
obiuniq -m sample $all_sample_sequences_clean > $all_sample_sequences_uniq
##Assign each sequence to a taxon
all_sample_sequences_tag="${all_sample_sequences_uniq/.fasta/.tag.fasta}"
ecotag -d "$base_dir"/embl_"$pref_bdr" -R "$EDNA_PATH"/donnees/rhone_all/teleo_V1_0_VM.fasta $all_sample_sequences_uniq > $all_sample_sequences_tag
##Some unneeded attributes can be removed at this stage
all_sample_sequences_ann="${all_sample_sequences_tag/.fasta/.ann.fasta}"
obiannotate --delete-tag=scientific_name_by_db --delete-tag=obiclean_samplecount \
--delete-tag=obiclean_count --delete-tag=obiclean_singletoncount \
--delete-tag=obiclean_cluster --delete-tag=obiclean_internalcount \
--delete-tag=obiclean_head --delete-tag=obiclean_headcount \
--delete-tag=id_status --delete-tag=obiclean_status \
--delete-tag=seq_length_ori --delete-tag=sminL --delete-tag=sminR \
--delete-tag=reverse_score --delete-tag=reverse_primer --delete-tag=reverse_match --delete-tag=reverse_tag \
--delete-tag=forward_tag --delete-tag=forward_score --delete-tag=forward_primer --delete-tag=forward_match \
--delete-tag=tail_quality $all_sample_sequences_tag > $all_sample_sequences_ann
##The sequences can be sorted by decreasing order of count
all_sample_sequences_sort="${all_sample_sequences_ann/.fasta/.sort.fasta}"
obisort -k count -r $all_sample_sequences_ann > $all_sample_sequences_sort
##generate a final results table
obitab -o $all_sample_sequences_sort > $fin_dir/"$pref".csv