Commit e6913fdf authored by peguerin

add step3.nf

parent 6897e7a1
@@ -6,7 +6,7 @@ Only_obitools pipeline using NEXTFLOW
1. [Introduction](#1-introduction)
2. [Installation](#2-installation)
1. [Requirements](#21-requirements)
2. [Initialisation](#22-initialisation)
3. [Reporting bugs](#3-reporting-bugs)
4. [Running the pipeline](#4-running-the-pipeline)
@@ -16,6 +16,9 @@ Only_obitools pipeline using NEXTFLOW
Here, we reproduce the bioinformatics pipeline used by [SPYGEN](http://www.spygen.com/) to generate species environmental presence data from raw eDNA data. This pipeline is based on [OBItools](https://git.metabarcoding.org/obitools/obitools/wikis/home), a set of python programs designed to analyse Next Generation Sequencer outputs (Illumina) in the context of DNA metabarcoding.
This pipeline uses the workflow management system [nextflow](https://www.nextflow.io/), so you will need to install it. If you prefer not to use a workflow management system, a bash-only version is available [here](http://gitlab.mbb.univ-montp2.fr/edna/only_obitools).
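Nextflow can be installed with the one-line installer from its documentation (the same command appears again in the run instructions below); it creates a `nextflow` executable in the current folder:
```
curl -fsSL get.nextflow.io | bash
```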
# 2. Installation
## 2.1. Requirements
@@ -26,8 +29,12 @@ programs and libraries you absolutely need are:
- [OBItools](https://pythonhosted.org/OBITools/welcome.html#installing-the-obitools)
- [Java 8 (or later)](https://www.nextflow.io/docs/latest/getstarted.html)
In addition, you will need a reference database for taxonomic assignment. You can build a reference database by following the instructions [here](http://gitlab.mbb.univ-montp2.fr/edna/reference_database).
## 2.2. Initialisation
* open a shell
* make a folder; you can name it anything, here it is called `workdir`
@@ -40,7 +47,9 @@ cd workdir
git clone http://gitlab.mbb.univ-montp2.fr/edna/nextflow_obitools.git
cd nextflow_obitools
```
* define 2 external folders:
 - a folder containing the reference database files. You can build a reference database by following the instructions [here](http://gitlab.mbb.univ-montp2.fr/edna/reference_database).
 - a folder containing the paired-end raw read `.fastq.gz` files and the sample description `.dat` files. Raw read files from the same pair must be named `*_R1.fastq.gz` and `*_R2.fastq.gz`, where the wildcard `*` is the name of the sequencing run. The alphanumeric order of the sample description `.dat` file names must match that of the paired-end raw read `.fastq.gz` files. A sample description file is a text file in which each line describes one sample, with columns separated by space or tab characters; the format is described [here](https://pythonhosted.org/OBITools/scripts/ngsfilter.html) and sketched in the example below.
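A minimal sketch of such a data folder, assuming a single sequencing run named `run01`; the file names, tags, primers and the `site` attribute below are purely illustrative, your own sequencing design provides the real values:
```
# hypothetical data folder layout
# /media/data/run01_fastq_dat/
#   run01_R1.fastq.gz
#   run01_R2.fastq.gz
#   run01.dat
# sample description (.dat) file for ngsfilter, one line per sample
# columns: experiment  sample  tag(s)  forward_primer  reverse_primer  F @ optional key=value attributes
cat > run01.dat <<'EOF'
run01  sample01  aattaac:gaagtag  TTAGATACCCCACTATGC  TAGAACAGGCTCCTCTAG  F @ site=upstream
run01  sample02  gcctaat:cgtcgta  TTAGATACCCCACTATGC  TAGAACAGGCTCCTCTAG  F @ site=downstream
EOF
```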
# 3. Reporting bugs
@@ -67,9 +76,28 @@ curl -fsSL get.nextflow.io | bash
5. make sure that the programs listed in the Requirements section above are installed on your machine. Once nextflow is downloaded, replace the placeholder paths in the following commands with your own paths
6. run your command
Demultiplexing and filtering of the eDNA metabarcoding raw data
```
./nextflow run scripts/step1.nf --datafolder 'path/to/fastq/and/dat/files'
```
Outputs are stored in the newly created `work/` folder.
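For example, assuming the raw reads and `.dat` file for a run called `run01` live in `/media/data/run01_fastq_dat` (the hypothetical folder sketched above), step 1 would be launched as:
```
./nextflow run scripts/step1.nf --datafolder /media/data/run01_fastq_dat
```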
Concatenating samples by run id
```
bash scripts/step2.sh
```
Cleaned sequences for each run are stored in the newly created `runs/` folder.
Taxonomic assignment and generation of a species/sample matrix for each run
```
./nextflow run scripts/step3.nf --db_ref /path/to/reference/database/and/prefix --db_fasta /path/to/reference/database/fasta/file
```
To build your own reference database, see the details [here](http://gitlab.mbb.univ-montp2.fr/edna/reference_database).
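For example, if the reference database sits in a folder such as `/media/data/refdb` (hypothetical path) with database files sharing the prefix `embl_std` and a fasta file `db_embl_std.fasta` (the default names hard-coded in `step3.nf`), the call would look like:
```
./nextflow run scripts/step3.nf --db_ref /media/data/refdb/embl_std --db_fasta /media/data/refdb/db_embl_std.fasta
```
`--db_ref` is the common prefix of the formatted reference database files passed to `ecotag -d`, while `--db_fasta` is the reference sequence fasta passed to `ecotag -R`.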
Alternatively, you can run the whole pipeline with a single command:
```
bash main.sh path/to/fastq/and/dat/files /path/to/reference/database/and/prefix /path/to/reference/database/fasta/file
```
That's it! The pipeline is now running and crunching your data. Intermediate files are kept in the `work/` and `runs/` folders, and the final species/sample tables are written to the `tables/` folder once the pipeline has finished.
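With the same hypothetical paths as in the examples above, the single-command run would look like:
```
bash main.sh /media/data/run01_fastq_dat /media/data/refdb/embl_std /media/data/refdb/db_embl_std.fasta
```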
#!/usr/bin/env bash
## main.sh: run the whole pipeline in one go
## arguments: data folder, reference database prefix, reference database fasta file
DATA_FOLDER=$1
REFERENCE_DATABASE=$2
REFERENCE_DATABASE_FASTA=$3
## main pipeline
## demultiplexing and filtering of metabarcoding raw data
./nextflow run scripts/step1.nf --datafolder ${DATA_FOLDER}
## concatenate cleaned sample sequences by run id
bash scripts/step2.sh
## taxonomic assignment and generating matrix species/sample for each run
./nextflow run scripts/step3.nf --db_ref ${REFERENCE_DATABASE} --db_fasta ${REFERENCE_DATABASE_FASTA}
## done check work/ and runs/ folders for intermediate files and tables/ folder for final results
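// step1.nf: demultiplexing and filtering of the eDNA metabarcoding raw data (README step 1)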
params.datafolder="/media/superdisk/edna/donnees/rhone_test"
params.count=10
params.seq_length=20
params.obiclean_r=0.05
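// pair the *_R1/*_R2 fastq.gz files of each sequencing run and pick up the matching .dat sample description files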
sequences= Channel.fromFilePairs(params.datafolder+"/*_R{1,2}.fastq.gz",flat:true)
barcodes=Channel.fromPath(params.datafolder+"/*.dat")
process illuminapairedend {
"""
[t=2h] paired-end alignment, then keep reads with quality > 40
@@ -94,24 +94,21 @@ process seq_count_filter {
"""
}
process annotate_pcrerr {
"""
Clean the sequences for PCR/sequencing errors (sequence variants)
"""
input:
set val(sample), file("${sample}.u.filtered.fa") from goodlength_goodcounts
set val(sample), file("${sample}.u.filtered.fa") from goodlength_goodcounts.filter { sample, file -> file.size()>0 }
output:
set val(sample), file("${sample}.u.f.pcr_annotated.fa") into pcrerr_annotateds
script:
if (!file("${sample}.u.filtered.fa").isEmpty()){
script:
"""
obiclean -r ${params.obiclean_r} ${sample}.u.filtered.fa > ${sample}.u.f.pcr_annotated.fa
"""
}
}
process remove_internal {
"""
Remove sequence which are classified as 'internal' by obiclean
"""
@@ -120,7 +117,7 @@ process remove_internal {
output:
set val(sample), file("${sample}.u.f.p.cleaned.fa") into cleaned_samples
script:
"""
"""
obigrep -p 'obiclean_internalcount == 0' ${sample}.u.f.pcr_annotated.fa > ${sample}.u.f.p.cleaned.fa
"""
}
process cat_samples {
"""
Concatenate sequences from each sample of the same run
"""
input:
set val(run) from sequences
set val(sample), file("${sample}.u.f.p.cleaned.fa") from cleaned_samples
output:
set val(run), file("${run}.fasta") into fastaruns
script:
"""
cat sample_${run}_*.u.f.p.cleaned.fa > ${run}.fasta
"""
}
// WORK IN PROGRESS! (the per-run concatenation is currently done by scripts/step2.sh)
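#!/usr/bin/env bash
## step2.sh: concatenate the cleaned sample sequences of each sequencing run (README step 2)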
## collect list of run id
for f in work/*/*/*.u.f.p.cleaned.fa; do
    basename $f | cut -d "." -f 1
done | sort | uniq > run.list
## concatenate sample files from the same run id
mkdir runs
cat run.list | while read RUN ;
do
cat work/*/*/${RUN}.u.f.p.cleaned.fa > runs/${RUN}.fasta
done
rm run.list
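// step3.nf: taxonomic assignment and generation of the species/sample table for each run (README step 3)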
params.db_ref="/media/superdisk/edna/donnees/basedereference/embl_std"
params.db_fasta="/media/superdisk/edna/donnees/basedereference/db_embl_std.fasta"
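// one (run_id, fasta) tuple per run-level fasta produced by step 2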
fastaruns=Channel.fromPath("runs/*.fasta").map { file -> tuple(file.baseName, file) }
process dereplicate_runs {
"""
Dereplicate and merge samples together
"""
publishDir 'runs'
input:
set RUN_ID, file(fastarun) from fastaruns
output:
set RUN_ID, file("${RUN_ID}.uniq.fa") into fastaRunUniqs
script:
"""
obiuniq -m sample ${fastarun} > ${RUN_ID}.uniq.fa
"""
}
process assign_taxon {
"""
Assign each sequence to a taxon
"""
publishDir 'runs'
input:
set RUN_ID, file("${RUN_ID}.uniq.fa") from fastaRunUniqs
output:
set RUN_ID, file("${RUN_ID}.u.tag.fa") into assigneds
script:
"""
ecotag -d ${params.db_ref} -R ${params.db_fasta} ${RUN_ID}.uniq.fa > ${RUN_ID}.u.tag.fa
"""
}
process rm_attributes {
"""
Some attributes that are no longer needed can be removed at this stage
"""
publishDir 'runs'
input:
set RUN_ID, file("${RUN_ID}.u.tag.fa") from assigneds
output:
set RUN_ID, file("${RUN_ID}.u.t.lessattributes.fa") into lessattributes
script:
"""
obiannotate --delete-tag=scientific_name_by_db --delete-tag=obiclean_samplecount \
--delete-tag=obiclean_count --delete-tag=obiclean_singletoncount \
--delete-tag=obiclean_cluster --delete-tag=obiclean_internalcount \
--delete-tag=obiclean_head --delete-tag=obiclean_headcount \
--delete-tag=id_status --delete-tag=rank_by_db --delete-tag=obiclean_status \
--delete-tag=seq_length_ori --delete-tag=sminL --delete-tag=sminR \
--delete-tag=reverse_score --delete-tag=reverse_primer --delete-tag=reverse_match --delete-tag=reverse_tag \
--delete-tag=forward_tag --delete-tag=forward_score --delete-tag=forward_primer --delete-tag=forward_match \
--delete-tag=tail_quality ${RUN_ID}.u.tag.fa > ${RUN_ID}.u.t.lessattributes.fa
"""
}
process sort_runs {
"""
The sequences can be sorted by decreasing order of count
"""
publishDir 'runs'
input:
set RUN_ID, file("${RUN_ID}.u.t.lessattributes.fa") from lessattributes
output:
set RUN_ID, file("${RUN_ID}.u.t.l.sorted.fa") into sorteds
script:
"""
obisort -k count -r ${RUN_ID}.u.t.lessattributes.fa > ${RUN_ID}.u.t.l.sorted.fa
"""
}
process table_runs {
"""
Generate the final results table
"""
publishDir 'tables'
input:
set RUN_ID, file("${RUN_ID}.u.t.l.sorted.fa") from sorteds
output:
set RUN_ID, file("${RUN_ID}.csv") into runtables
script:
"""
obitab -o ${RUN_ID}.u.t.l.sorted.fa > ${RUN_ID}.csv
"""
}