Only_obitools pipeline using SNAKEMAKE
======================================
# Table of contents
1. [Introduction](#1-introduction)
2. [Installation](#2-installation)
3. [Reporting bugs](#3-reporting-bugs)
4. [Running the pipeline](#4-running-the-pipeline)
-----------------
# 1. Introduction
Here, we reproduce the bioinformatics pipeline used by [SPYGEN](http://www.spygen.com/) to infer species presence from raw eDNA data. This pipeline is based on [OBItools](https://git.metabarcoding.org/obitools/obitools/wikis/home), a set of Python programs designed to analyse Next Generation Sequencer (Illumina) outputs in the context of DNA metabarcoding.
# 2. Installation
In order to run "snakemake_only_obitools", you need a few programs, most of which should be available pre-compiled for your distribution. The programs and libraries you absolutely need are:
- [python3](https://www.python.org/download/releases/3.0/)
- [OBItools](https://pythonhosted.org/OBITools/welcome.html#installing-the-obitools)
- [snakemake](https://bitbucket.org/snakemake/snakemake)
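
A minimal installation sketch, assuming `pip` and `virtualenv` are available; exact package names and paths may differ on your system. Note that OBITools 1.x targets Python 2.7 while Snakemake requires Python 3, so the two are typically installed in separate environments:

```bash
# Snakemake runs under Python 3:
pip3 install --user snakemake

# OBITools 1.x targets Python 2.7; keep it in its own virtualenv:
virtualenv -p python2.7 obitools-env
source obitools-env/bin/activate
pip install OBITools
deactivate
```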
# 3. Reporting bugs
If you're sure you've found a bug, e.g. if one of the programs crashes
with an obscure error message, or if the resulting file is missing part
of the original data, then by all means submit a bug report.
I use [GitLab's issue system](https://gitlab.com/edna/only_obitools/issues)
as my bug database. You can submit your bug reports there. Please be as
verbose as possible, e.g. include the exact command line you ran, the error
message, and the versions of the tools involved.
# 4. Running the pipeline
## Quickstart

1. Open a shell.
2. Make a working folder and move into it (any name works; here we use `workdir`):

   ```bash
   mkdir workdir
   cd workdir
   ```

3. Clone the project and switch to its folder; this is your working directory:

   ```bash
   git clone http://gitlab.mbb.univ-montp2.fr/edna/snakemake_only_obitools.git
   cd snakemake_only_obitools
   ```
4. Define two folders:
   - a folder containing the reference database files. You can build a reference database by following the instructions [here](projet_builtdatabase);
   - a folder containing the paired-end raw reads `.fastq.gz` files and the sample description `.dat` files. Raw read files from the same pair must be named `*_R1.fastq.gz` and `*_R2.fastq.gz`, where the wildcard `*` is the name of the sequencing run. The alphanumeric order of the sample description `.dat` file names must match that of the paired-end `.fastq.gz` file names, because the pipeline pairs them by sorted order. A sample description file is a text file where each line describes one sample, with columns separated by space or tab characters; the format is documented [here](https://pythonhosted.org/OBITools/scripts/ngsfilter.html). A short example is shown after this list.
5. Run the pipeline:

   ```bash
   bash main.sh /path/to/fastq_dat_files /path/to/reference_database_folder 16
   ```

   The order of arguments matters: 1) the folder containing the paired-end raw reads and the sample description files; 2) the folder containing the reference database; 3) the number of cores to assign (here, for instance, 16 cores).
6. To run the pipeline step by step instead, open the file `main.sh` to see the details of each step (see also the dry-run sketch after `main.sh` below).

That's it! The pipeline is now running and crunching your data. Look into the `log` folder and the output folders once the pipeline has finished.
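For illustration, here is a minimal sample description (`.dat`) file in the `ngsfilter` format. The experiment name, sample names, tags, and primers below are made up and must be replaced with your own; see the OBItools documentation linked above for the authoritative format:

```
#exp	sample	tags	forward_primer	reverse_primer	extra_information
eDNA_run1	sample01	aattaac:aattaac	TTAGATACCCCACTATGC	TAGAACAGGCTCCTCTAG	F
eDNA_run1	sample02	gaagtag:gaagtag	TTAGATACCCCACTATGC	TAGAACAGGCTCCTCTAG	F
```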
The wrapper script `main.sh` links the input data into the working directory and runs the four Snakemake steps in order:

```bash
#!/bin/bash

# arguments
FOLDER_INPUT=$1      # folder with *_R1/_R2.fastq.gz and *.dat files
REFERENCE_BASE=$2    # folder with the reference database
CORES=$3             # number of cores for snakemake

# working folders
mkdir -p assembled barcodes raw log samples runs tables

# link the reference database
ln -s "$REFERENCE_BASE" bdr

# link raw sequence data
for i in "$FOLDER_INPUT"/*.fastq.gz; do
    name=$(basename "$i")
    ln -s "$i" "raw/$name"
done

# link barcode (sample description) data
for i in "$FOLDER_INPUT"/*.dat; do
    name=$(basename "$i")
    ln -s "$i" "barcodes/$name"
done

# run the pipeline, one snakemake step at a time
snakemake -s scripts/step1.sf -j "$CORES" --latency-wait 120
snakemake -s scripts/step2.sf -j "$CORES" --latency-wait 120
snakemake -s scripts/step3.sf -j "$CORES" --latency-wait 120
snakemake -s scripts/step4.sf -j "$CORES" --latency-wait 120
```
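
When running step by step, a Snakemake dry run is a cheap way to check what a step would do before executing it; `-n` (dry run) and `-p` (print shell commands) are standard Snakemake flags:

```bash
# preview the jobs of step 1 without running anything
snakemake -s scripts/step1.sf -n -p

# then execute it for real
snakemake -s scripts/step1.sf -j 16 --latency-wait 120
```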
`scripts/step1.sf` assembles the paired-end reads, removes unaligned records, assigns reads to samples, and splits them into per-sample files:

```python
#configfile: "config.yaml"

RUNS, = glob_wildcards('raw/{run}_R1.fastq.gz')
BARCODES, = glob_wildcards('barcodes/{barcode}.dat')

# Pair each run with its sample description (.dat) file by alphanumeric
# order: the .dat names must sort in the same order as the run names.
DICBARCODES = {}
for run, bc in zip(sorted(RUNS), sorted(BARCODES)):
    DICBARCODES[run] = "barcodes/" + bc + ".dat"

rule all:
    input:
        expand('assembled/{run}/{run}.fastq', run=RUNS),
        expand('assembled/{run}/{run}.ali.fastq', run=RUNS),
        expand('assembled/{run}/{run}.ali.assigned.fastq', run=RUNS),
        expand('assembled/{run}/{run}.unidentified.fastq', run=RUNS),
        expand('log/remove_unaligned/{run}.log', run=RUNS),
        expand('log/illuminapairedend/{run}.log', run=RUNS),
        expand('log/assign_sequences/{run}.log', run=RUNS),
        expand('log/split_sequences/{run}.log', run=RUNS)

### Paired-end alignment, keeping reads with an alignment score > 40
rule illuminapairedend:
    input:
        R1='raw/{run}_R1.fastq.gz',
        R2='raw/{run}_R2.fastq.gz'
    output:
        fq='assembled/{run}/{run}.fastq'
    log:
        'log/illuminapairedend/{run}.log'
    shell:
        '''illuminapairedend -r {input.R2} {input.R1} --score-min=40 > {output.fq} 2> {log}'''

### Remove unaligned sequence records
rule remove_unaligned:
    input:
        fq='assembled/{run}/{run}.fastq'
    output:
        ali='assembled/{run}/{run}.ali.fastq'
    log:
        'log/remove_unaligned/{run}.log'
    shell:
        '''obigrep -p 'mode!="joined"' {input.fq} > {output.ali} 2> {log}'''

### Assign each sequence record to the corresponding sample/marker combination
rule assign_sequences:
    input:
        'assembled/{run}/{run}.ali.fastq',
        lambda wildcards: DICBARCODES[wildcards.run]
    output:
        assign='assembled/{run}/{run}.ali.assigned.fastq',
        unid='assembled/{run}/{run}.unidentified.fastq'
    log:
        'log/assign_sequences/{run}.log'
    shell:
        '''ngsfilter -t {input[1]} -u {output.unid} {input[0]} --fasta-output > {output.assign} 2> {log}'''

### Split the input sequence file into one subfile per value of the
### `sample` attribute (no output section: the per-sample file names
### are not known in advance)
rule split_sequences:
    input:
        'assembled/{run}/{run}.ali.assigned.fastq'
    params:
        'samples/{run}_sample_'
    log:
        'log/split_sequences/{run}.log'
    shell:
        '''obisplit -p "{params}" -t sample --fasta {input} 2> {log}'''
```
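
Because the run/`.dat` pairing above relies purely on sorted name order, a quick sanity check before launching the pipeline can catch a misassigned run. A minimal sketch in plain Python, assuming the `raw/` and `barcodes/` layout created by `main.sh`:

```python
import glob
import os

# sorted run names derived from raw/<run>_R1.fastq.gz
runs = sorted(os.path.basename(f)[:-len('_R1.fastq.gz')]
              for f in glob.glob('raw/*_R1.fastq.gz'))
# sorted sample description files
dats = sorted(glob.glob('barcodes/*.dat'))

if len(runs) != len(dats):
    raise SystemExit('mismatch: %d runs but %d .dat files' % (len(runs), len(dats)))

# print the pairing so it can be eyeballed before running snakemake
for run, dat in zip(runs, dats):
    print(run, '->', dat)
```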
`scripts/step2.sf` dereplicates and filters the per-sample files produced by step 1:

```python
SAMPLES, = glob_wildcards('samples/{sample}.fasta')

rule all:
    input:
        expand('samples/{sample}.uniq.fasta', sample=SAMPLES),
        expand('samples/{sample}.l.u.fasta', sample=SAMPLES),
        expand('samples/{sample}.r.l.u.fasta', sample=SAMPLES),
        expand('samples/{sample}.c.r.l.u.fasta', sample=SAMPLES),
        expand('log/dereplicate_samples/{sample}.log', sample=SAMPLES),
        expand('log/goodlength_samples/{sample}.log', sample=SAMPLES),
        expand('log/clean_pcrerr/{sample}.log', sample=SAMPLES),
        expand('log/rm_internal_samples/{sample}.log', sample=SAMPLES)

### Dereplicate reads into unique sequences
rule dereplicate_samples:
    input:
        'samples/{sample}.fasta'
    output:
        'samples/{sample}.uniq.fasta'
    log:
        'log/dereplicate_samples/{sample}.log'
    shell:
        '''obiuniq -m sample {input} > {output} 2> {log}'''

### Keep only sequences longer than 20 bp, with no IUPAC ambiguity codes
### and a total coverage greater than 10 reads
rule goodlength_samples:
    input:
        'samples/{sample}.uniq.fasta'
    output:
        'samples/{sample}.l.u.fasta'
    log:
        'log/goodlength_samples/{sample}.log'
    shell:
        '''obigrep -p 'count>10' -s '^[ACGT]+$' -p 'seq_length>20' {input} > {output} 2> {log}'''

### Clean the sequences of PCR/sequencing errors (sequence variants)
rule clean_pcrerr_samples:
    input:
        'samples/{sample}.l.u.fasta'
    output:
        'samples/{sample}.r.l.u.fasta'
    log:
        'log/clean_pcrerr/{sample}.log'
    shell:
        '''obiclean -r 0.05 {input} > {output} 2> {log}'''

### Remove sequences classified as 'internal' (likely errors) by obiclean
rule rm_internal_samples:
    input:
        'samples/{sample}.r.l.u.fasta'
    output:
        'samples/{sample}.c.r.l.u.fasta'
    log:
        'log/rm_internal_samples/{sample}.log'
    shell:
        '''obigrep -p 'obiclean_internalcount == 0' {input} > {output} 2> {log}'''
```
`scripts/step3.sf` concatenates the cleaned per-sample files of each run:

```python
RUNS, = glob_wildcards('raw/{run}_R1.fastq.gz')

rule all:
    input:
        expand('runs/{run}_run.fasta', run=RUNS)

### Concatenate the cleaned sequences from all samples of the same run.
### The glob in params is expanded by the shell at run time.
rule cat_samples:
    params:
        'samples/{run}*.c.r.l.u.fasta'
    output:
        'runs/{run}_run.fasta'
    shell:
        '''cat {params} > {output}'''
```
`scripts/step4.sf` dereplicates each run, assigns taxa against the reference database, and writes the final tables:

```python
RUNS, = glob_wildcards('raw/{run}_R1.fastq.gz')

rule all:
    input:
        expand('runs/{run}_run.fasta', run=RUNS),
        expand('runs/{run}_run.uniq.fasta', run=RUNS),
        expand('runs/{run}_run.tag.u.fasta', run=RUNS),
        expand('runs/{run}_run.a.t.u.fasta', run=RUNS),
        expand('runs/{run}_run.s.a.t.u.fasta', run=RUNS),
        expand('tables/{run}.csv', run=RUNS),
        expand('log/dereplicate_runs/{run}.log', run=RUNS),
        expand('log/assign_taxon/{run}.log', run=RUNS),
        expand('log/rm_attributes/{run}.log', run=RUNS),
        expand('log/sort_runs/{run}.log', run=RUNS),
        expand('log/table_runs/{run}.log', run=RUNS)

### Dereplicate and merge samples together
rule dereplicate_runs:
    input:
        'runs/{run}_run.fasta'
    output:
        'runs/{run}_run.uniq.fasta'
    log:
        'log/dereplicate_runs/{run}.log'
    shell:
        '''obiuniq -m sample {input} > {output} 2> {log}'''

### Assign each sequence to a taxon using the reference database linked in bdr/
rule assign_taxon:
    input:
        'runs/{run}_run.uniq.fasta'
    output:
        'runs/{run}_run.tag.u.fasta'
    params:
        bdr='bdr/embl_std',
        fasta='bdr/db_std.fasta'
    log:
        'log/assign_taxon/{run}.log'
    shell:
        '''ecotag -d {params.bdr} -R {params.fasta} {input} > {output} 2> {log}'''

### Remove attributes that are no longer useful at this stage
rule rm_attributes:
    input:
        'runs/{run}_run.tag.u.fasta'
    output:
        'runs/{run}_run.a.t.u.fasta'
    log:
        'log/rm_attributes/{run}.log'
    shell:
        '''obiannotate --delete-tag=scientific_name_by_db --delete-tag=obiclean_samplecount \
        --delete-tag=obiclean_count --delete-tag=obiclean_singletoncount \
        --delete-tag=obiclean_cluster --delete-tag=obiclean_internalcount \
        --delete-tag=obiclean_head --delete-tag=obiclean_headcount \
        --delete-tag=id_status --delete-tag=rank_by_db --delete-tag=obiclean_status \
        --delete-tag=seq_length_ori --delete-tag=sminL --delete-tag=sminR \
        --delete-tag=reverse_score --delete-tag=reverse_primer --delete-tag=reverse_match --delete-tag=reverse_tag \
        --delete-tag=forward_tag --delete-tag=forward_score --delete-tag=forward_primer --delete-tag=forward_match \
        --delete-tag=tail_quality {input} > {output} 2> {log}'''

### Sort the sequences by decreasing count
rule sort_runs:
    input:
        'runs/{run}_run.a.t.u.fasta'
    output:
        'runs/{run}_run.s.a.t.u.fasta'
    log:
        'log/sort_runs/{run}.log'
    shell:
        '''obisort -k count -r {input} > {output} 2> {log}'''

### Generate the final results table
rule table_runs:
    input:
        'runs/{run}_run.s.a.t.u.fasta'
    output:
        'tables/{run}.csv'
    log:
        'log/table_runs/{run}.log'
    shell:
        '''obitab -o {input} > {output} 2> {log}'''
```
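
For downstream analysis, the final tables can be loaded with pandas. A minimal sketch, assuming (as is typical for `obitab`) that the output is tab-delimited despite the `.csv` extension, and using a hypothetical run name `run1`:

```python
import pandas as pd

# obitab writes tab-separated columns: one row per unique sequence,
# with per-sample counts and the taxonomic assignment from ecotag
table = pd.read_csv('tables/run1.csv', sep='\t')

print(table.shape)
print(table.columns.tolist()[:10])  # peek at the first few column names
```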