Commit 964efc18 authored by peguerin

integrate config

parent 397ab4ee
......@@ -7,6 +7,7 @@ Only_obitools pipeline using SNAKEMAKE
2. [Installation](#2-installation)
3. [Reporting bugs](#3-reporting-bugs)
4. [Running the pipeline](#4-running-the-pipeline)
5. [Results](#5-results)
-----------------
......@@ -38,9 +39,10 @@ I use [GitLab's issue system](https://gitlab.com/edna/only_obitools/issues)
as my bug database. You can submit your bug reports there. Please be as
verbose as possible, e.g. include the command line.
# 4. Running the pipeline
# 4. The pipeline
## 4.1 Initialisation
Quickstart
* open a shell
* create a working folder and name it as you like (I named mine `workdir`)
......@@ -55,10 +57,66 @@ cd workdir
git clone http://gitlab.mbb.univ-montp2.fr/edna/snakemake_only_obitools.git
cd snakemake_only_obitools
```
* define 2 folders :
  - a folder which contains the reference database files. You can build a reference database by following the instructions [here](projet_builtdatabase).
  - a folder which contains the paired-end raw reads `.fastq.gz` files and the sample description `.dat` files. Raw read files from the same pair must be named `*_R1.fastq.gz` and `*_R2.fastq.gz`, where the wildcard `*` is the name of the sequencing run. The alphanumeric order of the names of the sample description `.dat` files must be the same as that of the paired-end raw reads `.fastq.gz` files. The sample description file is a text file where each line describes one sample. Columns are separated by space or tab characters. The sample description file format is described [here](https://pythonhosted.org/OBITools/scripts/ngsfilter.html).
* run the pipeline :
* define 2 folders in the current directory :
  - a folder `bdr` which contains the reference database files. You can build a reference database by following the instructions [here](projet_builtdatabase).
  - a folder `raw` which contains the paired-end raw reads `.fastq.gz` files and the sample description `.dat` files. Raw read files from the same pair must be named `*_R1.fastq.gz` and `*_R2.fastq.gz`, where the wildcard `*` is the name of the sequencing run. The alphanumeric order of the names of the sample description `.dat` files must be the same as that of the paired-end raw reads `.fastq.gz` files. The sample description file is a text file where each line describes one sample. Columns are separated by space or tab characters. The sample description file format is described [here](https://pythonhosted.org/OBITools/scripts/ngsfilter.html).
* Overview of the steps
  0. Configuration
  1. Merge Illumina paired-end sequences by pair
  2. Assign each merged sequence to the corresponding sample
  3. Dereplicate sequences
  4. Filter unique sequences according to their quality and abundance
  5. Remove singletons and PCR errors
  6. Assign each sequence to a species
  7. Write a species/sample matrix
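
The naming rules for the `raw` folder can be checked before launching anything. The following is a minimal sketch, not part of the pipeline: the function name `pair_runs_with_descriptions` and the example file names are hypothetical, and it only assumes the conventions stated above (`*_R1.fastq.gz` / `*_R2.fastq.gz` pairs, `.dat` files matched to runs by alphanumeric order).

```python
from pathlib import PurePosixPath

R1_SUFFIX = "_R1.fastq.gz"
R2_SUFFIX = "_R2.fastq.gz"


def pair_runs_with_descriptions(filenames):
    """Map each run name to (R1 file, R2 file, .dat file).

    Hypothetical helper: runs and .dat files are matched by sorted order,
    which is the convention the README asks for.
    """
    names = {PurePosixPath(f).name for f in filenames}
    # A run name is whatever precedes the _R1 suffix.
    runs = sorted(n[: -len(R1_SUFFIX)] for n in names if n.endswith(R1_SUFFIX))
    dats = sorted(n for n in names if n.endswith(".dat"))

    # Every run needs its reverse-read file.
    missing = [run for run in runs if run + R2_SUFFIX not in names]
    if missing:
        raise ValueError(f"runs without an R2 file: {missing}")
    # One sample description file per run.
    if len(runs) != len(dats):
        raise ValueError(f"{len(runs)} run(s) but {len(dats)} .dat file(s)")

    return {
        run: (run + R1_SUFFIX, run + R2_SUFFIX, dat)
        for run, dat in zip(runs, dats)
    }


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    example = [
        "raw/runA_R1.fastq.gz", "raw/runA_R2.fastq.gz", "raw/runA.dat",
        "raw/runB_R1.fastq.gz", "raw/runB_R2.fastq.gz", "raw/runB.dat",
    ]
    for run, files in pair_runs_with_descriptions(example).items():
        print(run, files)
```

Working on a list of names rather than on the file system keeps the check easy to try out; in practice the list could come from `os.listdir("raw")`.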
## 4.2 Configuration
Parameters for each program are stored in the file [config.yaml](config.yaml).
Before running the pipeline, set your parameters by editing [config.yaml](config.yaml).
```yaml
illuminapairedend:
    s_min : 40
good_length_samples:
    count : 10
    seq_length : 20
clean_pcrerr_samples:
    r : 0.05
assign_taxon:
    bdr : bdr/embl_std
    fasta : bdr/db_std.fasta
```
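
Snakemake exposes the parsed file as a nested dictionary, and the rules in this commit read values as `config["section"]["key"]`. The sketch below mirrors that mapping layout with a plain Python dict and reports any expected key that is missing before a run. `EXPECTED_KEYS` and `check_config` are hypothetical names introduced for illustration; they are not part of the pipeline.

```python
# Sections and keys read by the Snakefiles in this commit.
EXPECTED_KEYS = {
    "illuminapairedend": ["s_min"],
    "good_length_samples": ["count", "seq_length"],
    "clean_pcrerr_samples": ["r"],
    "assign_taxon": ["bdr", "fasta"],
}


def check_config(config):
    """Return the 'section.key' entries that are missing from config."""
    missing = []
    for section, keys in EXPECTED_KEYS.items():
        values = config.get(section)
        if not isinstance(values, dict):
            # Section absent, empty, or written as a YAML list ("- key : value"),
            # in which case config[section][key] would fail.
            missing.extend(f"{section}.{key}" for key in keys)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in values)
    return missing


# Same content as config.yaml, in the form Snakemake hands to the rules.
config = {
    "illuminapairedend": {"s_min": 40},
    "good_length_samples": {"count": 10, "seq_length": 20},
    "clean_pcrerr_samples": {"r": 0.05},
    "assign_taxon": {"bdr": "bdr/embl_std", "fasta": "bdr/db_std.fasta"},
}

if __name__ == "__main__":
    print(check_config(config))
```

An empty list means every lookup used by the rules will succeed; a misspelled section name shows up as all of its keys being reported missing.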
* `s_min : 40` : minimum alignment score for keeping an alignment. If the alignment score is below this threshold, the two reads are simply concatenated and the `mode` attribute is set to `joined`.
  - software : `illuminapairedend`
  - step : merge Illumina paired-end sequences by pair
  - we set this value to 40
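* `count : 10` : minimum number of copies for keeping a sequence. The pipeline passes it to `obigrep` as `count>10`, so a sequence is kept only if its count is strictly greater than this value.
  - software : `obigrep`
  - step : filter unique sequences according to their quality and abundance
  - we set this value to 10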
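* `seq_length : 20` : minimum length for keeping a sequence. The pipeline passes it to `obigrep` as `seq_length>20`, so a sequence is kept only if it is strictly longer than this value.
  - software : `obigrep`
  - step : filter unique sequences according to their quality and abundance
  - we set this value to 20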
* `r : 0.05` : threshold ratio between the counts of two sequence records (rare count / abundant count) below which the less abundant record is considered a variant of the more abundant one.
  - software : `obiclean`
  - step : remove singletons and PCR errors
  - we set this value to 0.05
* `bdr : bdr/embl_std` : relative path to the reference database files in the folder `bdr`, including the file-name prefix of the database (for instance `embl_something`).
  - software : `ecotag`
  - step : assign each sequence to a species
* `fasta : bdr/db_std.fasta` : relative path to the FASTA file of the reference database.
  - software : `ecotag`
  - step : assign each sequence to a species
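
To make the two `obigrep` thresholds concrete, here is a minimal Python sketch of the three predicates the pipeline passes to `obigrep` (`count>10`, `seq_length>20` and the sequence pattern `^[ACGT]+$`). It is an illustration only, not how OBITools is implemented: `keep_sequence` is a hypothetical name, and the case-insensitive match is an assumption based on my reading of the `obigrep -s` documentation.

```python
import re

# Assumption: obigrep's -s pattern is matched case-insensitively.
ACGT_ONLY = re.compile(r"^[ACGT]+$", re.IGNORECASE)


def keep_sequence(sequence, count, min_count=10, min_length=20):
    """Return True if a unique sequence passes all three filters.

    min_count and min_length play the role of `count` and `seq_length`
    in config.yaml; both comparisons are strict, as in the obigrep call.
    """
    return (
        count > min_count
        and len(sequence) > min_length
        and ACGT_ONLY.match(sequence) is not None
    )


if __name__ == "__main__":
    # Hypothetical records: (sequence, number of copies).
    records = [("ACGT" * 6, 11), ("ACGT" * 5, 50), ("ACGN" * 6, 50)]
    for sequence, count in records:
        print(len(sequence), count, keep_sequence(sequence, count))
```

Because both comparisons are strict, a sequence seen exactly 10 times, or exactly 20 bases long, is discarded with the values set above.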
## 4.3 Run the pipeline with a single command
```
bash main.sh /path/to/fastq_dat_files /path/to/reference_database_folder 16
......@@ -70,3 +128,16 @@ open the file `main.sh` to see details
That's it! The pipeline is running and crunching your data. Look for the log folder and the output folders once the pipeline has finished.
## 4.4 Run the pipeline step by step
# 5. Results
* `bdr`
* `runs`
* `raw`
* `samples`
* `tables`
* `work`
* `assembled`
WORK IN PROGRESS
illuminapairedend:
    s_min : 40
good_length_samples:
    count : 10
    seq_length : 20
clean_pcrerr_samples:
    r : 0.05
assign_taxon:
    bdr : bdr/embl_std
    fasta : bdr/db_std.fasta
#configfile: "config.yaml"
configfile: "config.yaml"
RUNS, = glob_wildcards('raw/{run}_R1.fastq.gz')
BARCODES, = glob_wildcards('barcodes/{barcode}.dat')
#print(BARCODES)
#print(RUNS)
DICBARCODES={}
i=0
for bc in BARCODES:
......@@ -29,8 +27,10 @@ rule illuminapairedend:
fq='assembled/{run}/{run}.fastq'
log:
'log/illuminapairedend/{run}.log'
params:
s_min=config["illuminapairedend"]["s_min"]
shell:
'''illuminapairedend -r {input.R2} {input.R1} --score-min=40 > {output.fq} 2> {log}'''
'''illuminapairedend -r {input.R2} {input.R1} --score-min={params.s_min} > {output.fq} 2> {log}'''
### Remove unaligned sequence records
rule remove_unaligned:
......
configfile: "config.yaml"
SAMPLES, = glob_wildcards('samples/{sample}.fasta')
rule all:
input:
expand('samples/{sample}.uniq.fasta',sample=SAMPLES),
......@@ -31,8 +31,11 @@ rule goodlength_samples:
'samples/{sample}.l.u.fasta'
log:
'log/goodlength_samples/{sample}.log'
params:
count=config["good_length_samples"]["count"],
seq_length=config["good_length_samples"]["seq_length"]
shell:
'''obigrep -p 'count>10' -s '^[ACGT]+$' -p 'seq_length>20' {input} > {output} 2> {log}'''
'''obigrep -p 'count>{params.count}' -s '^[ACGT]+$' -p 'seq_length>{params.seq_length}' {input} > {output} 2> {log}'''
### Clean the sequences for PCR/sequencing errors (sequence variants)
rule clean_pcrerr_samples:
......@@ -42,8 +45,10 @@ rule clean_pcrerr_samples:
'samples/{sample}.r.l.u.fasta'
log:
'log/clean_pcrerr/{sample}.log'
params:
r=config["clean_pcrerr_samples"]["r"]
shell:
'''obiclean -r 0.05 {input} > {output} 2> {log}'''
'''obiclean -r {params.r} {input} > {output} 2> {log}'''
### Remove sequence which are classified as 'internal' by obiclean
rule rm_internal_samples:
......
configfile: "config.yaml"
RUNS, = glob_wildcards('raw/{run}_R1.fastq.gz')
rule all:
......@@ -32,8 +33,8 @@ rule assign_taxon:
output:
'runs/{run}_run.tag.u.fasta'
params:
bdr='bdr/embl_std',
fasta='bdr/db_std.fasta'
bdr=config["assign_taxon"]["bdr"],
fasta=config["assign_taxon"]["fasta"]
log:
'log/assign_taxon/{run}.log'
shell:
......