Only_obitools pipeline using SNAKEMAKE
======================================

# Table of contents

1. [Introduction](#1-introduction)
2. [Installation](#2-installation)
3. [Reporting bugs](#3-reporting-bugs)
4. [Running the pipeline](#4-running-the-pipeline)
5. [Results](#5-results)

-----------------

# 1. Introduction

Here, we reproduce the bioinformatics pipeline used by [SPYGEN](http://www.spygen.com/) to infer species presence from raw eDNA data. This pipeline is based on [OBItools](https://git.metabarcoding.org/obitools/obitools/wikis/home), a set of Python programs designed to analyse Next Generation Sequencing (Illumina) outputs in the context of DNA metabarcoding.


# 2. Installation

To run "snakemake_only_obitools", you need a couple of programs, most of which should be available pre-compiled for your distribution. The programs and libraries you absolutely need are:

- [python3](https://www.python.org/download/releases/3.0/)

- [OBItools](https://pythonhosted.org/OBITools/welcome.html#installing-the-obitools)

- [snakemake](https://bitbucket.org/snakemake/snakemake)


# 3. Reporting bugs

If you're sure you've found a bug (e.g. one of the programs crashes with an obscure error message, or the resulting file is missing part of the original data), then by all means submit a bug report.

I use [GitLab's issue system](https://gitlab.com/edna/only_obitools/issues)
as my bug database. You can submit your bug reports there. Please be as
verbose as possible: include the exact command line, the error message, etc.

# 4. Running the pipeline

## 4.1 Initialisation


* open a shell
* make a folder and name it as you like; here it is called `workdir`:
```
mkdir workdir
cd workdir
```
* clone the project and move into its main folder; this is your working directory:
```
git clone http://gitlab.mbb.univ-montp2.fr/edna/snakemake_only_obitools.git
cd snakemake_only_obitools
```
* define 2 folders in the current directory:
    - a folder `bdr` which contains the reference database files. You can build a reference database by following the instructions [here](projet_builtdatabase).
    - a folder `raw` which contains the paired-end raw-read `.fastq.gz` files and the sample description `.dat` files. Raw-read files from the same pair must be named `*_R1.fastq.gz` and `*_R2.fastq.gz`, where the wildcard `*` is the name of the sequencing run. The alphanumeric order of the sample description `.dat` file names must match that of the paired-end raw-read `.fastq.gz` files. A sample description file is a text file in which each line describes one sample, with columns separated by space or tab characters; its format is described [here](https://pythonhosted.org/OBITools/scripts/ngsfilter.html).
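As a minimal sketch, the expected input layout can be mocked up with empty placeholder files (`run1`, `db_std` and `embl_std` are hypothetical names; use your real data):

```shell
# Mock up the expected input layout with empty placeholder files
# (run1, db_std, embl_std are hypothetical names, not real data).
mkdir -p workdir/bdr workdir/raw

# One sequencing run: two paired fastq files and one sample description
# file, all sharing the same run name so their alphanumeric orders match.
touch workdir/raw/run1_R1.fastq.gz workdir/raw/run1_R2.fastq.gz
touch workdir/raw/run1.dat

# Reference database: a fasta file, plus the indexed database files
# sharing a common prefix such as embl_std (their extensions depend on
# how the database was built).
touch workdir/bdr/db_std.fasta

ls workdir/raw workdir/bdr
```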

* Overview of the steps

	0. Configuration
	1. Merge Illumina paired-end sequences by pair
	2. Assign each merged sequence to the corresponding sample
	3. Dereplicate sequences
	4. Filter unique sequences according to their quality and abundance
	5. Remove singletons and PCR errors
	6. Assign each sequence to a species
	7. Write a species/sample matrix
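Behind the scenes, these steps map onto OBItools commands. The sketch below only prints a plausible command chain for one hypothetical run (`run1`), with flags mirroring the configuration values of section 4.2; the authoritative commands are the rules in the Snakefile, so treat this as an illustration, not a recipe.

```shell
# Print (rather than execute) a plausible OBItools command chain for one
# hypothetical run; the real flags and file names come from the Snakefile.
run=run1
cmds=$(cat <<EOF
illuminapairedend --score-min=40 -r raw/${run}_R2.fastq.gz raw/${run}_R1.fastq.gz > work/${run}.fastq
ngsfilter -t raw/${run}.dat -u work/${run}_unidentified.fastq work/${run}.fastq > work/${run}_assigned.fastq
obiuniq -m sample work/${run}_assigned.fastq > work/${run}_uniq.fasta
obigrep -p 'count>=10' -l 20 work/${run}_uniq.fasta > work/${run}_filtered.fasta
obiclean -r 0.05 -H work/${run}_filtered.fasta > work/${run}_clean.fasta
ecotag -d bdr/embl_std -R bdr/db_std.fasta work/${run}_clean.fasta > work/${run}_tagged.fasta
obitab -o work/${run}_tagged.fasta > tables/${run}.tab
EOF
)
echo "$cmds"
```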


## 4.2 Configuration

Parameters for each program are stored in the file [config.yaml](config.yaml).
Before running the pipeline, you have to set your parameters: please edit [config.yaml](config.yaml).


```yaml
illuminapairedend:
  s_min: 40
good_length_samples:
  count: 10
  seq_length: 20
clean_pcrerr_samples:
  r: 0.05
assign_taxon:
  bdr: bdr/embl_std
  fasta: bdr/db_std.fasta
```

* `s_min : 40` : minimum alignment score for keeping an aligned pair. If the alignment score is below this threshold, the two sequences are simply concatenated and the `mode` attribute is set to the value `joined`.
	- software : `illuminapairedend`
	- step : merge Illumina paired-end sequences by pair
	- we set this value to 40
* `count : 10` : minimum number of copies for keeping a sequence.
	- software : `obigrep`
	- step : filter unique sequences according to their quality and abundance
	- we set this value to 10
* `seq_length : 20` : minimum length for keeping a sequence.
	- software : `obigrep`
	- step : filter unique sequences according to their quality and abundance
	- we set this value to 20
* `r : 0.05` : threshold ratio between the counts (rare/abundant) of two sequence records for the less abundant one to be considered a variant of the more abundant one.
	- software : `obiclean`
	- step : remove singletons and PCR errors
	- we set this value to 0.05
* `bdr : bdr/embl_std` : relative path to the folder `bdr` which contains the reference database files, followed by the prefix of the reference database files, for instance `embl_something`.
	- software : `ecotag`
	- step : assign each sequence to a species
* `fasta : bdr/db_std.fasta` : relative path to the fasta file of the reference database.
	- software : `ecotag`
	- step : assign each sequence to a species


## 4.3 Run the pipeline with a single command
```
bash main.sh /path/to/fastq_dat_files /path/to/reference_database_folder 16
```
The order of the arguments matters: 1) path to the folder which contains the paired-end raw-read files and the sample description files, 2) path to the folder which contains the reference database files, 3) number of available cores (here, 16 cores).
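For orientation, a wrapper in the spirit of `main.sh` might look like the hypothetical sketch below; the function name and the `--config` keys are assumptions, not the real interface, and the snakemake invocation is printed instead of executed.

```shell
# Hypothetical sketch of a main.sh-style wrapper: validate the three
# positional arguments and show how they could be handed to snakemake.
# The --config keys (datadir, bdr) are assumptions, not the real interface.
run_pipeline() {
    if [ "$#" -ne 3 ]; then
        echo "usage: run_pipeline <fastq_dat_dir> <ref_db_dir> <cores>" >&2
        return 1
    fi
    # Print instead of executing, so the sketch runs without snakemake installed.
    echo "snakemake --cores $3 --config datadir=$1 bdr=$2"
}

run_pipeline /path/to/fastq_dat_files /path/to/reference_database_folder 16
```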
* to run the pipeline step by step, open the file `main.sh` and see the details of each command

That's it! The pipeline is running and crunching your data. Look at the `log` folder and the output folders once the pipeline has finished.

## 4.4 Run the pipeline step by step

# 5. Results
* `bdr`
* `runs`
* `raw`
* `samples`
* `tables`
* `work`
* `assembled`

WORK IN PROGRESS