README.md 12.9 KB
Newer Older
peguerin's avatar
peguerin committed
1
Metabarcoding Only_obitools workflow using SNAKEMAKE
peguerin's avatar
peguerin committed
2
3
======================================

peguerin's avatar
peguerin committed
4
5
6
[![https://www.singularity-hub.org/static/simg/hosted-singularity--hub-%23e32929.svg](https://www.singularity-hub.org/static/img/hosted-singularity--hub-%23e32929.svg)](https://singularity-hub.org/collections/2878)


peguerin's avatar
peguerin committed
7
**Virginie Marques, Pierre-Edouard Guerin, 2019**
peguerin's avatar
peguerin committed
8
9
10

_________________________________

peguerin's avatar
peguerin committed
11
12
13
14
# Table of contents

1. [Introduction](#1-introduction)
2. [Installation](#2-installation)
peguerin's avatar
peguerin committed
15
16
	1. [Dependencies](#21-dependencies)
	2. [Singularity containers](#22-singularity-containers)
peguerin's avatar
peguerin committed
17
18
  3. [Conda environment](#24-conda-environment)
	4. [Reference database](#24-reference-database)
peguerin's avatar
peguerin committed
19
20
21
22
23
24
25
26
27
3. [Running the workflow](#3-running-the-workflow)
	1. [Initialisation](#31-initialisation)
	2. [Configuration](#32-configuration)
	3. [Run the workflow into a single command](#33-run-the-workflow-into-a-single-command)
	4. [Run the workflow step by step](#34-run-the-workflow-step-by-step)
4. [Results](#4-results)


_________________________________
peguerin's avatar
peguerin committed
28
29
30
31


# 1. Introduction

peguerin's avatar
peguerin committed
32
33
34
35
36
37
38
39
Here, we reproduce the bioinformatics workflow used by [SPYGEN](http://www.spygen.com/) to generate species environmental presence from raw eDNA data. This workflow is based on [OBItools](https://git.metabarcoding.org/obitools/obitools/wikis/home) a set of python programs designed to analyse Next Generation Sequencer outputs (illumina) in the context of DNA Metabarcoding.


* This workflow is designed to work on a LINUX debian-based system
* We use the workflow management system [snakemake](https://bitbucket.org/snakemake/snakemake). So you will need to install it. 
* We use a container generated by [singularity](https://singularity.lbl.gov/install-linux). So you will need to install it.
* If you don't want to use neither a workflow management system nor a container, an "only bash" version is alternatively available [here](http://gitlab.mbb.univ-montp2.fr/edna/only_obitools).

peguerin's avatar
peguerin committed
40

peguerin's avatar
peguerin committed
41

peguerin's avatar
peguerin committed
42
![workflow schema](schema_only_obitools.png)
peguerin's avatar
peguerin committed
43

peguerin's avatar
peguerin committed
44
45
46

# 2. Installation

peguerin's avatar
peguerin committed
47
48
49
## 2.1 Dependencies

In order to run "snakemake_only_obitools", you need a couple of programs. The
peguerin's avatar
peguerin committed
50
51
52
53
programs and libraries you absolutely need are:

- [python3](https://www.python.org/download/releases/3.0/)

peguerin's avatar
peguerin committed
54
- [singularity](https://singularity.lbl.gov/install-linux)
peguerin's avatar
peguerin committed
55
56
57

- [snakemake](https://bitbucket.org/snakemake/snakemake)

peguerin's avatar
peguerin committed
58
59
60
61
62
## 2.2 Singularity containers

All the others softwares you need to run the workflow have been installed in a singularity container.

To download this container :
peguerin's avatar
peguerin committed
63

peguerin's avatar
peguerin committed
64
65
66
67
```
singularity pull --name obitools.simg shub://Grelot/bioinfo_singularity_recipes:obitools
```
You will get a file named `obitools.simg`. Snakemake will need it to run softwares it contains.
peguerin's avatar
peguerin committed
68
69


peguerin's avatar
peguerin committed
70
## 2.3 Conda environment
peguerin's avatar
peguerin committed
71

peguerin's avatar
peguerin committed
72
73
74
75
76
77
78
79
80
81
82
83
84
85
The default conda solver is a bit slow and sometimes has issues with selecting the latest package releases. Therefore, we recommend to install Mamba as a drop-in replacement via

```
conda install -c conda-forge mamba
```
Then, you can install Snakemake, pandas, biopython and dependencies with
```
mamba create -n snakemake_only_obitools -c conda-forge -c bioconda snakemake biopython
```
from the conda-forge and bioconda channels. This will install all required software into an isolated software environment, that has to be activated with
conda activate snakemake_only_obitools


## 2.4 Reference database
peguerin's avatar
peguerin committed
86

peguerin's avatar
peguerin committed
87
In addition, you will need a "miseq" reference database for taxonomic assignment. You can build a reference database by following the instructions [here](http://gitlab.mbb.univ-montp2.fr/edna/reference_database).
peguerin's avatar
peguerin committed
88

peguerin's avatar
peguerin committed
89
90
91
92

# 3. The workflow

## 3.1 Initialisation
peguerin's avatar
peguerin committed
93
94


peguerin's avatar
peguerin committed
95
* open a shell
peguerin's avatar
peguerin committed
96
* make a folder, name it yourself, I named it `workdir`
peguerin's avatar
peguerin committed
97

peguerin's avatar
peguerin committed
98
99
100
101
```
mkdir workdir
cd workdir
```
peguerin's avatar
peguerin committed
102
* clone the project and switch to the main folder, it's your working directory
peguerin's avatar
peguerin committed
103

peguerin's avatar
peguerin committed
104
105
106
107
```
git clone http://gitlab.mbb.univ-montp2.fr/edna/snakemake_only_obitools.git
cd snakemake_only_obitools
```
peguerin's avatar
peguerin committed
108
109
* define 3 external folders : 
    - folder which contains `obitools.simg` singularity container file. See [Singularity containers](#22-singularity-containers).
peguerin's avatar
peguerin committed
110
    - folder which contains reference database files. You can build a reference database by following the instructions [here](http://gitlab.mbb.univ-montp2.fr/edna/reference_database).
peguerin's avatar
peguerin committed
111
    - folder which contains pairend-end raw reads `.fastq.gz` files and the sample description `.dat` files. Raw reads files from the same pair must be named as `{run}_R1.fastq.gz` and `{run}_R2.fastq.gz` where wildcard `{run}` is the name of the sequencing run. The sample description file is a text file where each line describes one sample. Columns are separated by space or tab characters. Sample description file is described [here](https://pythonhosted.org/OBITools/scripts/ngsfilter.html).
peguerin's avatar
peguerin committed
112

peguerin's avatar
peguerin committed
113
Here the excepted list of external directories in a tree-like format:
peguerin's avatar
peguerin committed
114

peguerin's avatar
peguerin committed
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
```
workdir
├── conteneur
│   └── obitools.simg
├── edna_miseq_rawdata
│   ├── sample_description1.dat
│   ├── sample_description2.dat
│   ├── seqrunA_R1.fastq.gz
│   ├── seqrunA_R2.fastq.gz
│   ├── seqrunB_R1.fastq.gz
│   └── seqrunB_R2.fastq.gz
├── reference_database
│   ├── db_embl_std.fasta
│   └── embld_std*
└── snakemake_only_obitools/
```
peguerin's avatar
peguerin committed
131
132


peguerin's avatar
peguerin committed
133
## 3.2 Configuration
peguerin's avatar
peguerin committed
134

peguerin's avatar
peguerin committed
135
Useful parameters for each program are stored into the file [config.yaml](config.yaml)
peguerin's avatar
peguerin committed
136
Before to run the workflow, you have to set your parameters. Please edit [config.yaml](config.yaml).
peguerin's avatar
peguerin committed
137

peguerin's avatar
peguerin committed
138
139
Here an example of a config.yaml file:

peguerin's avatar
peguerin committed
140
```diff
peguerin's avatar
peguerin committed
141
container:
peguerin's avatar
peguerin committed
142
  /workdir/conteneur/obitools.simg
peguerin's avatar
peguerin committed
143
144
145
146
147
148
149
150
fastqFolderPath:
  /workdir/edna_miseq_rawdata/
fastqFiles:
  - seqrunA
  - seqrunB
barcodeFiles:
  seqrunA : sample_description1.dat
  seqrunB : sample_description2.dat
peguerin's avatar
peguerin committed
151
illuminapairedend:
peguerin's avatar
peguerin committed
152
  s_min : 40
peguerin's avatar
peguerin committed
153
good_length_samples:
peguerin's avatar
peguerin committed
154
155
  count : 10
  seq_length : 20
peguerin's avatar
peguerin committed
156
clean_pcrerr_samples:
peguerin's avatar
peguerin committed
157
  r : 0.05
peguerin's avatar
peguerin committed
158
assign_taxon:
peguerin's avatar
peguerin committed
159
160
  bdr : /workdir/reference_database/embl_std
  fasta : /workdir/reference_database/db_embl_std.fasta
peguerin's avatar
peguerin committed
161
162
```

peguerin's avatar
peguerin committed
163
Table of parameters into config.yaml file:
peguerin's avatar
peguerin committed
164

peguerin's avatar
peguerin committed
165
parameters |  descriptions   | softwares |  rules             | default values    | excepted type
peguerin's avatar
peguerin committed
166
---------|------------------|-------|------------------|---------|----
peguerin's avatar
peguerin committed
167
container | absolute path of singularity container file `obitools.simg` | [singularity](https://singularity.lbl.gov/) | every rules need this container to work | /workdir/conteneur/obitools.simg  | absolute path of `simg` file
peguerin's avatar
peguerin committed
168
169
fastqFolderPath   | absolute path of a folder which contains pairend-end raw reads `.fastq.gz` files and the sample description `.dat` files. | [illuminapairedend](https://pythonhosted.org/OBITools/scripts/illuminapairedend.html?highlight=illumina#module-illuminapairedend), [ngsfilter](https://pythonhosted.org/OBITools/scripts/ngsfilter.html)  | illuminapairedend, assign_sequences | /workdir/edna_miseq_rawdata/  | absolute path of a folder
fastqFiles | list of wildcard `{run}` where `{run}` is the name of the sequencing run | [illuminapairedend](https://pythonhosted.org/OBITools/scripts/illuminapairedend.html?highlight=illumina#module-illuminapairedend) | illuminapairedend |  seqrunA, seqrunB | list of strings
peguerin's avatar
peguerin committed
170
barcodeFiles | a dictionary with `{run}` as key and associated sample description file as value |  [ngsfilter](https://pythonhosted.org/OBITools/scripts/ngsfilter.html) | assign_sequences |  {seqrunA : sample_description1.dat, seqrunB : sample_description2.dat} | dictionary
peguerin's avatar
peguerin committed
171
172
173
174
175
176
177
s_min | score for keeping alignment. If the alignment score is below this threshold both the sequences are  just  concatenated. The mode attribute is set to the value joined | [illuminapairedend](https://pythonhosted.org/OBITools/scripts/illuminapairedend.html?highlight=illumina#module-illuminapairedend) | illuminapairedend | 40 | integer
count | minimum number of copy for keeping a sequence | [obigrep](https://pythonhosted.org/OBITools/scripts/obigrep.html?highlight=obigrep#module-obigrep) | good_length_samples | 10 | integer
seq_length | minimum length for keeping a sequence | [obigrep](https://pythonhosted.org/OBITools/scripts/obigrep.html?highlight=obigrep#module-obigrep) | good_length_samples | 20 | integer
r | threshold ratio between counts  (rare/abundant  counts)  of  two sequence records  so that the less abundant one is a variant of the more abundant | [obiclean](https://pythonhosted.org/OBITools/scripts/obiclean.html?highlight=obiclean#module-obiclean) | clean_pcrerr_samples | 0.05 | float
bdr | absolute path to the folder which contains reference database files and prefix | [ecotag](https://pythonhosted.org/OBITools/scripts/ecotag.html?highlight=ecotag#module-ecotag) | assign_taxon | /workdir/reference_database/embl_std | absolute path of a folder + prefix
fasta | absolute path to the fasta file of the reference database | [ecotag](https://pythonhosted.org/OBITools/scripts/ecotag.html?highlight=ecotag#module-ecotag) | assign_taxon | /workdir/reference_database/db_embl_std.fasta | absolute path to a fasta file

peguerin's avatar
peguerin committed
178
179


peguerin's avatar
peguerin committed
180
## 3.3 Run the workflow into a single command
peguerin's avatar
peguerin committed
181

peguerin's avatar
peguerin committed
182
```
peguerin's avatar
peguerin committed
183
184
CORES=16
bash main.sh $CORES
peguerin's avatar
peguerin committed
185
```
peguerin's avatar
peguerin committed
186

peguerin's avatar
peguerin committed
187
188
with `CORES` the number of available cores to apply parallelization on the workflow.

peguerin's avatar
peguerin committed
189

peguerin's avatar
peguerin committed
190
that's it ! The workflow is running and crunching your data. Look for the [04-final_tables](04-final_tables) and [99-log](99-log) folders after the workflow is finished. See [Results](#4-results) section.
peguerin's avatar
peguerin committed
191

peguerin's avatar
peguerin committed
192
193
194
195
196
197
198
199
200
201
Alternatively you can use `conda` instead of `singularity`

```
CORES=16
conda activate snakemake_only_obitools
bash main_conda.sh $CORES
```



peguerin's avatar
peguerin committed
202

peguerin's avatar
peguerin committed
203
## 3.4 Run the workflow step by step
peguerin's avatar
peguerin committed
204

peguerin's avatar
peguerin committed
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
Open the file [main.sh](main.sh) to see details:

### 3.4.1 Merge paired-end sequences and demultiplexing

```
CORES=16
cd 01-assembly
snakemake -s Snakefile -j $CORES --use-singularity --singularity-args "--bind /media/superdisk:/media/superdisk" --latency-wait 120
cd ..
```
Check intermediate files into [01-assembly](01-assembly) folder.  
Demultiplexed files will be generated into [02-demultiplex/01-raw](02-demultiplex/01-raw) folder.


### 3.4.2 Filter sequences

```
CORES=16
cd 02-demultiplex
snakemake -s Snakefile -j $CORES --use-singularity --singularity-args "--bind /media/superdisk:/media/superdisk" --latency-wait 120
cd ..
```

Check intermediate files into [02-demultiplex](02-demultiplex) folder.  
Check filtered sequences into [02-demultiplex/03-cleaned](02-demultiplex/03-cleaned) folder.


### 3.4.3 Concatenate sample files into run files

```
for run in `ls 02-demultiplex/03-cleaned/`;
do cat 02-demultiplex/03-cleaned/${run}/*.c.r.l.u.fasta > 03-filtered/01-runs/${run}_run.fasta ;
done
```

Check concatened `{run}` files into [03-filtered/01-runs/](03-filtered/01-runs/) folder.

### 3.4.4 Taxonomic assignation and format fasta files into tables

```
CORES=16 
cd 03-filtered
snakemake -s Snakefile -j $CORES --use-singularity --singularity-args "--bind /media/superdisk:/media/superdisk" --latency-wait 120
cd ..
```
Check intermediate files into [03-filtered](03-filtered) folder.  
Final tables will be generated into [04-final_tables](04-final_tables) folder.

peguerin's avatar
peguerin committed
253

peguerin's avatar
peguerin committed
254

peguerin's avatar
peguerin committed
255
# 4. Results
peguerin's avatar
peguerin committed
256
257
258
259
260
261
262

Let's define some wildcards `*` :
- `{run}` : ID of any sequencing run
- `{sample}` : ID of any sample

## Input files

peguerin's avatar
peguerin committed
263
Provided data (see [Initialisation](#31-initialisation)):
peguerin's avatar
peguerin committed
264

peguerin's avatar
peguerin committed
265
* Singularity container
peguerin's avatar
peguerin committed
266

peguerin's avatar
peguerin committed
267
* Reference database files. Relative path of fasta files and used prefix must be written in [config.yaml](config.yaml) (see [Configuration](#32-configuration)).
peguerin's avatar
peguerin committed
268

peguerin's avatar
peguerin committed
269
* FASTQ files. Raw paired-end reads files `{run}_R1.fastq.gz` and `{run}_R2.fastq.gz` where `{run}` is the name of the sequencing run (see [Configuration](#32-configuration)).
peguerin's avatar
peguerin committed
270

peguerin's avatar
peguerin committed
271
* Sample description .dat files (see [Configuration](#32-configuration)).
peguerin's avatar
peguerin committed
272

peguerin's avatar
peguerin committed
273
## Intermediate files
peguerin's avatar
peguerin committed
274

peguerin's avatar
peguerin committed
275
A Snakemake workflow is defined by specifying rules in a Snakefile. Rules decompose the workflow into small steps by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names.
peguerin's avatar
peguerin committed
276

peguerin's avatar
peguerin committed
277
A rule is designed to generate output files and log files.
peguerin's avatar
peguerin committed
278

peguerin's avatar
peguerin committed
279
280
281
* [01-assembly](01-assembly)
* [02-demultiplex](02-demultiplex)
* [03-filtered](03-filtered)
peguerin's avatar
peguerin committed
282
283


peguerin's avatar
peguerin committed
284
## Final results
peguerin's avatar
peguerin committed
285

peguerin's avatar
peguerin committed
286
* [04-final_tables](04-final_tables): this folder contains all the matrix species/sample for each `{run}`.
peguerin's avatar
peguerin committed
287
288
289

## Log files

peguerin's avatar
peguerin committed
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
* [99-log](99-log): this folder contains log files for every rules for every `{run}` for every `{sample}`.

```                
99-log/
├── 01-assembly
│   └── {run}.log
├── 02-remove_unaligned
│   └── {run}.log
├── 03-assign_sequences
│   └── {run}.log
├── 04-split_sequences
│   └── {run}.log
├── 05-dereplicate_samples
│   └── 161124_SND393_A_L005_GWM-849
├── 06-goodlength_samples
│   └── {run}/{sample}.log
├── 07-clean_pcrerr
│   └── {run}/{sample}.log
├── 08-rm_internal_samples
│   └── {run}/{sample}.log
├── 09-dereplicate_runs
│   └── {run}.log
├── 10-assign_taxon
│   └── {run}.log
├── 11-rm_attributes
│   └── {run}.log
├── 12-sort_runs
│   └── {run}.log
└── 13-table_runs
    └── {run}.log
```