Metabarcoding Only_obitools workflow using SNAKEMAKE
======================================
[![https://www.singularity-hub.org/static/img/hosted-singularity--hub-%23e32929.svg](https://www.singularity-hub.org/static/img/hosted-singularity--hub-%23e32929.svg)](https://singularity-hub.org/collections/2878)
**Pierre-Edouard Guerin, 2019**
_________________________________
# Table of contents
1. [Introduction](#1-introduction)
2. [Installation](#2-installation)
    1. [Dependencies](#21-dependencies)
    2. [Singularity containers](#22-singularity-containers)
    3. [Reference database](#23-reference-database)
3. [Running the workflow](#3-running-the-workflow)
    1. [Initialisation](#31-initialisation)
    2. [Configuration](#32-configuration)
    3. [Run the workflow with a single command](#33-run-the-workflow-with-a-single-command)
    4. [Run the workflow step by step](#34-run-the-workflow-step-by-step)
4. [Results](#4-results)
5. [Reporting bugs](#5-reporting-bugs)
_________________________________
# 1. Introduction
Here, we reproduce the bioinformatics workflow used by [SPYGEN](http://www.spygen.com/) to generate species environmental presence data from raw eDNA data. This workflow is based on [OBItools](https://git.metabarcoding.org/obitools/obitools/wikis/home), a set of Python programs designed to analyse Next Generation Sequencing outputs (Illumina) in the context of DNA metabarcoding.
* This workflow is designed to work on a Debian-based Linux system.
* We use the workflow management system [snakemake](https://bitbucket.org/snakemake/snakemake), so you will need to install it.
* We use a container generated by [singularity](https://singularity.lbl.gov/install-linux), so you will need to install it as well.
* If you don't want to use either a workflow management system or a container, an "only bash" version is available [here](http://gitlab.mbb.univ-montp2.fr/edna/only_obitools).
![workflow schema](schema_only_obitools.png)
# 2. Installation
## 2.1 Dependencies
In order to run "snakemake_only_obitools", you need a couple of programs. Most of them should be available pre-compiled for your distribution. The programs and libraries you absolutely need are:
- [python3](https://www.python.org/download/releases/3.0/)
- [OBItools](https://pythonhosted.org/OBITools/welcome.html#installing-the-obitools)
- [singularity](https://singularity.lbl.gov/install-linux)
- [snakemake](https://bitbucket.org/snakemake/snakemake)
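For instance, snakemake can typically be installed through pip (one option among several; a conda-based install works too):

```bash
# install the workflow manager; see the snakemake docs for other install routes
pip3 install snakemake
```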
## 2.2 Singularity containers
All the other software you need to run the workflow has been installed in a singularity container.
To download this container:
```
singularity pull --name obitools.simg shub://Grelot/bioinfo_singularity_recipes:obitools
```
You will get a file named `obitools.simg`. Snakemake will need it to run the software it contains.
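To check that the container works, you can run one of the bundled programs through it; the call below is just a sketch (any OBItools command with `--help` will do):

```bash
# ask the container to print the help of an OBItools program
singularity exec obitools.simg illuminapairedend --help
```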
## 2.3 Reference database
In addition, you will need a "miseq" reference database for taxonomic assignment. You can build a reference database by following the instructions [here](http://gitlab.mbb.univ-montp2.fr/edna/reference_database).
# 3. Running the workflow
Overview of the steps:
0. Configuration
1. Merge Illumina paired-end sequences by pair
2. Assign each merged sequence to the corresponding sample
3. Dereplicate sequences
4. Filter unique sequences according to their quality and abundance
5. Remove singletons and PCR errors
6. Assign each sequence to a species
7. Write a species/sample matrix
## 3.1 Initialisation
* open a shell
* make a folder, name it yourself; I named it `workdir`
```
mkdir workdir
cd workdir
git clone http://gitlab.mbb.univ-montp2.fr/edna/snakemake_only_obitools.git
cd snakemake_only_obitools
```
* define 3 external folders:
  - a folder which contains the `obitools.simg` singularity container file. See [Singularity containers](#22-singularity-containers).
  - a folder which contains the reference database files. You can build a reference database by following the instructions [here](http://gitlab.mbb.univ-montp2.fr/edna/reference_database).
  - a folder which contains the paired-end raw read `.fastq.gz` files and the sample description `.dat` files. Raw read files from the same pair must be named `{run}_R1.fastq.gz` and `{run}_R2.fastq.gz`, where the wildcard `{run}` is the name of the sequencing run. The sample description file is a text file where each line describes one sample, with columns separated by space or tab characters; its format is described [here](https://pythonhosted.org/OBITools/scripts/ngsfilter.html). A minimal sketch of such a file is shown after the directory tree below.
Here is the expected list of external directories in a tree-like format:
```
workdir
├── conteneur
│   └── obitools.simg
├── edna_miseq_rawdata
│   ├── sample_description1.dat
│   ├── sample_description2.dat
│   ├── seqrunA_R1.fastq.gz
│   ├── seqrunA_R2.fastq.gz
│   ├── seqrunB_R1.fastq.gz
│   └── seqrunB_R2.fastq.gz
├── reference_database
│   ├── db_embl_std.fasta
│   └── embld_std*
└── snakemake_only_obitools/
```
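For reference, a sample description `.dat` file (say, `sample_description1.dat` for run `seqrunA`) could look like the sketch below; the sample names, tags, and primer sequences are made-up placeholders, so refer to the [ngsfilter documentation](https://pythonhosted.org/OBITools/scripts/ngsfilter.html) for the exact column semantics:

```
seqrunA   sample01   aattaac:gaagtag   TTAGATACCCCACTATGC   TAGAACAGGCTCCTCTAG   F
seqrunA   sample02   ttgccga:ccttcga   TTAGATACCCCACTATGC   TAGAACAGGCTCCTCTAG   F
```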
## 3.2 Configuration
Useful parameters for each program are stored in the file [config.yaml](config.yaml).
Before running the workflow, you have to set your parameters. Please edit [config.yaml](config.yaml).
```diff
illuminapairedend:
...
assign_taxon:
- step : assign each sequence to a species
```
## 3.3 Run the workflow with a single command
```
bash main.sh /path/to/fastq_dat_files /path/to/reference_database 16
```
The order of the arguments is important: 1) path to the folder which contains the paired-end raw read files and the sample description files, 2) path to the folder which contains the reference database files, 3) the number of available cores (here, for instance, 16 cores).
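Using the hypothetical directory tree above and assuming you are inside `snakemake_only_obitools`, the call would look like:

```bash
# paths are relative to workdir/snakemake_only_obitools, 16 cores assumed
bash main.sh ../edna_miseq_rawdata ../reference_database 16
```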
That's it! The workflow is running and crunching your data. Look for the log folder and the output folders once the workflow is finished. See the [Results](#4-results) section.
## 3.4 Run the workflow step by step
To run the workflow step by step, open the file [main.sh](main.sh) to see the details of each step.
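As an illustration, a step launched from [main.sh](main.sh) may boil down to a snakemake invocation along the lines of the sketch below; the Snakefile name and bind paths are assumptions, so check [main.sh](main.sh) for the actual commands:

```bash
# hypothetical sketch of one snakemake call; the real commands live in main.sh
snakemake --snakefile Snakefile \
    --configfile config.yaml \
    --cores 16 \
    --use-singularity \
    --singularity-args "--bind /path/to/fastq_dat_files --bind /path/to/reference_database"
```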
# 4. Results
Let's define some wildcards:
- `{run}` : ID of any sequencing run
Provided data (see [Initialisation](#31-initialisation)) as input are stored into …
## Intermediate files
Files generated at each step of the workflow: everything needed for building the final results output files.
* `assembled`
  - `assembled/{run}/{run}.fastq` : merged Illumina paired-end sequences, produced by [illuminapairedend](https://pythonhosted.org/OBITools/scripts/illuminapairedend.html?highlight=illumina#module-illuminapairedend) (see the sketch below).
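As an illustration, the underlying command for this step looks roughly like the following sketch (assuming uncompressed input files and a minimum alignment score of 40; the actual option values come from [config.yaml](config.yaml)):

```bash
# align and merge the forward/reverse reads of run seqrunA into one fastq
illuminapairedend --score-min=40 -r seqrunA_R2.fastq seqrunA_R1.fastq > seqrunA.fastq
```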
......
# 5. Reporting bugs
If you're sure you've found a bug (e.g. one of the programs crashes with an obscure error message, or the resulting file is missing part of the original data), then by all means submit a bug report. I use [GitLab's issue system](http://gitlab.mbb.univ-montp2.fr/edna/snakemake_only_obitools/issues) as my bug database. You can submit your bug reports there. Please be as verbose as possible (e.g. include the command line, error messages, etc.).