Only_obitools pipeline using NEXTFLOW
=====================================

# Table of contents

1. [Introduction](#1-introduction)
2. [Installation](#2-installation)
    1. [Requirements](#21-requirements)
    2. [Initialisation](#22-initialisation)
3. [Reporting bugs](#3-reporting-bugs)
4. [Running the pipeline](#4-running-the-pipeline)

-----------------

# 1. Introduction

Here, we reproduce the bioinformatics pipeline used by [SPYGEN](http://www.spygen.com/) to infer species presence in the environment from raw eDNA data. This pipeline is based on [OBItools](https://git.metabarcoding.org/obitools/obitools/wikis/home), a set of Python programs designed to analyse next-generation sequencing outputs (Illumina) in the context of DNA metabarcoding. The pipeline uses the workflow management system [Nextflow](https://www.nextflow.io/), so you will need to install it. If you don't want to use a workflow management system, an "only bash" version is alternatively available [here](http://gitlab.mbb.univ-montp2.fr/edna/only_obitools).

# 2. Installation

## 2.1. Requirements

In order to run "only_obitools", you need a couple of programs. Most of them should be available pre-compiled for your distribution. The programs and libraries you absolutely need are:

- [OBItools](https://pythonhosted.org/OBITools/welcome.html#installing-the-obitools)
- [Java 8 (or later)](https://www.nextflow.io/docs/latest/getstarted.html)

In addition, you will need a reference database for taxonomic assignment. You can build a reference database by following the instructions [here](http://gitlab.mbb.univ-montp2.fr/edna/reference_database).

## 2.2. Initialisation

* Open a shell.
* Make a folder and name it yourself; here it is named `workdir`:
```
mkdir workdir
cd workdir
```
* Clone the project and switch to the main folder; it is your working directory:
```
git clone http://gitlab.mbb.univ-montp2.fr/edna/nextflow_obitools.git
cd nextflow_obitools
```
* Define 2 external folders:
  - a folder which contains the reference database files. You can build a reference database by following the instructions [here](http://gitlab.mbb.univ-montp2.fr/edna/reference_database);
  - a folder which contains the paired-end raw reads `.fastq.gz` files and the sample description `.dat` files. Raw read files from the same pair must be named `*_R1.fastq.gz` and `*_R2.fastq.gz`, where the wildcard `*` is the name of the sequencing run. The alphanumeric order of the sample description `.dat` file names must be the same as that of the paired-end raw reads `.fastq.gz` file names. The sample description file is a text file where each line describes one sample, with columns separated by space or tab characters; its format is described [here](https://pythonhosted.org/OBITools/scripts/ngsfilter.html) and sketched just below.
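For illustration only, a minimal sample description file might look like the sketch below, loosely adapted from the wolf-diet example in the OBItools documentation. The experiment name, sample names, tags and extra information are made up, and the primers must be replaced by those of your own marker; refer to the `ngsfilter` page linked above for the authoritative column definitions:
```
#exp	sample	tags	forward_primer	reverse_primer	extra_information
eDNA_run1	sample_01	aattaac	TTAGATACCCCACTATGC	TAGAACAGGCTCCTCTAG	F @ site=lake_01
eDNA_run1	sample_02	gaagtag	TTAGATACCCCACTATGC	TAGAACAGGCTCCTCTAG	F @ site=lake_02
```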
# 3. Reporting bugs

If you're sure you've found a bug (e.g. one of the programs crashes with an obscure error message, or the resulting file is missing part of the original data), then by all means submit a bug report. I use [GitLab's issue system](https://gitlab.mbb.univ-montp2.fr/edna/nextflow_obitools/issues) as my bug database, and you can submit your bug reports there. Please be as verbose as possible (e.g. include the command line used, the error message, etc.).

# 4. Running the pipeline

Quickstart

1. Create a new folder for nextflow to work in.
2. Switch to this new folder.
3. Open a shell.
4. Type this command to download nextflow into this folder:
```
curl -fsSL get.nextflow.io | bash
```
5. Make sure that the programs listed in the [Requirements](#21-requirements) section are installed on your machine.
6. Once nextflow is downloaded, replace the placeholder paths in the commands below with your own paths, then run your commands.

Demultiplexing and filtering of the eDNA metabarcoding raw data:
```
./nextflow run scripts/step1.nf --datafolder 'path/to/fastq/and/dat/files'
```
Outputs are stored in a newly created `work/` folder.

Concatenating samples by run id:
```
bash scripts/step2.sh
```
Cleaned sequences for each run are stored in a newly created `runs/` folder.

Taxonomic assignment and generation of a species/sample matrix for each run:
```
./nextflow run scripts/step3.nf --db_ref /path/to/reference/database/and/prefix --db_fasta /path/to/reference/database/fasta/file
```
To build your own reference database, see the details [here](http://gitlab.mbb.univ-montp2.fr/edna/reference_database).

Alternatively, you can run the whole pipeline with one single command:
```
bash main.sh path/to/fastq/and/dat/files /path/to/reference/database/and/prefix /path/to/reference/database/fasta/file
```
That's it! The pipeline is running and crunching your data. Look for `overview.txt` or `overview_new.txt` in your output folder after the pipeline has finished. A hypothetical end-to-end run is sketched below.
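As an illustration only, here is a hypothetical input folder and the matching single-command run. Every path and file name below is made up and should be replaced by your own; only `main.sh` and the `scripts/` files come from this repository:
```
# hypothetical input folder: one sample description .dat file per sequencing run,
# named so that its alphanumeric order matches the paired *_R1/_R2.fastq.gz files
ls /home/user/edna_data
# run01_R1.fastq.gz  run01_R2.fastq.gz  run01.dat
# run02_R1.fastq.gz  run02_R2.fastq.gz  run02.dat

# whole pipeline in one command (reference database prefix and fasta file are hypothetical)
bash main.sh /home/user/edna_data /home/user/ref_db/db_teleo /home/user/ref_db/db_teleo.fasta
```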