# STACKS2 using SNAKEMAKE Workflow

RADseq workflow using [STACKS2](http://creskolab.uoregon.edu/stacks/)

This was designed to process RADseq data from the [RESERVEBENEFIT](https://www.biodiversa.org/1023) project.

# Table of contents

1. [Introduction](#1-introduction)
2. [Installation](#2-installation)
    1. [Prerequisite](#21-prerequisite)
    2. [Data Files](#22-data-files)
    3. [Set up](#23-set-up)
3. [Reporting bugs](#3-reporting-bugs)
4. [Running the pipeline](#4-running-the-pipeline)
    1. [Initialisation](#41-initialisation)
    2. [Configuration](#42-configuration)
    3. [Run the pipeline into a single command](#43-run-the-pipeline-into-a-single-command)
    4. [Run the pipeline step by step](#44-run-the-pipeline-step-by-step)

# 1. Introduction

blablabla

# 2. Installation

## 2.1 Prerequisite

You must install the following software and packages:

- [SNAKEMAKE 5.3.0](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html)
  * Check the version and that the program is correctly installed by typing:

```
snakemake --version
## should give you the output
5.3.0
```

- [STACKS 2.2](http://catchenlab.life.illinois.edu/stacks/)
  * Check the version and that the programs are correctly installed by typing:

```
process_radtags --version
clone_filter --version
gstacks --version
populations --version
## should give you the output
2.2
```

- [BWA 0.7.17](https://icb.med.cornell.edu/wiki/index.php/Elementolab/BWA_tutorial)
  * Download `bwa` at: http://sourceforge.net/projects/bio-bwa/files/

```
tar -xvf bwa-x.x.x.tar.bz2
cd bwa-x.x.x
## BWA ships a plain Makefile: there is no configure script and no install target
make
## copy the resulting binary into a directory listed in your PATH
mkdir -p /where/to/install/bin
cp bwa /where/to/install/bin/
```

  * Check the version and that the program is correctly installed by typing:

```
bwa
## should give you the output
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
...
```

- [SAMTOOLS 1.9](http://www.htslib.org/)
  * Download `htslib` and `samtools` at: http://www.htslib.org/download/
  * Build each package from source:

```
cd htslib-1.x
./configure --prefix=/where/to/install
make
make install
cd ..
## and similarly for samtools :
cd samtools-1.x
./configure --prefix=/where/to/install
make
make install
```

  * Check the version and that the programs are correctly installed by typing:

```
samtools --version
## should give you the output
samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.
```

## 2.2 Data Files

The included data files are listed below. First, let's define some wildcards (`*`):

- `{run}` : any run
- `{pool}` : any pool within a run
- `{species}` : any species

* [config.yaml](01-info_files/config.yaml) :
* [barcodes.txt](01-info_files/barcodes.txt) : file containing the barcodes used for `{pool}` in `{run}`
* [{species}_infos.csv](01-info_files) : `.csv` information table related to `{species}`. Each row is a sample and there are 4 columns: run, pool, barcode, ID
* [{species}_populations_map.txt](01-info_files) : `.tsv` information table related to `{species}`. Each row is a sample and there are 2 columns: ID, population. This file can be generated by the pipeline (see the [Configuration](#42-configuration) section). However, we strongly recommend that you create it manually.

## 2.3 Set Up

Clone the project and switch to the main folder, which is your working directory:

```
git clone http://gitlab.mbb.univ-montp2.fr/reservebenefit/snakemake_stacks2.git
cd snakemake_stacks2
```

You will see the following folders:

* [00-scripts](00-scripts) : contains all the scripts required to run the whole pipeline
* [01-info_files](01-info_files) : contains all the required data files (see the [Data Files](#22-data-files) section above)
* [02-raw](02-raw) : must contain your data from paired-end Illumina sequencing runs.
The data must be stored this way:

```
02-raw/
    runA/
        poolA1/
            {poolA1}_R1_001.fastq.gz
            {poolA1}_R2_001.fastq.gz
        poolA2/
            {poolA2}_R1_001.fastq.gz
            {poolA2}_R2_001.fastq.gz
        ...
    runB/
        poolB1/
            {poolB1}_R1_001.fastq.gz
            {poolB1}_R2_001.fastq.gz
        ...
    ...
```

* [03-samples](03-samples) : will store the results generated by demultiplexing with [process_radtags](http://catchenlab.life.illinois.edu/stacks/comp/process_radtags.php) and by clone filtering with [clone_filter](http://catchenlab.life.illinois.edu/stacks/comp/clone_filter.php). The data will be stored this way:

```
03-samples/
    runA/
        poolA1/
            sample_{barcode1}.1.fq.gz
            sample_{barcode1}.2.fq.gz
            sample_{barcode2}.1.fq.gz
            sample_{barcode2}.2.fq.gz
            sample_{barcode3}.1.fq.gz
            sample_{barcode3}.2.fq.gz
            ...
        poolA1_clone_filtered/
            sample_{barcode1}.1.1.fq.gz
            sample_{barcode1}.2.2.fq.gz
            sample_{barcode2}.1.1.fq.gz
            sample_{barcode2}.2.2.fq.gz
            sample_{barcode3}.1.1.fq.gz
            sample_{barcode3}.2.2.fq.gz
            ...
        poolA2/
            sample_{barcode1}.1.fq.gz
            sample_{barcode1}.2.fq.gz
            ...
        poolA2_clone_filtered/
            sample_{barcode1}.1.1.fq.gz
            sample_{barcode1}.2.2.fq.gz
            ...
        ...
    runB/
        poolB1/
            sample_{barcode1}.1.fq.gz
            sample_{barcode1}.2.fq.gz
            ...
        poolB1_clone_filtered/
            sample_{barcode1}.1.1.fq.gz
            sample_{barcode1}.2.2.fq.gz
            ...
        ...
    ...
```

* [04-all_samples](04-all_samples) : paired-end `fastq.gz` files are renamed according to the information in [{species}_infos.csv](01-info_files). Reads are then aligned onto the reference genome sequences stored in [08-genomes](08-genomes). This folder contains the "named" fastq files and the corresponding `.bam` alignment files. `.sorted.bam` files are sorted alignment files and `.sorted.bam.bai` files are their indexes. The data will be stored this way:

```
04-all_samples/
    speciesA/
        {sampleA1}.1.fq.gz
        {sampleA1}.2.fq.gz
        {sampleA1}.bam
        {sampleA1}.sorted.bam
        {sampleA1}.sorted.bam.bai
        {sampleA2}.1.fq.gz
        {sampleA2}.2.fq.gz
        {sampleA2}.bam
        {sampleA2}.sorted.bam
        {sampleA2}.sorted.bam.bai
        ...
    speciesB/
        {sampleB1}.1.fq.gz
        {sampleB1}.2.fq.gz
        {sampleB1}.bam
        {sampleB1}.sorted.bam
        {sampleB1}.sorted.bam.bai
        ...
    ...
```

* [05-stacks](05-stacks) : outputs from [gstacks](http://catchenlab.life.illinois.edu/stacks/comp/gstacks.php)
* [06-populations](06-populations) : outputs from [populations](http://catchenlab.life.illinois.edu/stacks/comp/populations.php)
* [08-genomes](08-genomes) : reference genome of each species `{species}` used for the analysis. The `.fasta` file is mandatory and stores all the scaffold sequences of the `{species}` genome assembly. `.amb`, `.ann`, `.bwt`, `.pac`, `.sa` are index files required by [BWA 0.7.17](https://icb.med.cornell.edu/wiki/index.php/Elementolab/BWA_tutorial). They will be automatically generated if absent. The data must be stored this way:

```
08-genomes/
    {species}_genome.amb
    {species}_genome.ann
    {species}_genome.bwt
    {species}_genome.fasta
    {species}_genome.pac
    {species}_genome.sa
```

* [10-logs](10-logs) : log files generated by every command
    - process_radtags
    - clone_filter
    - genome_alignment
    - gstacks
    - populations

# 3. Reporting bugs

If you're sure you've found a bug (e.g. one of my programs crashes with an obscure error message, or the resulting file is missing part of the original data), then by all means submit a bug report.

I use [GitLab's issue system](http://gitlab.mbb.univ-montp2.fr/reservebenefit/snakemake_stacks2/issues) as my bug database. You can submit your bug reports there. Please be as verbose as possible, e.g. include the command line, etc.

# 4. Running the pipeline

## 4.1 Initialisation

* Open a shell.
* Make a folder and name it as you like (here it is named `workdir`):

```
mkdir workdir
cd workdir
```

* Clone the project and switch to the main folder, which is your working directory:

```
git clone http://gitlab.mbb.univ-montp2.fr/reservebenefit/snakemake_stacks2.git
cd snakemake_stacks2
```

## 4.2 Configuration

WORK IN PROGRESS !!!!
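
Until this section is complete, a quick sanity check of the `{species}_infos.csv` files described in [Data Files](#22-data-files) may help before running the pipeline. The sketch below is not part of the repository: the function name `check_infos_csv` is hypothetical, and the only assumption it makes is that every non-blank row holds exactly 4 comma-separated fields (run, pool, barcode, ID). It does not check for a header row or validate the field values.

```shell
# Hypothetical helper (not shipped with the pipeline).
# Usage: check_infos_csv path/to/{species}_infos.csv
# Prints one message per malformed row and returns a non-zero status
# if any non-blank row does not contain exactly 4 comma-separated fields.
check_infos_csv() {
    awk -F',' '
        /^[[:space:]]*$/ { next }    # ignore blank lines
        NF != 4 {
            printf "line %d: expected 4 fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

For instance, `check_infos_csv 01-info_files/speciesA_infos.csv && echo OK` (where `speciesA` is a placeholder species name) prints `OK` only when every row has four fields.
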
## 4.3 Run the pipeline into a single command

Once you have finished the [Initialisation](#41-initialisation) and [Configuration](#42-configuration) steps, you can run the whole pipeline by simply typing:

```
## number of CPU cores available for running the pipeline (for instance here 64 cores)
N_CORES=64
## run the pipeline into a single command
bash main.sh $N_CORES
```

## 4.4 Run the pipeline step by step

WORK IN PROGRESS !!!!

That's it! The pipeline is running and crunching your data. Check the log folder and the output folders once the pipeline has finished.
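
Once the run has finished, one rough way to look through the log folder is to list the log files that mention the word "error". This is only a sketch under assumptions: the function name `find_logs_with_errors` is hypothetical, the layout and naming of the files inside `10-logs` are not documented here, and the keyword search is a crude filter (a line such as "0 errors" would also match), so treat its output as a list of files to read, not as a verdict.

```shell
# Hypothetical helper (not shipped with the pipeline).
# Usage: find_logs_with_errors [log_dir]    (log_dir defaults to 10-logs)
# Prints the path of every file under log_dir that contains "error",
# ignoring case; prints nothing when no file matches.
find_logs_with_errors() {
    log_dir="${1:-10-logs}"
    if [ ! -d "$log_dir" ]; then
        echo "no such directory: $log_dir" >&2
        return 1
    fi
    # grep exits non-zero when nothing matches; that is not a failure here
    find "$log_dir" -type f -exec grep -il "error" {} + || true
}
```

Running `find_logs_with_errors` from the `snakemake_stacks2` working directory scans `10-logs`; pass another directory as the first argument to scan it instead.
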