# STACKS2 using SNAKEMAKE Workflow

RADseq workflow using [STACKS2](http://creskolab.uoregon.edu/stacks/).
It was designed to process RADseq data from the [RESERVEBENEFIT](https://www.biodiversa.org/1023) project.



# Table of contents

1. [Introduction](#1-introduction)
2. [Installation](#2-installation)
    1. [Prerequisite](#21-prerequisite)
    2. [Data Files](#22-data-files)
    3. [Set up](#23-set-up)
3. [Reporting bugs](#3-reporting-bugs)
4. [Running the pipeline](#4-running-the-pipeline)
    1. [Initialisation](#41-initialisation)
    2. [Configuration](#42-configuration)
    3. [Run the pipeline into a single command](#43-run-the-pipeline-into-a-single-command)
    4. [Run the pipeline step by step](#44-run-the-pipeline-step-by-step)


# 1. Introduction

This pipeline uses the workflow management system [snakemake](https://bitbucket.org/snakemake/snakemake), so you will need to install it.

![pipeline schema](schema_reservebenefit_stacks2.png)


# 2. Installation


## 2.1 Prerequisite
You must install the following software and packages:

- [SNAKEMAKE 5.3.0](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html)
    * Check the version and that the program is correctly installed by typing:

    ```
    snakemake --version
    ## should give you the output
    5.3.0
    ```

- [STACKS 2.2](http://catchenlab.life.illinois.edu/stacks/)
    * Check the version and that the programs are correctly installed by typing:

    ```
    process_radtags --version
    clone_filter --version
    gstacks --version
    populations --version
    ## should give you the output
    2.2
    ```

- [BWA 0.7.17](https://icb.med.cornell.edu/wiki/index.php/Elementolab/BWA_tutorial)
    * Download `bwa` at: http://sourceforge.net/projects/bio-bwa/files/

    ```
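    tar -xvf bwa-x.x.x.tar.bz2
    cd bwa-x.x.x
    ## bwa ships a plain Makefile: there is no configure script and no install target
    make
    ## copy the compiled binary to a folder listed in your PATH
    cp bwa /where/to/install/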
    ```
    * Check the version and that the program is correctly installed by typing:

    ```
    bwa
    ## should give you the output
    Program: bwa (alignment via Burrows-Wheeler transformation)
    Version: 0.7.17-r1188
    ...
    ```
- [SAMTOOLS 1.9](http://www.htslib.org/)
    * Download `htslib` and `samtools` at: http://www.htslib.org/download/
    * Build each package from source as follows:

    ```
    cd htslib-1.x
    ./configure --prefix=/where/to/install
    make
    make install
    cd ..
    ## and similarly for samtools :
    cd samtools-1.x
    ./configure --prefix=/where/to/install
    make
    make install
    ```
    * Check the version and that the programs are correctly installed by typing:

    ```
    samtools --version
    ## should give you the output
    samtools 1.9
    Using htslib 1.9
    Copyright (C) 2018 Genome Research Ltd.
    ```
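
Before going further, it can help to confirm in one pass that every tool listed above is reachable from your `PATH`. The following is a minimal sketch, not part of the pipeline: `find_missing_tools` and `REQUIRED_TOOLS` are hypothetical names introduced here for illustration, and the check only tests presence, not versions.

```shell
# List the command-line tools this README asks you to install
REQUIRED_TOOLS="snakemake process_radtags clone_filter gstacks populations bwa samtools"

# Print the name of every tool that cannot be found in PATH
# (find_missing_tools is a hypothetical helper, not part of the pipeline)
find_missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# $REQUIRED_TOOLS is deliberately unquoted so that it splits into one word per tool
MISSING=$(find_missing_tools $REQUIRED_TOOLS)
if [ -n "$MISSING" ]; then
    echo "Missing tools:" $MISSING
else
    echo "All required tools were found in PATH"
fi
```

If a tool is reported as missing although you installed it, the folder given to `--prefix` (or the folder you copied the binary to) is probably not in your `PATH`.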

## 2.2 Data Files
The file names below use the following wildcards:
- `{run}` : any run
- `{pool}` : any pool within a run
- `{species}` : any species

The included data files are:

* [config.yaml](01-info_files/config.yaml) : defines the configuration parameters, and their values, used by the commands of each pipeline step.
* [barcodes.txt](01-info_files/barcodes.txt) : barcodes used for each {pool} within a {run}.
* [{species}_infos.csv](01-info_files) : `.csv` table of information about {species}. Each row is a sample and there are 4 columns: run, pool, barcode, ID.
* [{species}_populations_map.txt](01-info_files) : `.tsv` table of information about {species}. Each row is a sample and there are 2 columns: ID, population. This file can be generated by the pipeline (see the [Configuration](#42-configuration) section). However, we strongly recommend that you create it manually.
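
If you write the populations map by hand, a quick consistency check against the infos table may catch typos early. The sketch below works on made-up files in a temporary folder: the species name, barcodes, sample IDs and population names are all hypothetical, and it assumes that neither table has a header line, which this README does not specify.

```shell
# Work in a temporary directory so that nothing in 01-info_files is touched
TMP_DIR=$(mktemp -d)
INFOS="$TMP_DIR/speciesA_infos.csv"
POPMAP="$TMP_DIR/speciesA_populations_map.txt"

# Hypothetical content: comma-separated run,pool,barcode,ID (assumed to have no header line)
printf 'runA,poolA1,ACGTAC,sampleA1\nrunA,poolA1,TGCATG,sampleA2\n' > "$INFOS"
# Hypothetical content: tab-separated ID,population (assumed to have no header line)
printf 'sampleA1\tpop1\nsampleA2\tpop2\n' > "$POPMAP"

# Count the rows of the .csv table that do not have exactly 4 fields
BAD_ROWS=$(awk -F',' 'NF != 4 {n++} END {print n+0}' "$INFOS")

# List the IDs of the populations map that are absent from the 4th column of the .csv table
cut -d',' -f4 "$INFOS" > "$TMP_DIR/known_ids.txt"
UNKNOWN_IDS=$(cut -f1 "$POPMAP" | grep -v -x -F -f "$TMP_DIR/known_ids.txt" || true)

echo "rows with a wrong number of fields: $BAD_ROWS"
echo "IDs missing from the infos table: ${UNKNOWN_IDS:-none}"
```

To run it on real data, point `INFOS` and `POPMAP` at your own files in `01-info_files` and drop the two `printf` lines.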

## 2.3 Set Up

Clone the project and switch to its main folder, which is your working directory:
```
git clone http://gitlab.mbb.univ-montp2.fr/reservebenefit/snakemake_stacks2.git
cd snakemake_stacks2
```
You will see the following folders:

* [00-scripts](00-scripts): contains all the required scripts to run the whole pipeline
* [01-info_files](01-info_files) : contains all the required data files (see the [Data Files](#22-data-files) section above)
* [02-raw](02-raw) : must contain your data from paired-end Illumina sequencing runs. The data must be stored this way:
    ```
    02-raw/
        runA/
            poolA1/
                {poolA1}_R1_001.fastq.gz
                {poolA1}_R2_001.fastq.gz
            poolA2/
                {poolA2}_R1_001.fastq.gz
                {poolA2}_R2_001.fastq.gz
            ...
        runB/
            poolB1/
                {poolB1}_R1_001.fastq.gz
                {poolB1}_R2_001.fastq.gz
            ...
        ...        
    ```
* [03-samples](03-samples): stores the results of demultiplexing with [process_radtags](http://catchenlab.life.illinois.edu/stacks/comp/process_radtags.php) and of clone filtering with [clone_filter](http://catchenlab.life.illinois.edu/stacks/comp/clone_filter.php). The data are stored this way:
    ```
    03-samples/
        runA/
            poolA1/
                sample_{barcode1}.1.fq.gz
                sample_{barcode1}.2.fq.gz
                sample_{barcode2}.1.fq.gz
                sample_{barcode2}.2.fq.gz
                sample_{barcode3}.1.fq.gz
                sample_{barcode3}.2.fq.gz
                ...
            poolA1_clone_filtered/
                sample_{barcode1}.1.1.fq.gz
                sample_{barcode1}.2.2.fq.gz
                sample_{barcode2}.1.1.fq.gz
                sample_{barcode2}.2.2.fq.gz
                sample_{barcode3}.1.1.fq.gz
                sample_{barcode3}.2.2.fq.gz
                ...
            poolA2/
                sample_{barcode1}.1.fq.gz
                sample_{barcode1}.2.fq.gz
                ...
            poolA2_clone_filtered/
                sample_{barcode1}.1.1.fq.gz
                sample_{barcode1}.2.2.fq.gz
                ...
            ...
        runB/
            poolB1/
                sample_{barcode1}.1.fq.gz
                sample_{barcode1}.2.fq.gz
                ...
            poolB1_clone_filtered/
                sample_{barcode1}.1.1.fq.gz
                sample_{barcode1}.2.2.fq.gz
                ...
            ...
        ...        
    ```
* [04-all_samples](04-all_samples): paired-end `fastq.gz` files are renamed according to the information in [{species}_infos.csv](01-info_files). The reads are then aligned to the reference genome sequences stored in [08-genomes](08-genomes). This folder contains the renamed fastq files and the corresponding `.bam` alignment files. `.sorted.bam` files are sorted alignments and `.sorted.bam.bai` files are their indexes. The data are stored this way:
    ```
    04-all_samples/
        speciesA/
           {sampleA1}.1.fq.gz
           {sampleA1}.2.fq.gz
           {sampleA1}.bam
           {sampleA1}.sorted.bam
           {sampleA1}.sorted.bam.bai
           {sampleA2}.1.fq.gz
           {sampleA2}.2.fq.gz
           {sampleA2}.bam
           {sampleA2}.sorted.bam
           {sampleA2}.sorted.bam.bai
           ...
        speciesB/
           {sampleB1}.1.fq.gz
           {sampleB1}.2.fq.gz
           {sampleB1}.bam
           {sampleB1}.sorted.bam
           {sampleB1}.sorted.bam.bai
           ...
        ...        
    ```
* [05-stacks](05-stacks) : outputs from [gstacks](http://catchenlab.life.illinois.edu/stacks/comp/gstacks.php)
* [06-populations](06-populations) : outputs from [populations](http://catchenlab.life.illinois.edu/stacks/comp/populations.php)
* [08-genomes](08-genomes) : reference genome of each species {species} used for the analysis. The `.fasta` file is mandatory and stores all the scaffold sequences of the {species} genome assembly. `.amb`, `.ann`, `.bwt`, `.pac`, `.sa` are index files required by [BWA 0.7.17](https://icb.med.cornell.edu/wiki/index.php/Elementolab/BWA_tutorial). They are generated automatically if absent. The data must be stored this way:
    ```
    08-genomes/
          {species}_genome.amb
          {species}_genome.ann
          {species}_genome.bwt
          {species}_genome.fasta
          {species}_genome.pac
          {species}_genome.sa       
    ```
* [10-logs](10-logs) : log files generated by every command
    - process_radtags
    - clone_filter
    - genome_alignment
    - gstacks
    - populations
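
Since the pipeline expects a strict `run/pool` layout in `02-raw`, with one `_R1_001.fastq.gz` and one `_R2_001.fastq.gz` file per pool, a small check before launching may save a failed run. This is only a sketch: `check_raw_layout` is a hypothetical helper that is not shipped with the project, and it is demonstrated here on a mock tree of empty files rather than on real data.

```shell
# Print every pool folder that lacks an R1 or an R2 file
# (check_raw_layout is a hypothetical helper, not part of the pipeline)
check_raw_layout() {
    raw_dir=$1
    for pool_dir in "$raw_dir"/*/*/; do
        # Skip the unexpanded pattern when there is no run/pool folder at all
        [ -d "$pool_dir" ] || continue
        for mate in R1 R2; do
            # ls fails when no file matches the pattern
            if ! ls "$pool_dir"*_"$mate"_001.fastq.gz >/dev/null 2>&1; then
                echo "${pool_dir%/}: no ${mate} file"
            fi
        done
    done
}

# Demonstration on a mock tree with empty files (run and pool names are made up)
MOCK_RAW=$(mktemp -d)/02-raw
mkdir -p "$MOCK_RAW/runA/poolA1" "$MOCK_RAW/runA/poolA2"
touch "$MOCK_RAW/runA/poolA1/poolA1_R1_001.fastq.gz" "$MOCK_RAW/runA/poolA1/poolA1_R2_001.fastq.gz"
touch "$MOCK_RAW/runA/poolA2/poolA2_R1_001.fastq.gz"

# Only poolA2 is reported, because its R2 file is missing
check_raw_layout "$MOCK_RAW"
```

On a real project you would call `check_raw_layout 02-raw` from the working directory; no output means every pool has both files.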

# 3. Reporting bugs

If you're sure you've found a bug, e.g. if one of my programs crashes
with an obscure error message, or if the resulting file is missing part
of the original data, then by all means submit a bug report.

I use [GitLab's issue system](http://gitlab.mbb.univ-montp2.fr/reservebenefit/snakemake_stacks2/issues)
as my bug database. You can submit your bug reports there. Please be as
detailed as possible, e.g. include the command line you ran and the error message.


# 4. Running the pipeline

## 4.1 Initialisation

* Open a shell.
* Create a folder with a name of your choice (here, `workdir`) and move into it:

```
mkdir workdir
cd workdir
```
* Clone the project and switch to its main folder, which is your working directory:

```
git clone http://gitlab.mbb.univ-montp2.fr/reservebenefit/snakemake_stacks2.git
cd snakemake_stacks2
```
## 4.2 Configuration
WORK IN PROGRESS !!!!

## 4.3 Run the pipeline into a single command
Once you have finished the [Initialisation](#41-initialisation) and [Configuration](#42-configuration) steps, you can run the whole pipeline by typing:
```
## number of CPU cores available for running the pipeline (for instance here 64 cores)
N_CORES=64
## run the pipeline into a single command
bash main.sh $N_CORES
```
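
Rather than hard-coding the number of cores, you may prefer to let the shell detect it. The variant below is a sketch under the assumption that `getconf _NPROCESSORS_ONLN` is available on your system (it is on most Linux and macOS installations); it falls back to a single core otherwise, and only launches `main.sh` when that file is present in the current folder.

```shell
# Ask the system how many CPU cores are online; fall back to 1 if the query fails
N_CORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Also fall back to 1 if the answer is empty or is not a plain number
case "$N_CORES" in
    ''|*[!0-9]*) N_CORES=1 ;;
esac

echo "Using $N_CORES cores"

# Launch the pipeline only when main.sh is present in the current directory
if [ -f main.sh ]; then
    bash main.sh "$N_CORES"
else
    echo "main.sh not found: run this from the snakemake_stacks2 folder"
fi
```

On a shared machine you will probably want to pass a smaller number than the detected value, so that other users keep some cores.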


## 4.4 Run the pipeline step by step
WORK IN PROGRESS !!!!

That's it! The pipeline is now running and crunching your data. Once it has finished, look at the [10-logs](10-logs) folder and at the output folders.

peguerin's avatar
peguerin committed
273
274
275