# STACKS2 using SNAKEMAKE Workflow

RADseq workflow using [STACKS2](http://creskolab.uoregon.edu/stacks/).
It was designed to process RADseq data from the [RESERVEBENEFIT](https://www.biodiversa.org/1023) project.



# Table of contents

1. [Introduction](#1-introduction)
2. [Installation](#2-installation)
    1. [Prerequisite](#21-prerequisite)
    2. [Data Files](#22-data-files)
    3. [Set up](#23-set-up)
3. [Reporting bugs](#3-reporting-bugs)
4. [Running the pipeline](#4-running-the-pipeline)
    1. [Initialisation](#41-initialisation)
    2. [Configuration](#42-configuration)
    3. [Run the pipeline into a single command](#43-run-the-pipeline-into-a-single-command)
    4. [Run the pipeline step by step](#44-run-the-pipeline-step-by-step)


# 1. Introduction

blablabla


# 2. Installation


## 2.1 Prerequisite
You must install the following software and packages :

- [SNAKEMAKE 5.3.0](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html)
    * Check the version and that the program is correctly installed by typing :

    ```
    snakemake --version
    ## should give you the output
    5.3.0
    ```

- [STACKS 2.2](http://catchenlab.life.illinois.edu/stacks/)
    * Check the version and that the programs are correctly installed by typing :

    ```
    process_radtags --version
    clone_filter --version
    gstacks --version
    populations --version
    ## should give you the output
    2.2
    ```

- [BWA 0.7.17](https://icb.med.cornell.edu/wiki/index.php/Elementolab/BWA_tutorial)
    * Download `bwa` at: http://sourceforge.net/projects/bio-bwa/files/

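    ```
    tar -xvf bwa-x.x.x.tar.bz2
    cd bwa-x.x.x
    ## bwa ships a plain Makefile: there is no configure script and no install target
    make
    ## copy the resulting binary into a folder listed in your PATH
    cp bwa /where/to/install/bin/
    ```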
    * Check the version and that the program is correctly installed by typing :

    ```
    bwa
    ## should give you the output
    Program: bwa (alignment via Burrows-Wheeler transformation)
    Version: 0.7.17-r1188
    ...
    ```
- [SAMTOOLS 1.9](http://www.htslib.org/)
    * Download `htslib` and `samtools` at : http://www.htslib.org/download/
    * Building each desired package from source is very simple:

    ```
    cd htslib-1.x
    ./configure --prefix=/where/to/install
    make
    make install
    cd ..
    ## and similarly for samtools :
    cd samtools-1.x
    ./configure --prefix=/where/to/install
    make
    make install
    ```
    * Check the version and that the programs are correctly installed by typing :

    ```
    samtools --version
    ## should give you the output
    samtools 1.9
    Using htslib 1.9
    Copyright (C) 2018 Genome Research Ltd.
    ```
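
The individual version checks above can be complemented by a single pass over all required programs. The following is a minimal sketch: `check_tools` is a hypothetical helper that is not part of this repository, and it only tests that each name is found in your `PATH`, not that the version matches.

```shell
# Report which of the given programs are found in the PATH.
# check_tools is a hypothetical helper, not part of the pipeline.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # non-zero exit status when at least one program is missing
    return "$missing"
}
```

For the prerequisites listed here, the call would be `check_tools snakemake process_radtags clone_filter gstacks populations bwa samtools`.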

## 2.2 Data Files
The file and folder names below use the following wildcards :
- `{run}` : any sequencing run
- `{pool}` : any pool within a run
- `{species}` : any species

The included data files are :

* [config.yaml](01-info_files/config.yaml) : defines a dictionary of configuration parameters and the values used by the commands of each pipeline step.
* [barcodes.txt](01-info_files/barcodes.txt) : file containing the barcodes used for each {pool} of a {run}.
* [{species}_infos.csv](01-info_files) : `.csv` information table for {species}. Each row is a sample and there are 4 columns: run, pool, barcode, ID.
* [{species}_populations_map.txt](01-info_files) : `.tsv` information table for {species}. Each row is a sample and there are 2 columns: ID, population. This file can be generated by the pipeline (see the [Configuration](#42-configuration) section). However, we strongly recommend that you create it manually.
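
Since every row of `{species}_infos.csv` is expected to carry exactly 4 columns, a quick check before running the pipeline can catch malformed rows. This is a minimal sketch, assuming a plain comma-separated file without quoted fields; `check_infos_csv` is a hypothetical helper that is not part of this repository.

```shell
# Print every non-empty row that does not have exactly 4 comma-separated fields.
# Assumption: fields are never quoted and never contain a comma.
check_infos_csv() {
    awk -F',' '
        NF == 0 { next }    # skip empty lines
        NF != 4 { printf "line %d: expected 4 fields, found %d\n", NR, NF; bad = 1 }
        END     { exit bad + 0 }
    ' "$1"
}
```

The function takes the path of the table as its only argument and returns a non-zero status when at least one row is malformed.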

## 2.3 Set Up

Clone the project and switch to its main folder, which is your working directory :
```
git clone http://gitlab.mbb.univ-montp2.fr/reservebenefit/snakemake_stacks2.git
cd snakemake_stacks2
```
You will see the following folders :

* [00-scripts](00-scripts): contains all the required scripts to run the whole pipeline
* [01-info_files](01-info_files) : contains all the required data files (see the [Data Files](#22-data-files) section above)
* [02-raw](02-raw) : must contain your data from paired-end Illumina sequencing runs. The data must be stored this way :
    ```
    02-raw/
        runA/
            poolA1/
                {poolA1}_R1_001.fastq.gz
                {poolA1}_R2_001.fastq.gz
            poolA2/
                {poolA2}_R1_001.fastq.gz
                {poolA2}_R2_001.fastq.gz
            ...
        runB/
            poolB1/
                {poolB1}_R1_001.fastq.gz
                {poolB1}_R2_001.fastq.gz
            ...
        ...        
    ```
* [03-samples](03-samples): will store the results generated by demultiplexing with [process_radtags](http://catchenlab.life.illinois.edu/stacks/comp/process_radtags.php) and by clone filtering with [clone_filter](http://catchenlab.life.illinois.edu/stacks/comp/clone_filter.php). The data are stored this way :
   ```
    03-samples/
        runA/
            poolA1/
                sample_{barcode1}.1.fq.gz
                sample_{barcode1}.2.fq.gz
                sample_{barcode2}.1.fq.gz
                sample_{barcode2}.2.fq.gz
                sample_{barcode3}.1.fq.gz
                sample_{barcode3}.2.fq.gz
                ...
            poolA1_clone_filtered/
                sample_{barcode1}.1.1.fq.gz
                sample_{barcode1}.2.2.fq.gz
                sample_{barcode2}.1.1.fq.gz
                sample_{barcode2}.2.2.fq.gz
                sample_{barcode3}.1.1.fq.gz
                sample_{barcode3}.2.2.fq.gz
                ...
            poolA2/
                sample_{barcode1}.1.fq.gz
                sample_{barcode1}.2.fq.gz
                ...
            poolA2_clone_filtered/
                sample_{barcode1}.1.1.fq.gz
                sample_{barcode1}.2.2.fq.gz
                ...
            ...
        runB/
            poolB1/
                sample_{barcode1}.1.fq.gz
                sample_{barcode1}.2.fq.gz
                ...
            poolB1_clone_filtered/
                sample_{barcode1}.1.1.fq.gz
                sample_{barcode1}.2.2.fq.gz
                ...
            ...
        ...        
    ```
* [04-all_samples](04-all_samples): paired-end `fastq.gz` files are renamed according to the information in [{species}_infos.csv](01-info_files). Reads are then aligned onto the reference genome sequences stored in [08-genomes](08-genomes). This folder contains the renamed fastq files and the corresponding `.bam` alignment files. `.sorted.bam` files are sorted alignments and `.sorted.bam.bai` files are their indexes. The data are stored this way :
    ```
    04-all_samples/
        speciesA/
           {sampleA1}.1.fq.gz
           {sampleA1}.2.fq.gz
           {sampleA1}.bam
           {sampleA1}.sorted.bam
           {sampleA1}.sorted.bam.bai
           {sampleA2}.1.fq.gz
           {sampleA2}.2.fq.gz
           {sampleA2}.bam
           {sampleA2}.sorted.bam
           {sampleA2}.sorted.bam.bai
           ...
        speciesB/
           {sampleB1}.1.fq.gz
           {sampleB1}.2.fq.gz
           {sampleB1}.bam
           {sampleB1}.sorted.bam
           {sampleB1}.sorted.bam.bai
           ...
        ...        
    ```
* [05-stacks](05-stacks) : outputs from [gstacks](http://catchenlab.life.illinois.edu/stacks/comp/gstacks.php)
* [06-populations](06-populations) : outputs from [populations](http://catchenlab.life.illinois.edu/stacks/comp/populations.php)
* [08-genomes](08-genomes) : reference genome of each species {species} used for the analysis. The `.fasta` file is mandatory and stores all the scaffold sequences of the {species} genome assembly. `.amb`, `.ann`, `.bwt`, `.pac`, `.sa` are index files required by [BWA 0.7.17](https://icb.med.cornell.edu/wiki/index.php/Elementolab/BWA_tutorial). They will be automatically generated if absent. The data must be stored this way :
    ```
    08-genomes/
          {species}_genome.amb
          {species}_genome.ann
          {species}_genome.bwt
          {species}_genome.fasta
          {species}_genome.pac
          {species}_genome.sa       
    ```
* [10-logs](10-logs) : log files generated by every command
    - process_radtags
    - clone_filter
    - genome_alignment
    - gstacks
    - populations
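
Before launching the pipeline, it can help to verify that the input folders follow the layouts described above. The following is a minimal sketch under those layouts only; `check_raw_layout` and `missing_bwa_index` are hypothetical helpers that are not part of this repository.

```shell
# List pool folders that lack an R1 or an R2 file.
# Assumed layout: <raw_dir>/<run>/<pool>/*_R1_001.fastq.gz and *_R2_001.fastq.gz
check_raw_layout() {
    status=0
    for pool_dir in "$1"/*/*/; do
        [ -d "$pool_dir" ] || continue
        for read in R1 R2; do
            # ls fails when the pattern matches no file
            if ! ls "$pool_dir"*_"$read"_001.fastq.gz >/dev/null 2>&1; then
                echo "missing $read file in $pool_dir"
                status=1
            fi
        done
    done
    return "$status"
}

# Print the BWA index files that are absent for one species.
# Assumed naming: <genome_dir>/<species>_genome.<ext>
missing_bwa_index() {
    for ext in amb ann bwt pac sa; do
        [ -f "$1/${2}_genome.$ext" ] || echo "$1/${2}_genome.$ext"
    done
}
```

`check_raw_layout 02-raw` returns a non-zero status when a pool is incomplete, and `missing_bwa_index 08-genomes {species}` prints nothing when the index is complete. The pipeline is described as regenerating absent index files itself, so the second check is informative only; if you ever need to build the index by hand, `bwa index -p 08-genomes/{species}_genome 08-genomes/{species}_genome.fasta` produces files with the naming shown above.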

# 3. Reporting bugs

If you're sure you've found a bug, e.g. if one of my programs crashes
with an obscure error message, or if the resulting file is missing part
of the original data, then by all means submit a bug report.

I use [GitLab's issue system](http://gitlab.mbb.univ-montp2.fr/reservebenefit/snakemake_stacks2/issues)
as my bug database. You can submit your bug reports there. Please be as
verbose as possible, e.g. include the command line, etc.


# 4. Running the pipeline

## 4.1 Initialisation

* Open a shell.
* Make a folder with a name of your choice (here it is named `workdir`) :

```
mkdir workdir
cd workdir
```
* Clone the project and switch to its main folder, which is your working directory :

```
git clone http://gitlab.mbb.univ-montp2.fr/reservebenefit/snakemake_stacks2.git
cd snakemake_stacks2
```
## 4.2 Configuration
WORK IN PROGRESS !!!!

## 4.3 Run the pipeline into a single command
Once you have finished the [Initialisation](#41-initialisation) and [Configuration](#42-configuration) steps, you can run the whole pipeline by simply typing :
```
## number of CPU cores available for running the pipeline (for instance here 64 cores)
N_CORES=64
## run the pipeline into a single command
bash main.sh $N_CORES
```
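
If you would rather not hard-code the number of cores, it can be read from the machine. This is a minimal sketch, assuming `getconf _NPROCESSORS_ONLN` is available (it is on Linux and macOS, but it is not required by POSIX); the fallback to 1 core is a choice made for this example.

```shell
# Use every core the machine reports; fall back to 1 if detection fails.
# Assumption: getconf supports _NPROCESSORS_ONLN on this system.
N_CORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "running with $N_CORES cores"
# then launch the pipeline as above:
# bash main.sh "$N_CORES"
```

On a shared machine you may prefer a smaller, fixed value so that other users keep some cores available.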


## 4.4 Run the pipeline step by step
WORK IN PROGRESS !!!!

That's it! The pipeline is running and crunching your data. Once it has finished, look at the log folder and the output folders.
