README.md 10.4 KB
Newer Older
eboulanger's avatar
eboulanger committed
1
2
# seaConnect--dataPrep

eboulanger's avatar
eboulanger committed
3
Scripts to prepare GBS data for analysis. 
eboulanger's avatar
eboulanger committed
4
- filtering steps for dDocent SNP output
eboulanger's avatar
eboulanger committed
5
- outlier detection
eboulanger's avatar
eboulanger committed
6
- file conversions, subsetting and renaming 
eboulanger's avatar
eboulanger committed
7

eboulanger's avatar
eboulanger committed
8
9
10
11
## Dependencies
You will need to install the following software:  
- [VCFtools](https://vcftools.github.io)
- [BCFtools](https://samtools.github.io/bcftools/)
eboulanger's avatar
eboulanger committed
12
- [PLINK v1.9](https://www.cog-genomics.org/plink/1.9/)
eboulanger's avatar
readme    
eboulanger committed
13
- [PGDSpider](http://www.cmpg.unibe.ch/software/PGDSpider/)
eboulanger's avatar
eboulanger committed
14
15

You will need to have the following R packages:  
eboulanger's avatar
readme    
eboulanger committed
16
adegenet, polysat, pegas, vcfR, hierfstat, coda, radiator, pcadapt
eboulanger's avatar
eboulanger committed
17
18


eboulanger's avatar
eboulanger committed
19
## 01-SNPfiltering 
eboulanger's avatar
eboulanger committed
20
21
22
23
24
25
26
27
28
29

script adapted from [ddocent tutorial](https://www.ddocent.com/filtering/) and additions

### run the script
move to the correct directory and make the output directories
```
cd ~/Documents/project_SEACONNECT/seaConnect--dataPrep/01-SNPfilters/
mkdir 01-Diplodus
mkdir 02-Mullus
```
eboulanger's avatar
eboulanger committed
30
31
32
33
34
start a log of the terminal output
```
script 01-Diplodus/filtering.log
script 02-Mullus/filtering.log
```
eboulanger's avatar
eboulanger committed
35
36
37
38
set the species-specific arguments for the script to run on  
  $1 = input file (specify path if not in current directory)  
  $2 = output folder  
  $3 = species code  
eboulanger's avatar
eboulanger committed
39
40
41
42
```
bash filtering.sh ../00-rawData/01-Diplodus/sar_ddocent.vcf 01-Diplodus dip
bash filtering.sh ../00-rawData/02-Mullus/mullus.vcf 02-Mullus mul
```
eboulanger's avatar
eboulanger committed
43
44
45
46
when all steps are finished: finish the terminal output log
```
exit
```
eboulanger's avatar
eboulanger committed
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63

### SNP filtering results for Mullus surmuletus					
				
| filtering step | filter for                                        | individuals retained | SNPs retained | run time (sec) | output |
| -------------- | ------------------------------------------------- | :------------------: | :-----------: | :------------: | ------ |
| step 0         | ddocent output data                               | 431                  | 49027         |                | mullus.vcf
| step 1         | call below 50%, mac < 3, min quality score = 30   | 431                  | 49027         | 71.00          | mullus.g5mac3.recode.vcf
| step 2         | min mean depth genotype call = 3                  | 431                  | 49027         | 74.00          | mullus.g5mac3dp3.recode.vcf
| step 3         | individuals > 50% missing data                    | 424                  | 49027         | 70.00          | mullus.g5mac3dplm.recode.vcf
| step 4         | sites > 5% missing data, maf 0.05, min meanDP = 5 | 424                  | 18727         | 14.00          | DP3g95maf05.recode.vcf
| step 5         | filter for allele balance                         |                      | 17965         |                | DP3g95maf05.fil1.vcf
| step 6         | filter out sites with reads from both strands     |                      | **SKIP**      |	             | 
| step 7         | ration mapping qualities ref vs alternate alleles |                      | 17546         |                | DP3g95maf05.fil3.vcf
| step 8         | paired status                                     |                      | 17546         |                | DP3g95maf05.fil4.vcf
| step 9         | remove sites quality score < 1/4 depth            |                      | 17546         |                | DP3g95maf05.fil5.vcf
| step 10        | depth x quality score cutoff	                     | 424                  | 15466         |	             | 
| step 11        | He > 0.6 & Fis > 0.5 & Fix < -0.5                 | 424                  | 15232         | 25 min         | DP3g95maf05.FIL.HFis.recode.vcf
eboulanger's avatar
eboulanger committed
64
| step 12        | remove extreme outliers individual O HET          | 413                  | 15232         | 23.00          | DP3g95maf05.FIL.HFis.indHet.recode.vcf
eboulanger's avatar
eboulanger committed
65
| step 13        | rename                                            |                      |               |                | mul_all_filtered_origid.vcf
eboulanger's avatar
eboulanger committed
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82

### SNP filtering results for Diplodus sargus								
				
| filtering step | filter for                                        | individuals retained | SNPs retained | run time (sec) | output |
| -------------- | ------------------------------------------------- | :------------------: | :-----------: | :------------: | ------ |
| step 0         | ddocent output data                               | 297                  | 13362         |                | sar_ddocent_.vcf
| step 1         | call below 50%, mac < 3, min quality score = 30   | 297                  | 13362         | 14.00          | g5mac3.recode.vcf
| step 2         | min mean depth genotype call = 3                  | 297                  | 13362         | 14.00          | g5mac3dp3.recode.vcf
| step 3         | individuals > 50% missing data                    | 297                  | 13362         | 16.00          | g5mac3dplm.recode.vcf
| step 4         | sites > 5% missing data, maf 0.05, min meanDP = 5 | 297                  | 10389         | 11.00          | DP3g95maf05.recode.vcf
| step 5         | filter for allele balance                         | 297                  | 10202         |                | DP3g95maf05.fil1.vcf
| step 6         | filter out sites with reads from both strands     | SKIP                 | **SKIP**      |                | SKIP
| step 7         | ration mapping qualities ref vs alternate alleles | 297                  | 9689          |                | DP3g95maf05.fil3.vcf
| step 8         | paired status                                     | 297                  | 9689          |                | DP3g95maf05.fil4.vcf
| step 9         | remove sites quality score < 1/4 depth            | 297                  | 9688          |                | DP3g95maf05.fil5.vcf
| step 10        | depth x quality score cutoff	                     | 297                  | 8325          | 11.00          | 
| step 11        | He > 0.6 & Fis > 0.5 & Fix < -0.5                 | 297                  | 8206          | 27 min         | DP3g95maf05.FIL.HFis.recode.vcf
eboulanger's avatar
eboulanger committed
83
84
| step 12        | remove extreme outliers individual O HET          | 297                  |               |                | DP3g95maf05.FIL.HFis.indHet.recode.vcf
| step 13        | rename                                            |                      |               |                | dip_all_filtered_origid.vcf 
eboulanger's avatar
eboulanger committed
85

eboulanger's avatar
eboulanger committed
86
87
88
89
90
91
92
93
94
95
96
### rename individuals for conventional naming system and create population map
this is adapted for our case, review R script or skip according to naming needs
```
bash renaming.sh 01-Diplodus/ dip
bash renaming.sh 02-Mullus/ mul
```
output:
- dip_all_filtered.vcf
- dip_popualtion_map_297ind.txt
- mul_all_filtered.vcf
- mul_population_map_413ind.txt
eboulanger's avatar
eboulanger committed
97
98
99
100
101
102
103
104
105

## 02-Bayescan

Detect Fst outliers by bayesian inference with the [BayeScan software](http://cmpg.unibe.ch/software/BayeScan/)

### step 1: convert vcf files to Bayescan .txt files

This script will load your vcf file, determine the population identifier for each 
individual, and return a bayescan .txt inputfile
eboulanger's avatar
eboulanger committed
106
107
108
109
110
set arguments :    
  $1 = input file (vcf)  
  $2 = species code  
The script is currently set to detect and assign two populations (K). Run the script 
interactively if you want to determine and assign other values of K.
eboulanger's avatar
eboulanger committed
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137

#### for mullus :
```
Rscript --vanilla Bayescan_input.R ../01-SNPfilters/02-Mullus/mul_all_filtered.vcf mul
```

#### for diplodus :
```
Rscript --vanilla Bayescan_input.R ../01-SNPfilters/01-Diplodus/dips_all_filtered.vcf dip
```

### step 2: determine outliers with Bayescan

download and compile Bayescan from [here](http://cmpg.unibe.ch/software/BayeScan/download.html) 
and copy the executable to your local bin folder:
```
cp ~/programmes/BayeScan2.1/binaries/BayeScan2.1_macos64bits /usr/local/bin/
```
#### run the BayeScan model for mullus and diplodus data
multiple runs with different parameters and different input data, with two chains per run 
to compare convergence later
```
bash run_bayescan.sh
```
### step 3: verify convergence and extract outliers
Run interactive R script called `Bayescan_evaluation.R`

eboulanger's avatar
eboulanger committed
138
The script also extracts outlier lists for the different runs and export loci positions for later subsetting (with run index)
eboulanger's avatar
eboulanger committed
139
140
141
142
143
144
145
146
147
148
149

## 03-PCAdapt

detect outliers with [PCAdapt](https://bcm-uga.github.io/pcadapt/articles/pcadapt.html) in 
R and retrieve loci positions from vcf file

### Run the PCAdapt.R script

This script will load your vcf file, detect outlier loci using the package pcadapt and
output a list of outlier loci positions that can be used to subset your vcf file

eboulanger's avatar
eboulanger committed
150
151
152
set arguments :  
$1 = input file (vcf)  
$2 = species code  
eboulanger's avatar
eboulanger committed
153
154
155
156
157
158
159
160
161

#### for mullus :
```
Rscript --vanilla PCAdapt.R ../01-SNPfilters/02-Mullus/mul_all_filtered.vcf mul
```
see how many outliers were detected :
```
wc -l outl_pos_pcadpt_mul.txt
```
eboulanger's avatar
eboulanger committed
162
1327 outliers detected for mullus
eboulanger's avatar
eboulanger committed
163
164
165
166
167
168
169
170
171
172
173

#### for diplodus :
```
Rscript --vanilla PCAdapt.R ../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf dip

wc -l outl_pos_pcadpt_dip.txt
```
388 outliers detected for diplodus

## 04-finalPrep

eboulanger's avatar
eboulanger committed
174
Final steps to get neutral and adaptive SNP sets and correct file formats for both species.
eboulanger's avatar
eboulanger committed
175

eboulanger's avatar
eboulanger committed
176
177
178
179
This script subsets the filtered vcf file from 01-SNPfilters by outlier positions detected 
in 02-BayeScan and 03-PCAdapt.  
It also subsets the same vcf file for the remaining neutral positions and applies a final 
filter for HWE.
eboulanger's avatar
eboulanger committed
180

eboulanger's avatar
eboulanger committed
181
182
183
184
Finally, the script converts the final adaptive and neutral .vcf files in .bed .bim .fam .raw
and .strct_in format necessary for downstream analyses.

The conversion to genepop format for use of GENODIVE (to calculate kinship) is done with
eboulanger's avatar
readme    
eboulanger committed
185
186
187
the PGDSpider GUI.   
input: neutral.vcf and population map, output: neutral.gen.txt  
outputted .gen.txt file: add information on first line (otherwise genodive won't recognise format)  
eboulanger's avatar
eboulanger committed
188

eboulanger's avatar
eboulanger committed
189
To run the script, set arguments:  
eboulanger's avatar
eboulanger committed
190
191
  $1 = input file (vcf)  
  $2 = species code  
eboulanger's avatar
eboulanger committed
192
  $3 = bayescan run index
eboulanger's avatar
eboulanger committed
193
194
195
  
#### for diplodus
```
eboulanger's avatar
eboulanger committed
196
197
198
199
200
mkdir 01-Diplodus
cp ../01-SNPfilters/01-Diplodus/dip_population_map_*.txt 01-Diplodus/
cp ../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf 01-Diplodus/

bash outlier_positions.sh 01-Diplodus/dip_all_filtered.vcf dip run1
eboulanger's avatar
eboulanger committed
201
```
eboulanger's avatar
eboulanger committed
202
203
204
205

In total, 494 outlier loci were detected, with 10 loci detected by both the BayeScan and PCAdapt method.
After HWE filter, 7570 neutral loci were retained.

eboulanger's avatar
eboulanger committed
206
207
#### for Mullus
```
eboulanger's avatar
eboulanger committed
208
209
210
211
212
mkdir 02-Mullus
cp ../01-SNPfilters/02-Mullus/mul_population_map_*.txt 02-Mullus/
cp ../01-SNPfilters/02-Mullus/mul_all_filtered.vcf 02-Mullus/

bash outlier_positions.sh 02-Mullus/mul_all_filtered.vcf mul run1
eboulanger's avatar
eboulanger committed
213
214
215
216
```
In total, 2680 adaptive loci were detected, with 10 loci detected by both the BayeScan and PCAdapt method.
After HWE filter, 12432 neutral loci were retained.

eboulanger's avatar
eboulanger committed
217

eboulanger's avatar
eboulanger committed
218
219
220
#### clean up files to separate directories
```
mv dip_* 01-Diplodus/
eboulanger's avatar
eboulanger committed
221
222
mv mul_* 02-Mullus/
```
eboulanger's avatar
eboulanger committed
223