README.md 4.14 KB
Newer Older
eboulanger's avatar
eboulanger committed
1
2
# seaConnect--dataPrep

eboulanger's avatar
eboulanger committed
3
Scripts to prepare GBS data for analysis. 
eboulanger's avatar
eboulanger committed
4
- filtering steps for dDocent SNP output
eboulanger's avatar
eboulanger committed
5
- outlier detection
eboulanger's avatar
eboulanger committed
6
- file conversions, subsetting and renaming 
eboulanger's avatar
eboulanger committed
7

eboulanger's avatar
eboulanger committed
8
9
10
11
## Dependencies
You will need to install the following software:  
- [VCFtools](https://vcftools.github.io)
- [BCFtools](https://samtools.github.io/bcftools/)
eboulanger's avatar
eboulanger committed
12
- [PLINK v1.9](https://www.cog-genomics.org/plink/1.9/)
eboulanger's avatar
readme    
eboulanger committed
13
- [PGDSpider](http://www.cmpg.unibe.ch/software/PGDSpider/)
eboulanger's avatar
eboulanger committed
14
15

You will need to have the following R packages:  
eboulanger's avatar
readme    
eboulanger committed
16
adegenet, polysat, pegas, vcfR, hierfstat, coda, radiator, pcadapt
eboulanger's avatar
eboulanger committed
17
18


eboulanger's avatar
eboulanger committed
19
## 01-SNPfiltering 
eboulanger's avatar
eboulanger committed
20
21
22
23
24
25
26
27
28
29

script adapted from [ddocent tutorial](https://www.ddocent.com/filtering/) and additions

### run the script
move to the correct directory and make the output directories
```
cd ~/Documents/project_SEACONNECT/seaConnect--dataPrep/01-SNPfilters/
mkdir 01-Diplodus
mkdir 02-Mullus
```
eboulanger's avatar
eboulanger committed
30

eboulanger's avatar
eboulanger committed
31
32
33
34
set the species-specific arguments for the script to run on  
  $1 = input file (specify path if not in current directory)  
  $2 = output folder  
  $3 = species code  
eboulanger's avatar
eboulanger committed
35
36
37
38
```
bash filtering.sh ../00-rawData/01-Diplodus/sar_ddocent.vcf 01-Diplodus dip
bash filtering.sh ../00-rawData/02-Mullus/mullus.vcf 02-Mullus mul
```
eboulanger's avatar
eboulanger committed
39
40
The final step consists of removing individuals with (extreme) outlier heterozygosity values. Here we define outliers as falling outside 6 * interquartile range. 
This theshold value is set in accordance to our data and it is advised to look at the outputted figures to validate this choice for your data.
eboulanger's avatar
eboulanger committed
41

eboulanger's avatar
eboulanger committed
42
43
44
45
46
47
48
49
50
51
### rename individuals for conventional naming system and create population map
this is adapted for our case, review R script or skip according to naming needs
```
bash renaming.sh 01-Diplodus/ dip
bash renaming.sh 02-Mullus/ mul
```
output:
- dip_all_filtered.vcf
- dip_popualtion_map_297ind.txt
- mul_all_filtered.vcf
eboulanger's avatar
eboulanger committed
52
- mul_population_map_467ind.txt
eboulanger's avatar
eboulanger committed
53

eboulanger's avatar
eboulanger committed
54
## 02-PCAdapt
eboulanger's avatar
eboulanger committed
55
56
57
58
59
60
61
62
63

detect outliers with [PCAdapt](https://bcm-uga.github.io/pcadapt/articles/pcadapt.html) in 
R and retrieve loci positions from vcf file

### Run the PCAdapt.R script

This script will load your vcf file, detect outlier loci using the package pcadapt and
output a list of outlier loci positions that can be used to subset your vcf file

eboulanger's avatar
eboulanger committed
64
65
EDIT: must run interactively to choose K for each species

eboulanger's avatar
eboulanger committed
66
67
68
set arguments :  
$1 = input file (vcf)  
$2 = species code  
eboulanger's avatar
eboulanger committed
69
70
71
72
73
74
75
76
77

#### for mullus :
```
Rscript --vanilla PCAdapt.R ../01-SNPfilters/02-Mullus/mul_all_filtered.vcf mul
```
see how many outliers were detected :
```
wc -l outl_pos_pcadpt_mul.txt
```
eboulanger's avatar
eboulanger committed
78
291 outliers detected for mullus
eboulanger's avatar
eboulanger committed
79
80
81
82
83
84
85

#### for diplodus :
```
Rscript --vanilla PCAdapt.R ../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf dip

wc -l outl_pos_pcadpt_dip.txt
```
eboulanger's avatar
eboulanger committed
86
413 outliers detected for diplodus
eboulanger's avatar
eboulanger committed
87

eboulanger's avatar
eboulanger committed
88
## 03-finalPrep
eboulanger's avatar
eboulanger committed
89

eboulanger's avatar
eboulanger committed
90
Final steps to get neutral and adaptive SNP sets and correct file formats for both species.
eboulanger's avatar
eboulanger committed
91

eboulanger's avatar
eboulanger committed
92
This script subsets the filtered vcf file from 01-SNPfilters by outlier positions detected 
eboulanger's avatar
eboulanger committed
93
in 03-PCAdapt.  
eboulanger's avatar
eboulanger committed
94
95
It also subsets the same vcf file for the remaining neutral positions and applies a final 
filter for HWE.
eboulanger's avatar
eboulanger committed
96

eboulanger's avatar
eboulanger committed
97
98
99
100
Finally, the script converts the final adaptive and neutral .vcf files in .bed .bim .fam .raw
and .strct_in format necessary for downstream analyses.

The conversion to genepop format for use of GENODIVE (to calculate kinship) is done with
eboulanger's avatar
readme    
eboulanger committed
101
102
103
the PGDSpider GUI.   
input: neutral.vcf and population map, output: neutral.gen.txt  
outputted .gen.txt file: add information on first line (otherwise genodive won't recognise format)  
eboulanger's avatar
eboulanger committed
104

eboulanger's avatar
eboulanger committed
105
To run the script, set arguments:  
eboulanger's avatar
eboulanger committed
106
107
108
109
110
  $1 = input file (vcf)  
  $2 = species code  
  
#### for diplodus
```
eboulanger's avatar
eboulanger committed
111
112
113
114
mkdir 01-Diplodus
cp ../01-SNPfilters/01-Diplodus/dip_population_map_*.txt 01-Diplodus/
cp ../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf 01-Diplodus/

eboulanger's avatar
eboulanger committed
115
bash outlier_positions.sh 01-Diplodus/dip_all_filtered.vcf dip
eboulanger's avatar
eboulanger committed
116
```
eboulanger's avatar
eboulanger committed
117

eboulanger's avatar
eboulanger committed
118
After HWE filter, 7655 neutral loci were retained.
eboulanger's avatar
eboulanger committed
119

eboulanger's avatar
eboulanger committed
120
121
#### for Mullus
```
eboulanger's avatar
eboulanger committed
122
123
124
125
mkdir 02-Mullus
cp ../01-SNPfilters/02-Mullus/mul_population_map_*.txt 02-Mullus/
cp ../01-SNPfilters/02-Mullus/mul_all_filtered.vcf 02-Mullus/

eboulanger's avatar
eboulanger committed
126
bash outlier_positions.sh 02-Mullus/mul_all_filtered.vcf mul
eboulanger's avatar
eboulanger committed
127
```
eboulanger's avatar
eboulanger committed
128
After HWE filter, 2462 neutral loci were retained.
eboulanger's avatar
eboulanger committed
129

eboulanger's avatar
eboulanger committed
130

eboulanger's avatar
eboulanger committed
131
132
133
#### clean up files to separate directories
```
mv dip_* 01-Diplodus/
eboulanger's avatar
eboulanger committed
134
135
mv mul_* 02-Mullus/
```
eboulanger's avatar
eboulanger committed
136