Commit 95b20e37 authored by eboulanger's avatar eboulanger
Browse files

finalPrep and updates

parent 14602a3a
......@@ -6,6 +6,9 @@
*.sel
*.rad
.DS_Store
.Rhistory
00-rawData
01-SNPfilters/01-Diplodus/
......@@ -13,6 +16,6 @@
01-SNPfilters/filtering_dip.sh
01-SNPfilters/filtering_mul.sh
02-outlierDetection
03-finalPrep
04-finalPrep/01-Diplodus/
04-finalPrep/02-Mullus/
# SNP filtering
script adapted from [ddocent tutorial](tutorial: https://www.ddocent.com/filtering/)
## run the script
move to the correct directory and make the output directories
```
cd ~/Documents/project_SEACONNECT/seaConnect--dataPrep/01-SNPfilters/
mkdir 01-Diplodus
mkdir 02-Mullus
```
set the species-specific arguments for the script to run on
$1 = input file (specify path if not in current directory)
$2 = output folder
$3 = species code
```
bash filtering.sh ../00-rawData/01-Diplodus/sar_ddocent.vcf 01-Diplodus dip
bash filtering.sh ../00-rawData/02-Mullus/mullus.vcf 02-Mullus mul
```
## SNP filtering results for Mullus surmuletus
### ddocent pipeline + additions
| filtering step | filter for | individuals retained | SNPs retained | run time (sec) | output |
| -------------- | ------------------------------------------------- | :------------------: | :-----------: | :------------: | ------ |
| step 0 | ddocent output data | 431 | 49027 | | mullus.vcf
| step 1 | call below 50%, mac < 3, min quality score = 30 | 431 | 49027 | 71.00 | mullus.g5mac3.recode.vcf
| step 2 | min mean depth genotype call = 3 | 431 | 49027 | 74.00 | mullus.g5mac3dp3.recode.vcf
| step 3 | individuals > 50% missing data | 424 | 49027 | 70.00 | mullus.g5mac3dplm.recode.vcf
| step 4 | sites > 5% missing data, maf 0.05, min meanDP = 5 | 424 | 18727 | 14.00 | DP3g95maf05.recode.vcf
| step 5 | filter for allele balance | | 17965 | | DP3g95maf05.fil1.vcf
| step 6 | filter out sites with reads from both strands | | **SKIP** | |
| step 7 | ration mapping qualities ref vs alternate alleles | | 17546 | | DP3g95maf05.fil3.vcf
| step 8 | paired status | | 17546 | | DP3g95maf05.fil4.vcf
| step 9 | remove sites quality score < 1/4 depth | | 17546 | | DP3g95maf05.fil5.vcf
| step 10 | depth x quality score cutoff | 424 | 15466 | |
| step 11 | He > 0.6 & Fis > 0.5 & Fix < -0.5 | 424 | 15232 | 25 min | DP3g95maf05.FIL.HFis.recode.vcf
| step 12 | rename | | | | mul_all_filtered.vcf
## SNP filtering results for Diplodus sargus
### ddocent pipeline + additions
| filtering step | filter for | individuals retained | SNPs retained | run time (sec) | output |
| -------------- | ------------------------------------------------- | :------------------: | :-----------: | :------------: | ------ |
| step 0 | ddocent output data | 297 | 13362 | | diplodus.vcf
| step 1 | call below 50%, mac < 3, min quality score = 30 | 297 | 13362 | 14.00 | g5mac3.recode.vcf
| step 2 | min mean depth genotype call = 3 | 297 | 13362 | 14.00 | g5mac3dp3.recode.vcf
| step 3 | individuals > 50% missing data | 297 | 13362 | 16.00 | g5mac3dplm.recode.vcf
| step 4 | sites > 5% missing data, maf 0.05, min meanDP = 5 | 297 | 10389 | 11.00 | DP3g95maf05.recode.vcf
| step 5 | filter for allele balance | 297 | 10202 | | DP3g95maf05.fil1.vcf
| step 6 | filter out sites with reads from both strands | SKIP | **SKIP** | | SKIP
| step 7 | ration mapping qualities ref vs alternate alleles | 297 | 9689 | | DP3g95maf05.fil3.vcf
| step 8 | paired status | 297 | 9689 | | DP3g95maf05.fil4.vcf
| step 9 | remove sites quality score < 1/4 depth | 297 | 9688 | | DP3g95maf05.fil5.vcf
| step 10 | depth x quality score cutoff | 297 | 8325 | 11.00 |
| step 11 | He > 0.6 & Fis > 0.5 & Fix < -0.5 | 297 | 8206 | 27 min | DP3g95maf05.FIL.HFis.recode.vcf
| step 12 | rename | | | | dip_all_filtered.vcf
#### Filtering steps by dDocent
# tutorial: https://www.ddocent.com/filtering/
# define arguments
FILE=$1
FOLDER=$2
SPCODE=$3
......
......@@ -8,9 +8,11 @@ library(coda)
library(vcfR)
# set arguments
spCode <- "dip"
runNum <- 1
inputName <- paste0(spCode,"_bayesc_2pop_run",runNum)
spCode <- "dip" # species
runNum <- 1 # run number
inputName <- paste0(spCode,"_bayesc_2pop_run",runNum) # 2pop or 3pop analysis (only for mullus)
#vcfPath <- "../01-SNPfilters/02-Mullus/mul_all_filtered.vcf"
vcfPath <- "../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf"
## model convergence ####
chain1 <- read.table(paste0(inputName,"_ch1.sel"),header=TRUE)
......@@ -55,16 +57,15 @@ results<-plot_bayescan(paste0(inputName,"_ch1_fst.txt"),FDR=0.05)
# detect outliers
outliers_bayescan <-results$outliers # position of outliers
results$nb_outliers #45 outliers
results$nb_outliers #68 outliers for mullus, 126 for diplodus
# extract corresponding SNP names and positions
vcf <- read.vcfR("../01-SNPfilters/01-Diplodus/diplodus_all_filtered.vcf", verbose=F)
vcf <- read.vcfR(vcfPath, verbose=F)
loci <- as.data.frame(vcf@fix[,1:2])
outlier_loci <- loci[results$outliers,]
outlier_positions <- outlier_loci$POS
# output positions table
write.table(outlier_positions, file = paste0("outl_pos_bayesc_",args[2],".txt"), sep = "\t", quote = F, row.names = F, col.names = F)
print(paste0("outlier positions table exported to outl_pos_bayesc_", args[2]), quote = 0)
write.table(outlier_loci, file = paste0("outl_pos_bayesc_",spCode,".txt"), sep = "\t", quote = F, row.names = F, col.names = F)
print(paste0("outlier positions table exported to outl_pos_bayesc_", spCode,".txt"), quote = 0)
......@@ -3,8 +3,8 @@
#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)
#args <- c("../01-SNPfilters/01-Diplodus/diplodus_all_filtered.vcf","dip")
#args <- c("../01-SNPfilters/02-Mullus/mullus_all_filtered.vcf","mul")
#args <- c("../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf","dip")
#args <- c("../01-SNPfilters/02-Mullus/mul_all_filtered.vcf","mul")
start <- Sys.time()
......@@ -16,7 +16,7 @@ library(radiator)
# define arguments
inputVcf <- args[1]
speciesCode <- args[2]
outputBayeScan <- paste0(speciesCode,"_bayesc_input_2pop")
outputBayeScan <- paste0(speciesCode,"_bayesc_input_3pop")
# import vcf file to genind
vcfFile <- read.vcfR(inputVcf, verbose = F)
......@@ -26,6 +26,8 @@ genind <- vcfR2genind(vcfFile)
# use DAPC groups as pop
grp <- find.clusters(genind, max.n.clust = 40)
400
2
grp$size
pop(genind) <- grp$grp
......
# Outlier detection
## with BayeScan
### step 1: convert vcf files to Bayescan .txt files
This script will load your vcf file, determine the population identifier for each individual, and return a bayescan .txt inputfile
set arguments : argument 1 = input file (vcf), argument 2 = species code
script is currently set to detect and assign two populations. Run the script interactively if you want to determine and assign other values of K
#### for mullus :
```
Rscript --vanilla Bayescan_input.R ../01-SNPfilters/02-Mullus/mullus_all_filtered.vcf mul
```
#### for diplodus :
```
Rscript --vanilla Bayescan_input.R ../01-SNPfilters/01-Diplodus/diplodus_all_filtered.vcf dip
```
### step 2: determine outliers with Bayescan
download and compile Bayescan from [here](http://cmpg.unibe.ch/software/BayeScan/download.html) and copy the executable to your local bin folder:
```
cp ~/programmes/BayeScan2.1/binaries/BayeScan2.1_macos64bits /usr/local/bin/
```
#### run the BayeScan model for mullus and diplodus data
multiple runs with different parameters and different input data, with two chains per run to compare convergence later
```
bash run_bayescan.sh
```
### step 3: verify convergence
Run interactive R script called `Bayescan_evaluation.R`
run 1 seems to get "best" results for diplodus. Neither runs detects outliers for mullus.
extract outlier list and export loci positions for later subsetting
......@@ -2,21 +2,24 @@
# first run, two chains per run
# PCAdapt detected between two and three populations for mullus so we run the model for both options
#BayeScan2.1_macos64bits mul_bayesc_input_2pop.txt -o mul_bayesc_2pop_run1_ch1 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits mul_bayesc_input_2pop.txt -o mul_bayesc_2pop_run1_ch1 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits mul_bayesc_input_2pop.txt -o mul_bayesc_2pop_run1_ch2 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
#BayeScan2.1_macos64bits mul_bayesc_input_3pop.txt -o mul_bayesc_3pop_run1_ch1 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits mul_bayesc_input_3pop.txt -o mul_bayesc_3pop_run1_ch1 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits mul_bayesc_input_3pop.txt -o mul_bayesc_3pop_run1_ch2 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
#BayeScan2.1_macos64bits dip_bayesc_input_2pop.txt -o dip_bayesc_2pop_run1_ch1 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits dip_bayesc_input_2pop.txt -o dip_bayesc_2pop_run1_ch1 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits dip_bayesc_input_2pop.txt -o dip_bayesc_2pop_run1_ch2 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
# second run, two chains per run
# here we try with a higher prior odds
#BayeScan2.1_macos64bits mul_bayesc_input_2pop.txt -o mul_bayesc_2pop_run2_ch1 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10000
BayeScan2.1_macos64bits mul_bayesc_input_2pop.txt -o mul_bayesc_2pop_run2_ch1 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10000
BayeScan2.1_macos64bits mul_bayesc_input_2pop.txt -o mul_bayesc_2pop_run2_ch2 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10000
#BayeScan2.1_macos64bits dip_bayesc_input_2pop.txt -o dip_bayesc_2pop_run2_ch1 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10000
BayeScan2.1_macos64bits mul_bayesc_input_3pop.txt -o mul_bayesc_2pop_run2_ch1 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10000
BayeScan2.1_macos64bits mul_bayesc_input_3pop.txt -o mul_bayesc_2pop_run2_ch2 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10000
BayeScan2.1_macos64bits dip_bayesc_input_2pop.txt -o dip_bayesc_2pop_run2_ch1 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10000
BayeScan2.1_macos64bits dip_bayesc_input_2pop.txt -o dip_bayesc_2pop_run2_ch2 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10000
......
......@@ -4,6 +4,7 @@
#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)
#args <- c("../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf", "dip")
#args <- c("../01-SNPfilters/02-Mullus/mul_all_filtered.vcf", "mul")
# load libraries
library(pcadapt)
......@@ -52,8 +53,7 @@ print(paste(length(outliers), "outliers detected"),quote=0)
vcf <- read.vcfR(path_to_file, verbose=F)
loci <- as.data.frame(vcf@fix[,1:2])
outlier_loci <- loci[outliers,]
outlier_positions <- droplevels(outlier_loci$POS)
# output positions table
write.table(outlier_positions, file = paste0("outl_pos_pcadpt_",speciesCode,".txt"), sep = "\t", quote = F, row.names = F, col.names = F)
write.table(outlier_loci, file = paste0("outl_pos_pcadpt_",speciesCode,".txt"), sep = "\t", quote = F, row.names = F, col.names = F)
print(paste0("outlier positions table exported to outl_pos_pcadpt_",speciesCode), quote = 0)
# detect outliers with PCAdapt in R and retrieve loci positions from vcf file
### Run the PCAdapt.R script
This script will load your vcf file, detect outlier loci using the package pcadapt and
output a list of outlier loci positions that can be used to subset your vcf file
set arguments : argument 1 = input file (vcf), argument 2 = species code
#### for mullus :
```
Rscript --vanilla PCAdapt.R ../01-SNPfilters/02-Mullus/mul_all_filtered.vcf mul
```
see how many outliers were detected :
```
wc -l outl_pos_pcadpt_mul.txt
```
2078 outliers detected for mullus
#### for diplodus :
```
Rscript --vanilla PCAdapt.R ../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf dip
wc -l outl_pos_pcadpt_dip.txt
```
312 outliers detected for diplodus
## final steps to get neutral and adaptive SNP set
# adaptive SNPs
### step 1: create a final list of all outlier loci positions
```
cat ../02-Bayescan/outl_pos_bayesc_dip.txt ../03-PCAdapt/outl_pos_pcadpt_dip.txt > outl_pos_combi_dip.txt
uniq -d outl_pos_combi_dip.txt | wc -l
```
no diplodus outliers were detected by both methods
keep only unique positions:
```
uniq -u outl_pos_combi_dip.txt > outl_pos_dip_433.txt
```
### step 2: subset vcf file by outlier loci positions
```
vcftools --vcf ../01-SNPfilters/dip_all_filtered.vcf --positions outl_pos_dip_433.txt --recode-INFO-all --out diplodus_adaptive
mv diplodus_adaptive.recode.vcf diplodus_adaptive.vcf
```
# neutral SNPs
### step 1 : remove outlier loci
### step 2 : filter for HWE
step 11 : HWE
recommends by pop but since I have mostly one large pop: apply to all
```
vcftools --vcf dip_all_filtered.vcf --hwe 0.000001 --recode --recode-INFO-all --out mul.FIL.HFis.hwe
```
## Final steps to get neutral and adaptive SNP set and correct file formats
# define arguments
inputFile=$1
spCode=$2
#### adaptive SNPs ####
## step 1: create a final list of all outlier loci positions
cat ../02-Bayescan/outl_pos_bayesc_"$spCode".txt ../03-PCAdapt/outl_pos_pcadpt_"$spCode".txt > "$spCode"_outl_pos_combi.txt
# how many are detected by both methods?
nPosOutDup=`sort "$spCode"_outl_pos_combi.txt | uniq -d | wc -l`
echo ""$nPosOutDup" outlier loci detected with both methods"
# extract total number of unique outlier positions (for file naming)
nPosOut=`sort "$spCode"_outl_pos_combi.txt | uniq -u | wc -l`
echo ""$nPosOut" unique outlier loci in total"
# remove leading whitespaces (from here: https://www.cyberciti.biz/faq/bash-remove-whitespace-from-string/)
shopt -s extglob # turn it on
nPosOut="${nPosOut##*( )}"
shopt -u extglob # turn it off
# extract file with unique positions
sort "$spCode"_outl_pos_combi.txt | uniq -u > "$spCode"_outl_pos_"$nPosOut".txt
#rm "$spCode"_outl_pos_combi.txt
## step 2: subset filtered vcf file by outlier positions
vcftools --vcf "$inputFile" --positions "$spCode"_outl_pos_"$nPosOut".txt --recode --recode-INFO-all --out "$spCode"_adaptive_"$nPosOut"
mv "$spCode"_adaptive_"$nPosOut".recode.vcf "$spCode"_adaptive_"$nPosOut".vcf
## step 3 : convert vcf files to necessary formats for downstream analyses
# convert .vcf files to .bep files for ADMIXTURE
# .vcf to .tped
vcftools --vcf "$spCode"_adaptive_"$nPosOut".vcf --plink-tped --out "$spCode"_adaptive_"$nPosOut"
# .tped to .bed
plink --tped "$spCode"_adaptive_"$nPosOut".tped --tfam "$spCode"_adaptive_"$nPosOut".tfam --make-bed --out "$spCode"_adaptive_"$nPosOut" --noweb
# convert to minor allele frequencies for RDA and other analyses
# .tped to .raw
plink --tped "$spCode"_adaptive_"$nPosOut".tped --tfam "$spCode"_adaptive_"$nPosOut".tfam --recodeA --out "$spCode"_adaptive_"$nPosOut" --noweb
#### neutral SNPs ####
## step 1: create a final list of all neutral loci positions
# get list of all positions original vcf file
grep -v "^##" "$inputFile" | cut -f1-2 | sed '1d' > "$spCode"_all_pos.txt
# remove previously identified outlier positions to only retain neutral ones
cat "$spCode"_all_pos.txt "$spCode"_outl_pos_"$nPosOut".txt | sort | uniq -u > "$spCode"_ntrl_pos_preHWE.txt
## step 2 : subset filtered vcf file by neutral outlier positions
vcftools --vcf "$inputFile" --positions "$spCode"_ntrl_pos_preHWE.txt --recode --recode-INFO-all --out "$spCode"_neutral_preHWE
## step 3 : apply HWE filter
vcftools --vcf "$spCode"_neutral_preHWE.recode.vcf --hwe 0.000001 --recode --recode-INFO-all --out "$spCode"_neutral
nPosNtrl=`grep -c -v "^#" "$spCode"_neutral.recode.vcf`
mv "$spCode"_neutral.recode.vcf "$spCode"_neutral_"$nPosNtrl".vcf
mv "$spCode"_neutral.log "$spCode"_neutral_"$nPosNtrl".log
## step 4 : convert vcf files to necessary formats for downstream analyses
# convert .vcf files to .bep files for ADMIXTURE
# .vcf to .tped
vcftools --vcf "$spCode"_neutral_"$nPosNtrl".vcf --plink-tped --out "$spCode"_neutral_"$nPosNtrl"
# .tped to .bed
plink --tped "$spCode"_neutral_"$nPosNtrl".tped --tfam "$spCode"_neutral_"$nPosNtrl".tfam --make-bed --out "$spCode"_neutral_"$nPosNtrl" --noweb
# convert to minor allele frequencies for RDA and other analyses
# .tped to .raw
plink --tped "$spCode"_neutral_"$nPosNtrl".tped --tfam "$spCode"_neutral_"$nPosNtrl".tfam --recodeA --out "$spCode"_neutral_"$nPosNtrl" --noweb
......@@ -5,7 +5,7 @@ Scripts to prepare RADSeq data for analysis.
- outlier detection
- file conversions, subsetting and renaming
## 01-SNPfiltering
## 01-SNPfiltering
script adapted from [ddocent tutorial](https://www.ddocent.com/filtering/) and additions
......@@ -16,10 +16,10 @@ cd ~/Documents/project_SEACONNECT/seaConnect--dataPrep/01-SNPfilters/
mkdir 01-Diplodus
mkdir 02-Mullus
```
set the species-specific arguments for the script to run on
$1 = input file (specify path if not in current directory)
$2 = output folder
$3 = species code
set the species-specific arguments for the script to run on
$1 = input file (specify path if not in current directory)
$2 = output folder
$3 = species code
```
bash filtering.sh ../00-rawData/01-Diplodus/sar_ddocent.vcf 01-Diplodus dip
bash filtering.sh ../00-rawData/02-Mullus/mullus.vcf 02-Mullus mul
......@@ -70,11 +70,11 @@ Detect Fst outliers by bayesian inference with the [BayeScan software](http://cm
This script will load your vcf file, determine the population identifier for each
individual, and return a bayescan .txt inputfile
set arguments :
$1 = input file (vcf)
$2 = species code
the script is currently set to detect and assign two populations (K). Run the script
interactively if you want to determine and assign other values of K
set arguments :
$1 = input file (vcf)
$2 = species code
The script is currently set to detect and assign two populations (K). Run the script
interactively if you want to determine and assign other values of K.
#### for mullus :
```
......@@ -115,9 +115,9 @@ R and retrieve loci positions from vcf file
This script will load your vcf file, detect outlier loci using the package pcadapt and
output a list of outlier loci positions that can be used to subset your vcf file
set arguments :
$1 = input file (vcf)
$2 = species code
set arguments :
$1 = input file (vcf)
$2 = species code
#### for mullus :
```
......@@ -139,9 +139,36 @@ wc -l outl_pos_pcadpt_dip.txt
## 04-finalPrep
combine outlier detection results from Bayescan and PCAdapt and subset your original vcf
file to get a neutral SNP dataset and an adaptive SNP dataset
Final steps to get neutral and adaptive SNP sets and correct file formats for both species.
### adaptive loci
This script subsets the filtered vcf file from 01-SNPfilters by outlier positions detected
in 02-BayeScan and 03-PCAdapt.
It also subsets the same vcf file for the remaining neutral positions and applies a final
filter for HWE.
### neutral loci
\ No newline at end of file
Finally, the script converts the final adaptive and neutral .vcf files in .tped, .tfam,
.bed and .raw format necessary for downstream analyses.
set arguments:
$1 = input file (vcf)
$2 = species code
#### for diplodus
```
bash outlier_positions.sh ../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf dip
```
#### for Mullus
```
bash outlier_positions.sh ../01-SNPfilters/02-Mullus/mul_all_filtered.vcf mul
```
In total, 2680 adaptive loci were detected, with 10 loci detected by both the BayeScan and PCAdapt method.
After HWE filter, 12432 neutral loci were retained.
#### clean up files to separate directories
```
mkdir 01-Diplodus
mv dip_* 01-Diplodus/
mkdir 02-Mullus
mv mul_* 02-Mullus
```
\ No newline at end of file
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment