Commit 89ce9f75 authored by eboulanger's avatar eboulanger
Browse files

new scripts and README

parent cf783ade
# SNP filtering results for Mullus surmuletus
# SNP filtering
script adapted from [ddocent tutorial](tutorial: https://www.ddocent.com/filtering/)
## run the script
move to the correct directory and make the output directories
```
cd ~/Documents/project_SEACONNECT/seaConnect--dataPrep/01-SNPfilters/
mkdir 01-Diplodus
mkdir 02-Mullus
```
set the species-specific arguments for the script to run on
$1 = input file (specify path if not in current directory)
$2 = output folder
$3 = species code
```
bash filtering.sh ../00-rawData/01-Diplodus/sar_ddocent.vcf 01-Diplodus dip
bash filtering.sh ../00-rawData/02-Mullus/mullus.vcf 02-Mullus mul
```
## SNP filtering results for Mullus surmuletus
### ddocent pipeline + additions
| filtering step | filter for | individuals retained | SNPs retained | run time (sec) | output |
......@@ -14,19 +34,18 @@
| step 8 | paired status | | 17546 | | DP3g95maf05.fil4.vcf
| step 9 | remove sites quality score < 1/4 depth | | 17546 | | DP3g95maf05.fil5.vcf
| step 10 | depth x quality score cutoff | 424 | 15466 | |
| step 11 | HWE 0.000001 | 424 | 14532 | 26.00 | DP3g95maf05.FIL.hwe.recode.vcf
| step 12 | He > 0.6 & Fis > 0.5 & Fix < -0.5 | 424 | 14530 | 25 min |
| step 13 | rename | | | | mullus_all_filtered.vcf
| step 11 | He > 0.6 & Fis > 0.5 & Fix < -0.5 | 424 | 15232 | 25 min | DP3g95maf05.FIL.HFis.recode.vcf
| step 12 | rename | | | | mul_all_filtered.vcf
# SNP filtering results for Diplodus sargus
## SNP filtering results for Diplodus sargus
### ddocent pipeline + additions
| filtering step | filter for | individuals retained | SNPs retained | run time (sec) | output |
| -------------- | ------------------------------------------------- | :------------------: | :-----------: | :------------: | ------ |
| step 0 | ddocent output data | 297 | 13362 | | diplodus.vcf
| step 1 | call below 50%, mac < 3, min quality score = 30 | 297 | 13362 | 14.00 | diplodus.g5mac3.recode.vcf
| step 2 | min mean depth genotype call = 3 | 297 | 13362 | 14.00 | diplodus.g5mac3dp3.recode.vcf
| step 3 | individuals > 50% missing data | 297 | 13362 | 16.00 | diplodus.g5mac3dplm.recode.vcf
| step 1 | call below 50%, mac < 3, min quality score = 30 | 297 | 13362 | 14.00 | g5mac3.recode.vcf
| step 2 | min mean depth genotype call = 3 | 297 | 13362 | 14.00 | g5mac3dp3.recode.vcf
| step 3 | individuals > 50% missing data | 297 | 13362 | 16.00 | g5mac3dplm.recode.vcf
| step 4 | sites > 5% missing data, maf 0.05, min meanDP = 5 | 297 | 10389 | 11.00 | DP3g95maf05.recode.vcf
| step 5 | filter for allele balance | 297 | 10202 | | DP3g95maf05.fil1.vcf
| step 6 | filter out sites with reads from both strands | SKIP | **SKIP** | | SKIP
......@@ -34,6 +53,5 @@
| step 8 | paired status | 297 | 9689 | | DP3g95maf05.fil4.vcf
| step 9 | remove sites quality score < 1/4 depth | 297 | 9688 | | DP3g95maf05.fil5.vcf
| step 10 | depth x quality score cutoff | 297 | 8325 | 11.00 |
| step 11 | HWE 0.000001 | 297 | 7978 | 7.00 | DP3g95maf05.FIL.hwe.recode.vcf
| step 12 | He > 0.6 & Fis > 0.5 & Fix < -0.5 | 297 | 7973 | 8 min |
| step 13 | rename | | | | diplodus_all_filtered.vcf
\ No newline at end of file
| step 11 | He > 0.6 & Fis > 0.5 & Fix < -0.5 | 297 | 8206 | 27 min | DP3g95maf05.FIL.HFis.recode.vcf
| step 12 | rename | | | | dip_all_filtered.vcf
#### Filtering steps by dDocent
# tutorial: https://www.ddocent.com/filtering/
FILE=$1
FOLDER=$2
SPCODE=$3
PATH=$PATH:~/programmes/vcflib/bin/
cd $FOLDER
# step 1: filter genotypes called below 50% (across all individuals),
# filter SNPs that have a minor allele count less than 3
# filter SNPs with a minimum quality score of 30
vcftools --vcf ../$FILE --max-missing 0.5 --mac 3 --minQ 30 --recode --recode-INFO-all --out g5mac3
# step 2: minimum depth for a genotype call
# minimum mean depth
vcftools --vcf g5mac3.recode.vcf --minDP 3 --recode --recode-INFO-all --out g5mac3dp3
# step 3: asses individual levels of missing data and remove individuals accordingly
vcftools --vcf g5mac3dp3.recode.vcf --missing-indv
cat out.imiss
# make histogram of percentage missing data
awk '!/IN/' out.imiss | cut -f5 > totalmissing
gnuplot << \EOF
set terminal dumb size 120, 30
set autoscale
unset label
set title "Histogram of % missing data per individual"
set ylabel "Number of Occurrences"
set xlabel "% of missing data"
#set yr [0:100000]
binwidth=0.01
bin(x,width)=width*floor(x/width) + binwidth/2.0
plot 'totalmissing' using (bin($1,binwidth)):(1.0) smooth freq with boxes
pause -1
EOF
# create list of individuals with more than 50% missing data
awk '$5 > 0.5' out.imiss | cut -f1 > lowDP.indv
# remove individuals from list
vcftools --vcf g5mac3dp3.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out g5mac3dplm
# step 4 : restrict data to variants called in a high percentage of individuals and filter by mean depth of genotypes
vcftools --vcf g5mac3dplm.recode.vcf --max-missing 0.95 --maf 0.05 --recode --recode-INFO-all --out DP3g95maf05 --min-meanDP 5
###
# FreeBayes - specific filters
# look at header
awk '/#/' DP3g95maf05.recode.vcf
# step 5 : filter for allele balance
# this requires vcflib
# population level: vcffilter -s -f "AB > 0.25 & AB < 0.75 | AB < 0.01" DP3g95p5maf05.recode.vcf > DP3g95p5maf05.fil1.vcf
vcffilter -s -f "AB > 0.25 & AB < 0.75 | AB < 0.01" DP3g95maf05.recode.vcf > DP3g95maf05.fil1.vcf
# check how many were removed
awk '!/#/' DP3g95maf05.recode.vcf | wc -l
awk '!/#/' DP3g95maf05.fil1.vcf | wc -l
# step 6 : filter out sites that have reads from both strands
# vcffilter -f "SAF / SAR > 100 & SRF / SRR > 100 | SAR / SAF > 100 & SRR / SRF > 100" -s DP3g95maf05.fil1.vcf > DP3g95maf05.fil2.vcf
# awk '!/#/' DP3g95maf05.fil2.vcf | wc -l
# SKIP THIS STEP BECAUSE REMOVES TOO MANY SNPs FOR MULLUS
# step 7 : look at ration of mapping qualities between reference and alternate alleles
vcffilter -f "MQM / MQMR > 0.9 & MQM / MQMR < 1.05" DP3g95maf05.fil1.vcf > DP3g95maf05.fil3.vcf
awk '!/#/' DP3g95maf05.fil3.vcf | wc -l
# step 8 : check whether there is discrepancy in the properly paired status (SNAP IK NIET MAAR MAAKT BLIJKBAAR NIET UIT WANT GEEN SNPs WEGGEFILTERD)
vcffilter -f "PAIRED > 0.05 & PAIREDR > 0.05 & PAIREDR / PAIRED < 1.75 & PAIREDR / PAIRED > 0.25 | PAIRED < 0.05 & PAIREDR < 0.05" -s DP3g95maf05.fil3.vcf > DP3g95maf05.fil4.vcf
awk '!/#/' DP3g95maf05.fil4.vcf | wc -l
# step 9 : remove any locus that has a quality score below 1/4 of the depth
# (read suggested blogposts to understand why)
vcffilter -f "QUAL / DP > 0.25" DP3g95maf05.fil4.vcf > DP3g95maf05.fil5.vcf
awk '!/#/' DP3g95maf05.fil5.vcf | wc -l
# step 10 : consists of multiple steps
## 1 : create list of depth of each locus
cut -f8 DP3g95maf05.fil5.vcf | grep -oe "DP=[0-9]*" | sed 's/DP=//g' > DP3g95maf05.fil5.DEPTH
## 2 : create list of quality scores
awk '!/#/' DP3g95maf05.fil5.vcf | cut -f1,2,6 > DP3g95maf05.fil5.vcf.loci.qual
## 3 : calculate mean depth
MEAN_DP=`awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' DP3g95maf05.fil5.DEPTH`
## 4 : take the mean depth plus 3X the square of the mean
MEAN_DP_SQ=`python -c "print int($MEAN_DP+3*($MEAN_DP**0.5))"`
echo $MEAN_DP_SQ
## 5 : paste depth and quality files together and find the loci above the cutoff that do not have quality scores 2 times the depth
paste DP3g95maf05.fil5.vcf.loci.qual DP3g95maf05.fil5.DEPTH | awk -v x=$MEAN_DP_SQ '$4 > x' | awk '$3 < 2 * $4' > DP3g95maf05.fil5.lowQDloci
## 6 : remove those sites and recalculate depth across loci
vcftools --vcf DP3g95maf05.fil5.vcf --site-depth --exclude-positions DP3g95maf05.fil5.lowQDloci --out DP3g95maf05.fil5
## 7 : cut depth scores, calculate average depth and plot
NINDIV=`grep "#CHROM" DP3g95maf05.recode.vcf | tr "\t" "\n " | tail -n +10 | wc -l ` # get the number of individuals
cut -f3 DP3g95maf05.fil5.ldepth > DP3g95maf05.fil5.site.depth
awk '!/D/' DP3g95maf05.fil5.site.depth | awk -v x="$NINDIV" '{print $1/x}' > meandepthpersite
gnuplot << \EOF
set terminal dumb size 120, 30
set autoscale
set xrange [10:150]
unset label
set title "Histogram of mean depth per site"
set ylabel "Number of Occurrences"
set xlabel "Mean Depth"
binwidth=1
bin(x,width)=width*floor(x/width) + binwidth/2.0
set xtics 5
plot 'meandepthpersite' using (bin($1,binwidth)):(1.0) smooth freq with boxes
pause -1
EOF
## 8 : combine all to get new vcf
vcftools --vcf DP3g95maf05.fil5.vcf --recode-INFO-all --out DP3g95maf05.FIL --max-meanDP 102.5 --exclude-positions DP3g95maf05.fil5.lowQDloci --recode
# step 11 : remove SNPs for which Ho > 0.6 and Fis < -0.5 or Fis > 0.5
# find to-remove (/ to-keep) SNP positions in R using package Hierfstat
Rscript --vanilla ../filterstep_Ho6_Fis5.R DP3g95maf05.FIL.recode.vcf positions_HoFis_$SPCODE.txt
vcftools --vcf DP3g95maf05.FIL.recode.vcf --positions positions_HoFis_$SPCODE.txt --recode --recode-INFO-all --out DP3g95maf05.FIL.HFis
# step 13 : rename final filtered dataset
cp DP3g95maf05.FIL.HFis.recode.vcf "$SPCODE"_all_filtered.vcf
......@@ -3,9 +3,9 @@ ddocent pipeline + additions
filtering step filter for individuals retained SNPs retained run time (sec) output
step 0 ddocent output data 297 13362 sar_ddocent.vcf
step 1 call below 50%, mac < 3, min quality score = 30 297 13362 14.00 diplodus.g5mac3.recode.vcf
step 2 min mean depth genotype call = 3 297 13362 14.00 diplodus.g5mac3dp3.recode.vcf
step 3 remove individuals > 50% missing data 297 13362 16.00 diplodus.g5mac3dplm.recode.vcf
step 1 call below 50%, mac < 3, min quality score = 30 297 13362 14.00 g5mac3.recode.vcf
step 2 min mean depth genotype call = 3 297 13362 14.00 g5mac3dp3.recode.vcf
step 3 remove individuals > 50% missing data 297 13362 16.00 g5mac3dplm.recode.vcf
step 4 remove sites > 5% missing data, maf 0.05, min meanDP = 5 297 10389 11.00 DP3g95maf05.recode.vcf
step 5 filter for allele balance 297 10202 DP3g95maf05.fil1.vcf
step 6 filter out sites with reads from both strands SKIP SKIP SKIP
......@@ -13,6 +13,5 @@ step 7 ration of mapping qualities reference vs alternate alleles 297 9689 DP3g
step 8 paired status 297 9689 DP3g95maf05.fil4.vcf
step 9 remove sites quality score < 1/4 depth 297 9688 DP3g95maf05.fil5.vcf
step 10 depth x quality score cutoff 297 8325 11.00 DP3g95maf05.FIL.recode.vcf
step 11 HWE 0.000001 297 7978 7.00 DP3g95maf05.FIL.hwe.recode.vcf
step 12 He > 0.6 & Fis > 0.5 & Fis < -0.5 297 7973 8 min DP3g95maf05.FIL.hwe.HFis.recode.vcf
step 13 rename diplodus_all_filtered.vcf
\ No newline at end of file
step 11 He > 0.6 & Fis > 0.5 & Fis < -0.5 297 8206 27 min DP3g95maf05.FIL.HFis.recode.vcf
step 12 rename dip_all_filtered.vcf
\ No newline at end of file
......@@ -13,6 +13,5 @@ step 7 ration of mapping qualities reference vs alternate alleles 17546 DP3g95
step 8 paired status 17546 DP3g95maf05.fil4.vcf
step 9 remove sites quality score < 1/4 depth 17546 DP3g95maf05.fil5.vcf
step 10 depth x quality score cutoff 15466 DP3g95maf05.FIL.recode.vcf
step 11 HWE 0.000001 14532 DP3g95maf05.FIL.hwe.recode.vcf
step 12 He > 0.6 & Fis > 0.5 & Fis < -0.5 14530 25 min DP3g95maf05.FIL.hwe.Hfis.recode.vcf
step 13 rename mullus_all_filtered.vcf
\ No newline at end of file
step 11 He > 0.6 & Fis > 0.5 & Fis < -0.5 15232 DP3g95maf05.FIL.Hfis.recode.vcf
step 12 rename mul_all_filtered.vcf
\ No newline at end of file
setwd("~/Documents/project_SEACONNECT/seaConnect--dataPrep/02-Bayescan/")
#----- BayeScan -----#
## evaluate model convergence and extract results
## script adapted from http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2016/01/BayeScan_BayeScEnv_exercises.pdf
# load libraries
library(coda)
library(vcfR)
# set arguments
spCode <- "dip"
runNum <- 1
inputName <- paste0(spCode,"_bayesc_2pop_run",runNum)
## model convergence ####
chain1 <- read.table(paste0(inputName,"_ch1.sel"),header=TRUE)
#chain<-chain[-c(1)] # remove first column (step number)
# create a MCMC object with the correct thinning interval
chain1 <- mcmc(chain1,thin=10)
# Plot
plot(chain1)
# Summary statistics
summary(chain1)
# Test correlation between sampled parameters to verify that the sample is large enough
autocorr.diag(chain1)
# Effective size of the sample
effectiveSize(chain1)
# Statistic test for convergence
#### Geweke's convergence diagnistic
#The diagnostic reports the z- scores for each parameter.
#With ?? = 0.05, We reject H0 (equality of means => convergence) if z < -1.96 or z > +1.96.
geweke <- geweke.diag(chain1, frac1=0.1, frac2=0.5)
geweke
which(abs(geweke$z)>1.96) # 0 parameter does not converge
#### Heidelberg and Welch's test
heidel.diag(chain1, eps=0.1, pvalue=0.05) # all parameters passed
#### Test comparing two chains
chain2 <- read.table(paste0(inputName,"_ch2.sel"),header=TRUE)
#chain2<-chain2[-c(1)]
chain2 <- mcmc(chain2,thin=10)
plot(chain2)
combined = mcmc.list(chain1,chain2)
plot(combined)
gelman.diag(combined) # should be close to 1
gelman.plot(combined,ask)
## Bayescan results ####
## Functions defined in the script 'plot_R.r'in the BayeScan folder
results<-plot_bayescan(paste0(inputName,"_ch1_fst.txt"),FDR=0.05)
# detect outliers
outliers_bayescan <-results$outliers # position of outliers
results$nb_outliers #45 outliers
# extract corresponding SNP names and positions
vcf <- read.vcfR("../01-SNPfilters/01-Diplodus/diplodus_all_filtered.vcf", verbose=F)
loci <- as.data.frame(vcf@fix[,1:2])
outlier_loci <- loci[results$outliers,]
outlier_positions <- outlier_loci$POS
# output positions table
write.table(outlier_positions, file = paste0("outl_pos_bayesc_",args[2],".txt"), sep = "\t", quote = F, row.names = F, col.names = F)
print(paste0("outlier positions table exported to outl_pos_bayesc_", args[2]), quote = 0)
#----- BayeScan -----#
## produce Bayescan input file
#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)
#args <- c("../01-SNPfilters/01-Diplodus/diplodus_all_filtered.vcf","dip")
#args <- c("../01-SNPfilters/02-Mullus/mullus_all_filtered.vcf","mul")
start <- Sys.time()
# load libraries
library(vcfR)
library(adegenet)
library(radiator)
# define arguments
inputVcf <- args[1]
speciesCode <- args[2]
outputBayeScan <- paste0(speciesCode,"_bayesc_input_2pop")
# import vcf file to genind
vcfFile <- read.vcfR(inputVcf, verbose = F)
!is.null(vcfFile)
genind <- vcfR2genind(vcfFile)
!is.null(genind)
# use DAPC groups as pop
grp <- find.clusters(genind, max.n.clust = 40)
grp$size
pop(genind) <- grp$grp
# convert to bayescan format
genomic_converter(genind, output="bayescan", filename = outputBayeScan)
end <- Sys.time()
print(paste("run ended at", end))
end - start
# Outlier detection
## with BayeScan
### step 1: convert vcf files to Bayescan .txt files
This script will load your vcf file, determine the population identifier for each individual, and return a bayescan .txt inputfile
set arguments : argument 1 = input file (vcf), argument 2 = species code
script is currently set to detect and assign two populations. Run the script interactively if you want to determine and assign other values of K
#### for mullus :
```
Rscript --vanilla Bayescan_input.R ../01-SNPfilters/02-Mullus/mullus_all_filtered.vcf mul
```
#### for diplodus :
```
Rscript --vanilla Bayescan_input.R ../01-SNPfilters/01-Diplodus/diplodus_all_filtered.vcf dip
```
### step 2: determine outliers with Bayescan
download and compile Bayescan from [here](http://cmpg.unibe.ch/software/BayeScan/download.html) and copy the executable to your local bin folder:
```
cp ~/programmes/BayeScan2.1/binaries/BayeScan2.1_macos64bits /usr/local/bin/
```
#### run the BayeScan model for mullus and diplodus data
multiple runs with different parameters and different input data, with two chains per run to compare convergence later
```
bash run_bayescan.sh
```
### step 3: verify convergence
Run interactive R script called `Bayescan_evaluation.R`
run 1 seems to get "best" results for diplodus. Neither runs detects outliers for mullus.
extract outlier list and export loci positions for later subsetting
# This file is used to plot figures for the software Bayescan in R.
# This program, BayeScan, aims at detecting genetics markers under selection,
# based on allele frequency differences between population.
# Copyright (C) 2010 Matthieu Foll
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# Arguments:
# - file is the name of your file ex: "output_fst.txt"
# - the q-value threshold corresponding to the target False Discovery Rate (FDR)
# - size is the size of the points and text labels for outliers
# - pos is the distance between the points and the labels
# - highlight is a optional list of marker indices to display in red.
# - name_highlighted alows to write the indices of highlighted markers instead of using a point like the other markers
# - add_text adds the indices of the outlier markers
# Output:
# This function returns different paremeters in a list
# - outliers: the list of outliers
# - nb_outliers: the number of outliers
# Typical usage:
# - load this file into R (file/source R code)
# - in R, go to the directory where "output_fst.txt" is (file/change current dir)
# - at the R prompt, type
# > plot_bayescan("output_fst.txt",0,FDR=0.05)
# if you save the output in a variable, you can recall the different results:
# results<-plot_bayescan("output_fst.txt",0,FDR=0.05)
# results$outliers
# results$nb_outliers
#
# plotting posterior distribution is very easy in R with the output of BayeScan:
# first load the output file *.sel produced by BayeScan
# > mydata=read.table("bi.sel",colClasses="numeric")
# choose the parameter you want to plot by setting for example:
# parameter="Fst1"
# then this line will make the plot for:
# > plot(density(mydata[[parameter]]),xlab=parameter,main=paste(parameter,"posterior distribution"))
# you can plot population specific Fst coefficient by setting
# parameter="Fst1"
# if you have non-codominant data you can plot posterior for Fis coefficients in each population:
# parameter="Fis1"
# if you test for selection, you can plot the posterior for alpha coefficient for selection:
# parameter="alpha1"
# you also have access to the likelihood with:
# parameter="logL"
# if you have the package "boa" installed, you can very easily obtain Highest Probability
# Density Interval (HPDI) for your parameter of interest (example for the 95% interval):
# > boa.hpd(mydata[[parameter]],0.05)
plot_bayescan<-function(res,FDR=0.05,size=1,pos=0.35,highlight=NULL,name_highlighted=F,add_text=T)
{
if (is.character(res))
res=read.table(res)
colfstat=5
colq=colfstat-2
highlight_rows=which(is.element(as.numeric(row.names(res)),highlight))
non_highlight_rows=setdiff(1:nrow(res),highlight_rows)
outliers=as.integer(row.names(res[res[,colq]<=FDR,]))
ok_outliers=TRUE
if (sum(res[,colq]<=FDR)==0)
ok_outliers=FALSE;
res[res[,colq]<=0.0001,colq]=0.0001
# plot
plot(log10(res[,colq]),res[,colfstat],xlim=rev(range(log10(res[,colq]))),xlab="log10(q value)",ylab=names(res[colfstat]),type="n")
points(log10(res[non_highlight_rows,colq]),res[non_highlight_rows,colfstat],pch=19,cex=size)
if (name_highlighted) {
if (length(highlight_rows)>0) {
text(log10(res[highlight_rows,colq]),res[highlight_rows,colfstat],row.names(res[highlight_rows,]),col="red",cex=size*1.2,font=2)
}
}
else {
points(log10(res[highlight_rows,colq]),res[highlight_rows,colfstat],col="red",pch=19,cex=size)
# add names of loci over p and vertical line
if (ok_outliers & add_text) {
text(log10(res[res[,colq]<=FDR,][,colq])+pos*(round(runif(nrow(res[res[,colq]<=FDR,]),1,2))*2-3),res[res[,colq]<=FDR,][,colfstat],row.names(res[res[,colq]<=FDR,]),cex=size)
}
}
lines(c(log10(FDR),log10(FDR)),c(-1,1),lwd=2)
return(list("outliers"=outliers,"nb_outliers"=length(outliers)))
}
#!/bin/bash
# first run, two chains per run
# PCAdapt detected between two and three populations for mullus so we run the model for both options
#BayeScan2.1_macos64bits mul_bayesc_input_2pop.txt -o mul_bayesc_2pop_run1_ch1 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits mul_bayesc_input_2pop.txt -o mul_bayesc_2pop_run1_ch2 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
#BayeScan2.1_macos64bits mul_bayesc_input_3pop.txt -o mul_bayesc_3pop_run1_ch1 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits mul_bayesc_input_3pop.txt -o mul_bayesc_3pop_run1_ch2 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
#BayeScan2.1_macos64bits dip_bayesc_input_2pop.txt -o dip_bayesc_2pop_run1_ch1 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits dip_bayesc_input_2pop.txt -o dip_bayesc_2pop_run1_ch2 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
# second run, two chains per run
# here we try with a higher prior odds
#BayeScan2.1_macos64bits mul_bayesc_input_2pop.txt -o mul_bayesc_2pop_run2_ch1 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10000
BayeScan2.1_macos64bits mul_bayesc_input_2pop.txt -o mul_bayesc_2pop_run2_ch2 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10000
#BayeScan2.1_macos64bits dip_bayesc_input_2pop.txt -o dip_bayesc_2pop_run2_ch1 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10000
BayeScan2.1_macos64bits dip_bayesc_input_2pop.txt -o dip_bayesc_2pop_run2_ch2 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10000
# third run
# try again with prior odds = 100, and a greater number of iterations
BayeScan2.1_macos64bits mul_bayesc_input_2pop.txt -o mul_bayesc_2pop_run3_ch1 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits mul_bayesc_input_2pop.txt -o mul_bayesc_2pop_run3_ch2 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits mul_bayesc_input_3pop.txt -o mul_bayesc_3pop_run3_ch1 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits mul_bayesc_input_3pop.txt -o mul_bayesc_3pop_run3_ch2 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits dip_bayesc_input_2pop.txt -o dip_bayesc_2pop_run3_ch1 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
BayeScan2.1_macos64bits dip_bayesc_input_2pop.txt -o dip_bayesc_2pop_run3_ch2 -n 10000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100
\ No newline at end of file
......@@ -3,13 +3,17 @@
#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)
#args <- c("../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf", "dip")
# load libraries
library(pcadapt)
library(vcfR)
# import vcf file to pcadapt format
# define arguments
path_to_file <- args[1]
speciesCode <- args[2]
# import vcf file to pcadapt format
dat <- read.pcadapt(path_to_file, type = "vcf")
# choose number K of principal components
......@@ -48,8 +52,8 @@ print(paste(length(outliers), "outliers detected"),quote=0)
vcf <- read.vcfR(path_to_file, verbose=F)
loci <- as.data.frame(vcf@fix[,1:2])
outlier_loci <- loci[outliers,]
outlier_positions <- outlier_loci$POS
outlier_positions <- droplevels(outlier_loci$POS)
# output positions table
write.table(outlier_positions, file = paste0("outl_pos_pcadpt_",args[2],".txt"), sep = "\t", quote = F, row.names = F, col.names = F)
print(paste0("outlier positions table exported to outl_pos_pcadpt_", args[2]), quote = 0)
write.table(outlier_positions, file = paste0("outl_pos_pcadpt_",speciesCode,".txt"), sep = "\t", quote = F, row.names = F, col.names = F)
print(paste0("outlier positions table exported to outl_pos_pcadpt_",speciesCode), quote = 0)
......@@ -9,7 +9,7 @@ set arguments : argument 1 = input file (vcf), argument 2 = species code
#### for mullus :
```
Rscript --vanilla PCAdapt.R ../01-SNPfilters/02-Mullus/mullus_all_filtered.vcf mul
Rscript --vanilla PCAdapt.R ../01-SNPfilters/02-Mullus/mul_all_filtered.vcf mul
```
see how many outliers were detected :
```
......@@ -19,8 +19,8 @@ wc -l outl_pos_pcadpt_mul.txt
#### for diplodus :
```
Rscript --vanilla PCAdapt.R ../01-SNPfilters/01-Diplodus/diplodus_all_filtered.vcf dip
Rscript --vanilla PCAdapt.R ../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf dip
wc -l outl_pos_pcadpt_dip.txt
```
289 outliers detected for diplodus
312 outliers detected for diplodus
## final steps to get neutral and adaptive SNP set
# adaptive SNPs
### step 1: create a final list of all outlier loci positions
```
cat ../02-Bayescan/outl_pos_bayesc_dip.txt ../03-PCAdapt/outl_pos_pcadpt_dip.txt > outl_pos_combi_dip.txt
uniq -d outl_pos_combi_dip.txt | wc -l
```
no diplodus outliers were detected by both methods
keep only unique positions:
```
uniq -u outl_pos_combi_dip.txt > outl_pos_dip_433.txt
```
### step 2: subset vcf file by outlier loci positions
```
vcftools --vcf ../01-SNPfilters/dip_all_filtered.vcf --positions outl_pos_dip_433.txt --recode-INFO-all --out diplodus_adaptive
mv diplodus_adaptive.recode.vcf diplodus_adaptive.vcf
```
# neutral SNPs
### step 1 : remove outlier loci
### step 2 : filter for HWE
step 11 : HWE
recommends by pop but since I have mostly one large pop: apply to all
```
vcftools --vcf dip_all_filtered.vcf --hwe 0.000001 --recode --recode-INFO-all --out mul.FIL.HFis.hwe
```
......@@ -2,4 +2,146 @@
Scripts to prepare RADSeq data for analysis.
- filtering steps for dDocent SNP output
- outlier detection
- file conversions, subsetting and renaming
## 01-SNPfiltering
script adapted from [ddocent tutorial](https://www.ddocent.com/filtering/) and additions
### run the script
move to the correct directory and make the output directories
```
cd ~/Documents/project_SEACONNECT/seaConnect--dataPrep/01-SNPfilters/
mkdir 01-Diplodus
mkdir 02-Mullus
```
set the species-specific arguments for the script to run on
$1 = input file (specify path if not in current directory)
$2 = output folder
$3 = species code