Commit 1e09e4ad authored by eboulanger's avatar eboulanger
Browse files

update readme

parent defd337c
......@@ -27,11 +27,7 @@ cd ~/Documents/project_SEACONNECT/seaConnect--dataPrep/01-SNPfilters/
mkdir 01-Diplodus
mkdir 02-Mullus
```
start a log of the terminal output
```
script 01-Diplodus/filtering.log
script 02-Mullus/filtering.log
```
set the species-specific arguments for the script to run on
$1 = input file (specify path if not in current directory)
$2 = output folder
......@@ -40,28 +36,26 @@ set the species-specific arguments for the script to run on
bash filtering.sh ../00-rawData/01-Diplodus/sar_ddocent.vcf 01-Diplodus dip
bash filtering.sh ../00-rawData/02-Mullus/mullus.vcf 02-Mullus mul
```
when all steps are finished: finish the terminal output log
```
exit
```
The final step consists of removing individuals with (extreme) outlier heterozygosity values. Here we define outliers as falling outside 6 * interquartile range.
This theshold value is set in accordance to our data and it is advised to look at the outputted figures to validate this choice for your data.
### SNP filtering results for Mullus surmuletus
| filtering step | filter for | individuals retained | SNPs retained | run time (sec) | output |
| -------------- | ------------------------------------------------- | :------------------: | :-----------: | :------------: | ------ |
| step 0 | ddocent output data | 431 | 49027 | | mullus.vcf
| step 1 | call below 50%, mac < 3, min quality score = 30 | 431 | 49027 | 71.00 | mullus.g5mac3.recode.vcf
| step 2 | min mean depth genotype call = 3 | 431 | 49027 | 74.00 | mullus.g5mac3dp3.recode.vcf
| step 3 | individuals > 50% missing data | 424 | 49027 | 70.00 | mullus.g5mac3dplm.recode.vcf
| step 4 | sites > 5% missing data, maf 0.05, min meanDP = 5 | 424 | 18727 | 14.00 | DP3g95maf05.recode.vcf
| step 5 | filter for allele balance | | 17965 | | DP3g95maf05.fil1.vcf
| step 6 | filter out sites with reads from both strands | | **SKIP** | |
| step 7 | ration mapping qualities ref vs alternate alleles | | 17546 | | DP3g95maf05.fil3.vcf
| step 8 | paired status | | 17546 | | DP3g95maf05.fil4.vcf
| step 9 | remove sites quality score < 1/4 depth | | 17546 | | DP3g95maf05.fil5.vcf
| step 10 | depth x quality score cutoff | 424 | 15466 | |
| step 11 | He > 0.6 & Fis > 0.5 & Fix < -0.5 | 424 | 15232 | 25 min | DP3g95maf05.FIL.HFis.recode.vcf
| step 12 | remove extreme outliers individual O HET | 413 | 15232 | 23.00 | DP3g95maf05.FIL.HFis.indHet.recode.vcf
| step 0 | ddocent output data | 526 | | | mullus.vcf
| step 1 | call below 50%, mac < 3, min quality score = 30 | 526 | 549881 | 71.00 | mullus.g5mac3.recode.vcf
| step 2 | min mean depth genotype call = 3 | 482 | 549881 | 74.00 | mullus.g5mac3dp3.recode.vcf
| step 3 | individuals > 50% missing data | 482 | 549881 | 70.00 | mullus.g5mac3dplm.recode.vcf
| step 4 | sites > 5% missing data, maf 0.05, min meanDP = 5 | 482 | 3657 | 14.00 | DP3g95maf05.recode.vcf
| step 5 | filter for allele balance | 482 | 3421 | | DP3g95maf05.fil1.vcf
| step 6 | filter out sites with reads from both strands | 482 | **SKIP** | |
| step 7 | ration mapping qualities ref vs alternate alleles | 482 | 3284 | | DP3g95maf05.fil3.vcf
| step 8 | paired status | 482 | 3284 | | DP3g95maf05.fil4.vcf
| step 9 | remove sites quality score < 1/4 depth | 482 | 3263 | | DP3g95maf05.fil5.vcf
| step 10 | depth x quality score cutoff | 482 | 2857 | |
| step 11 | He > 0.6 & Fis > 0.5 & Fix < -0.5 | 482 | 2794 | 25 min | DP3g95maf05.FIL.HFis.recode.vcf
| step 12 | remove extreme outliers individual O HET | 482 | 2794 | 23.00 | DP3g95maf05.FIL.HFis.indHet.recode.vcf
| step 13 | rename | | | | mul_all_filtered_origid.vcf
### SNP filtering results for Diplodus sargus
......@@ -74,13 +68,13 @@ exit
| step 3 | individuals > 50% missing data | 297 | 13362 | 16.00 | g5mac3dplm.recode.vcf
| step 4 | sites > 5% missing data, maf 0.05, min meanDP = 5 | 297 | 10389 | 11.00 | DP3g95maf05.recode.vcf
| step 5 | filter for allele balance | 297 | 10202 | | DP3g95maf05.fil1.vcf
| step 6 | filter out sites with reads from both strands | SKIP | **SKIP** | | SKIP
| step 6 | filter out sites with reads from both strands | | **SKIP** | | SKIP
| step 7 | ration mapping qualities ref vs alternate alleles | 297 | 9689 | | DP3g95maf05.fil3.vcf
| step 8 | paired status | 297 | 9689 | | DP3g95maf05.fil4.vcf
| step 9 | remove sites quality score < 1/4 depth | 297 | 9688 | | DP3g95maf05.fil5.vcf
| step 10 | depth x quality score cutoff | 297 | 8325 | 11.00 |
| step 11 | He > 0.6 & Fis > 0.5 & Fix < -0.5 | 297 | 8206 | 27 min | DP3g95maf05.FIL.HFis.recode.vcf
| step 12 | remove extreme outliers individual O HET | 297 | | | DP3g95maf05.FIL.HFis.indHet.recode.vcf
| step 12 | remove extreme outliers individual O HET | 297 | 8206 | | DP3g95maf05.FIL.HFis.indHet.recode.vcf
| step 13 | rename | | | | dip_all_filtered_origid.vcf
### rename individuals for conventional naming system and create population map
......@@ -93,49 +87,7 @@ output:
- dip_all_filtered.vcf
- dip_popualtion_map_297ind.txt
- mul_all_filtered.vcf
- mul_population_map_413ind.txt
## 02-Bayescan
Detect Fst outliers by bayesian inference with the [BayeScan software](http://cmpg.unibe.ch/software/BayeScan/)
### step 1: convert vcf files to Bayescan .txt files
This script will load your vcf file, determine the population identifier for each
individual, and return a bayescan .txt inputfile
set arguments :
$1 = input file (vcf)
$2 = species code
The script is currently set to detect and assign two populations (K). Run the script
interactively if you want to determine and assign other values of K.
#### for mullus :
```
Rscript --vanilla Bayescan_input.R ../01-SNPfilters/02-Mullus/mul_all_filtered.vcf mul
```
#### for diplodus :
```
Rscript --vanilla Bayescan_input.R ../01-SNPfilters/01-Diplodus/dips_all_filtered.vcf dip
```
### step 2: determine outliers with Bayescan
download and compile Bayescan from [here](http://cmpg.unibe.ch/software/BayeScan/download.html)
and copy the executable to your local bin folder:
```
cp ~/programmes/BayeScan2.1/binaries/BayeScan2.1_macos64bits /usr/local/bin/
```
#### run the BayeScan model for mullus and diplodus data
multiple runs with different parameters and different input data, with two chains per run
to compare convergence later
```
bash run_bayescan.sh
```
### step 3: verify convergence and extract outliers
Run interactive R script called `Bayescan_evaluation.R`
The script also extracts outlier lists for the different runs and export loci positions for later subsetting (with run index)
- mul_population_map_467ind.txt
## 03-PCAdapt
......@@ -147,6 +99,8 @@ R and retrieve loci positions from vcf file
This script will load your vcf file, detect outlier loci using the package pcadapt and
output a list of outlier loci positions that can be used to subset your vcf file
EDIT: must run interactively to choose K for each species
set arguments :
$1 = input file (vcf)
$2 = species code
......@@ -159,7 +113,7 @@ see how many outliers were detected :
```
wc -l outl_pos_pcadpt_mul.txt
```
1327 outliers detected for mullus
291 outliers detected for mullus
#### for diplodus :
```
......@@ -167,14 +121,14 @@ Rscript --vanilla PCAdapt.R ../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf di
wc -l outl_pos_pcadpt_dip.txt
```
388 outliers detected for diplodus
413 outliers detected for diplodus
## 04-finalPrep
Final steps to get neutral and adaptive SNP sets and correct file formats for both species.
This script subsets the filtered vcf file from 01-SNPfilters by outlier positions detected
in 02-BayeScan and 03-PCAdapt.
in 03-PCAdapt.
It also subsets the same vcf file for the remaining neutral positions and applies a final
filter for HWE.
......@@ -189,7 +143,6 @@ outputted .gen.txt file: add information on first line (otherwise genodive won't
To run the script, set arguments:
$1 = input file (vcf)
$2 = species code
$3 = bayescan run index
#### for diplodus
```
......@@ -197,11 +150,10 @@ mkdir 01-Diplodus
cp ../01-SNPfilters/01-Diplodus/dip_population_map_*.txt 01-Diplodus/
cp ../01-SNPfilters/01-Diplodus/dip_all_filtered.vcf 01-Diplodus/
bash outlier_positions.sh 01-Diplodus/dip_all_filtered.vcf dip run1
bash outlier_positions.sh 01-Diplodus/dip_all_filtered.vcf dip
```
In total, 494 outlier loci were detected, with 10 loci detected by both the BayeScan and PCAdapt method.
After HWE filter, 7570 neutral loci were retained.
After HWE filter, 7655 neutral loci were retained.
#### for Mullus
```
......@@ -209,10 +161,9 @@ mkdir 02-Mullus
cp ../01-SNPfilters/02-Mullus/mul_population_map_*.txt 02-Mullus/
cp ../01-SNPfilters/02-Mullus/mul_all_filtered.vcf 02-Mullus/
bash outlier_positions.sh 02-Mullus/mul_all_filtered.vcf mul run1
bash outlier_positions.sh 02-Mullus/mul_all_filtered.vcf mul
```
In total, 2680 adaptive loci were detected, with 10 loci detected by both the BayeScan and PCAdapt method.
After HWE filter, 12432 neutral loci were retained.
After HWE filter, 2462 neutral loci were retained.
#### clean up files to separate directories
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment