Commit 3a267df7 authored by peguerin's avatar peguerin
Browse files

Initial commit

parents
Codes for the paper : Predicting genotype environmental range from genome–environment associations
===============================================================================
**Stéphanie Manel, Marco Andrello, Karine Henry, Daphné Verdelet, Aude Darracq, Pierre‐Edouard Guerin, Bruno Desprez, Pierre Devaux**
Molecular Ecology, may 2017
_______________________________________________________________________________
This repository contains all the scripts to calculate metrics (nucleotide diversity, Tajima's D...) on the beets genome from SNP data.
These metrics were necessary to validate our approach to predict the environmental range of species genotypes from the genetic markers significantly associated with those environmental variables in an independent set of individuals.
We applied this approach to predict aridity in a database constituted of 950 individuals of wild beets and 299 individuals of cultivated beets genotyped at 14,409 random single nucleotide polymorphisms (SNPs).
This study was funded by the French Government, under the management of the Research National Agency (ANR‐11‐BTBR‐0007) through the [AKER programme](http://www.aker-betterave.fr) in collaboration with [Florimond Desprez company](http://www.florimond-desprez.com).
_______________________________________________________________________________
# Prerequisites
## Softwares
You must install the following softwares :
* [VCFTOOLS](http://vcftools.sourceforge.net/)
* [TABIX & BGZIP](https://github.com/samtools/htslib/releases/tag/1.4.1)
* [R Version 3.2.3](https://cran.r-project.org/)
* `R-package` ggplot2
* `R-package` PopGenome
* [Python 2.7.12](https://www.python.org/)
## Data Files
The included data files are :
* [Positions.14409.txt](data/Positions.14409.txt): List of ID|position|scaffold|chromosome of 14409 SNPs.
* [Data.950.sauvages.txt](data/Data.950.sauvages.txt): List of names of the 950 indivuals of interest.
* [NoPool.14409.csv](data/NoPool.14409.csv): Table of genotypes of all the indivuals for the 14409 SNPs.
* [Noms.marqueurs.LFMM.gINLAnd.csv](data/Noms.marqueurs.LFMM.gINLAnd.csv): list of SNP IDs and results from Ginland & LFMM methods
* [bon_exemple.vcf](bon_exemple.vcf): Template of VCF format file used to create VCF files.
# Code sources
scripts used to calculate statistics on the genome from SNP data
## BASH scripts
* [vcf4PopGenome_protocole.sh](vcf4PopGenome_protocole.sh) : Creates VCF.GZ files using TABIX, BGZIP and VCFTOOLS for each chromosome into [mes_vcf/](mes_vcf). Uncompressed VCF files will be saved into a new [mes_vcf_save/](mes_vcf_save).
* [fabrique_outlier.sh](fabrique_outlier.sh): Generates a list of outliers SNPs positions for each chromosome into [mes_outliers/](mes_outliers).
* [add_ID_to_tables.sh](add_ID_to_tables.sh) : Creates tables with ID of SNPs into [tables/avec_id](tables/avec_id) folder from [tables/](tables) results.
## Python scripts
* [get_col.py](get_col.py): Select columns of a CSV file according to a list of colunm's names.
* [convert_data2vcf.py](convert_data2vcf.py) : Creates VCF files for each chromosome of each SNP with genotype of each indivuals.
* [get_id_snp.py](get_id_snp.py): get position|chromosome information in a table of SNP and find his ID in a VCF file.
## R scripts
* [fabrique_outlier.R](fabrique_outlier.R): From [Noms.marqueurs.LFMM.gINLAnd.csv](data/Noms.marqueurs.LFMM.gINLAnd.csv), it provides a list of outliers SNPs IDs [nom_84_outliers.txt](nom_84_outliers.txt)
* [generate_fig_tab.R](generate_fig_tab.R) : Generates sliding windows genome statistics into [figures/](figures) and [tables/](tables) folders using [mes_vcfs/](mes_vcfs) and [mes_outliers/](mes_outliers) data.
* [analysis_tables.R](analysis_tables.R) : Basic statistical analysis on `tables/avec_id/all_stats_propre.csv`
# Workflow
## Calculate metrics (nucleotide diversity, Tajima's D...) on outliers and non-outliers SNPs and analysis
### 1. Creates VCF files of all the SNP and selected individuals
* Run [vcf4PopGenome_protocole.sh](vcf4PopGenome_protocole.sh) to create VCF.GZ files into [mes_vcf/](mes_vcf) and VCF files into [mes_vcf_save/](mes_vcf_save), using [data](data) files and [bon_exemple.vcf](bon_exemple.vcf)
```
bash vcf4PopGenome_protocole.sh
```
### 2. Get outliers SNPs and their positions
* Run [fabrique_outlier.R](fabrique_outlier.R) to create [nom_84_outliers.txt](nom_84_outliers.txt), the list of outliers SNPs IDs
* Run [fabrique_outlier.sh](fabrique_outlier.sh) to create a list of outliers SNPs positions for each chromosome into [mes_outliers/](mes_outliers)
```
bash fabrique_outlier.sh
```
### 3. Generates figures and table of genome sliding 20Kbp-windows metrics
* Run [generate_fig_tab.R](generate_fig_tab.R) to generate plot .PDF figures into [figures/](figures) folder and .CSV tables into [tables/](tables) for each chromosome
```
Rscript generate_fig_tab.R
```
### 4. Analyse statistics on tables
* Run [add_ID_to_tables.sh](add_ID_to_tables.sh) to generate a merged table [tables/avec_id/all_stats_propre.csv](tables/avec_id/all_stats_propre.csv) with SNPs IDs and statistics of all the chromosomes from [tables/](tables) to [tables/avec_id/](tables/avec_id)
* Run [analysis_tables.R](analysis_tables.R) to have outliers SNP and non-outliers SNP statistics, then do some basic statistical tests.
```
bash add_ID_to_tables.sh
Rscript analysis_tables.R
```
# Results
## Beets genome chromosome 1 sequence: SNPs-metrics and significantly associated with aridity SNPs positions
![chr1 metrics](figures/chr1_stats.png)
Every 5 Kbp nucleotide steps, we calculated following metrics on a 20Kbp windows onto the genome :
* **SNP density**: number of SNPs
* **pi** : nucleotide diversity
* **D**: Tajima D
* **D\***: Fu and Li D
Positions of genetic markers SNP significantly associated with aridity environmental variable are indicated by a black dot
\ No newline at end of file
#mkdir tables/avec_id
#Get IA des SNP
COUNTER=1
for vcfFichier in `ls mes_vcf/chr*.vcf.gz`
do
python get_id_snp.py -f <(zcat $vcfFichier) -f2 "tables/chr"$COUNTER"_stats.csv" -o "tables/avec_id/chr"$COUNTER"_stats_id.csv"
let COUNTER=COUNTER+1
done
#merge all tables
cat tables/avec_id/chr*_stats_id.csv >> tables/avec_id/all_stats.csv
#enlever les NA
grep -v 0,0,0 tables/avec_id/all_stats.csv > tables/avec_id/all_stats_propre.csv
############################
#fenetre 1kB
test = read.csv("tables_w1Kb/train/all_stats_propre.csv",header=F)
#fenetre 100kB avec id
test = read.csv("tables_w100Kb/avec_id/all_stats_propre.csv")
##fenetre 20kB avec id
test = read.csv("tables_w20Kb/avec_id/all_stats_propre.csv")
test = read.csv("tables_w20Kb/avec_id/all_stats.csv")
colnames(test)= c("id","chrom","position", "is_outlier", "nt_diversity","nt_diversity_per_site","D_tajima","Fu_li_D")
#snp density
snpdensity = test$nt_diversity/test$nt_diversity_per_site
snpdensity=replace(snpdensity,is.na(snpdensity),0)
test = cbind(test,snpdensity)
outlier = test[which(test$is_outlier==1),]
inlier = test[which(test$is_outlier==0),]
zin = inlier[which(inlier$snpdensity > 2),]
#wilcoxon tester liaison outlier/inlier (quali) versus nt_divers (quanti)
wilcox.test(outlier$nt_diversity,inlier$nt_diversity)
wilcox.test(outlier$nt_diversity_per_site,inlier$nt_diversity_per_site)
wilcox.test(outlier$D_tajima,inlier$D_tajima)
res = t.test(outlier$nt_diversity,inlier$nt_diversity)
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
#================================================================================
#AUTHORS
#================================================================================
'''
GUERIN Pierre-Edouard
France, Montpellier
CNRS, CEFE
05/2017
'''
#================================================================================
#NOTICE
#================================================================================
'''
convert CSV file into list of individual allele A (SNPs) sequence
for homozygote get A or B allele
for heterozygote get [randomly] A or B allele
and put all the data into VCF FORMAT able to be read by Rpackage PopGenome
'''
#================================================================================
#MODULES
#================================================================================
import re
import sys
import subprocess
import argparse
import os.path
import random
#================================================================================
#CLASS
#================================================================================
class SNP:
def __init__(self,nom,chrom,pos,alleleRef,alleleAlt):
self.nom = nom
self.dicOfIndv={}
self.chrom = chrom
self.pos = pos
self.alleleRef=alleleRef
self.alleleAlt=alleleAlt
def rename_snp(self,nom):
self.nom = nom
def add_indv_GT(self,indv_name,genoGT):
self.dicOfIndv[indv_name] = genoGT
class Chromosome:
def __init__(self,numer):
self.numero = numer
self.dicOfSnp = {}
def append_snp(self,name_snp, snp):
self.dicOfSnp[name_snp] = snp
def set_stats(self,minimum,maximum,length):
self.minimum = minimum
self.maximum = maximum
self.length = length
def remove_snp(self,snp_name):
del self.dicOfSnp[snp_name]
def update_snp(self,snp_name,new_snp):
self.dicOfSnp[snp_name]=new_snp
#================================================================================
#FUNCTIONS
#================================================================================
#================================================================================
#ARGUMENTS
#================================================================================
parser = argparse.ArgumentParser(description='specie seq geo')
parser.add_argument("-o","--output", type=str)
parser.add_argument("-f","--inputFile",type=str)
parser.add_argument("-f2","--inputFile2",type=str)
parser.add_argument("-vcf","--vcfbase",type=str)
parser.add_argument("-f3","--inputFile3",type=str)
#================================================================================
#MAIN
#================================================================================
args = parser.parse_args()
resultsPath = args.output
inputFile = args.inputFile
inputFile2=args.inputFile2
vcfbase=args.vcfbase
previous_scaff=0
shift_pos = 0
previous_chrom=0
previous_scaff_pos=0
#recuperer la liste des coords CHR:POS des SNPs
with open(inputFile2,'r') as readFile2:
dicOfSNP = {}
listOfChrom = []
#initialiser la liste de chromosome
for ch_numero in xrange(1,10):
listOfChrom.append(Chromosome(ch_numero))
for ligne in readFile2.readlines()[1:]:
ligneSplit=ligne.split()
nom=ligneSplit[0]
chrom=ligneSplit[1]
scaff=ligneSplit[2]
scaff_pos=ligneSplit[3]
if chrom != "NA" and chrom != "Chromosome" and scaff_pos !="NA" and scaff != "NA" and scaff != "Scaffold":
if previous_chrom != chrom:
shift_pos=0
previous_scaff=0
previous_scaff_pos=0
if previous_scaff != scaff:
shift_pos+=int(previous_scaff_pos)+150000
previous_scaff=scaff
pos=int(scaff_pos)+shift_pos
previous_scaff_pos=int(scaff_pos)
previous_chrom=chrom
dicOfSNP[nom] = SNP(nom,chrom,pos,"-","-")
listOfChrom[int(chrom)-1].append_snp(nom,dicOfSNP[nom])
readFile2.close()
#A ce stade on a la liste des coords CHR:POS par nom pour chaque SNP
header=1
nb_chr=len(listOfChrom)
nb_chr=9
for ch_id in xrange(0,nb_chr):
#on ouvre le fichier des genotypes par nom pour chaque indv pour chaque SNP
with open(inputFile,'r') as readFile:
for ligne in readFile.readlines():
ligneSplit=ligne.split(";")
if header==1:
official_name_indv=[]
for chaque_indv in ligneSplit[10:-1]:
official_name_indv.append(chaque_indv)
header=0
nom = ligneSplit[0]
#si le SNP est present sur le CHR
if nom in listOfChrom[ch_id].dicOfSnp:
alleleRef = str(ligneSplit[1])
alleleAlt= str(ligneSplit[2])
if alleleRef!="-" and alleleAlt !="-":
#alleles = [alleleRef,alleleAlt]
#genot de chaque indv
chrom=listOfChrom[ch_id].dicOfSnp[nom].chrom
pos=listOfChrom[ch_id].dicOfSnp[nom].pos
new_snp=SNP(nom,chrom,int(pos),alleleRef,alleleAlt)
listOfChrom[ch_id].dicOfSnp[nom]=new_snp
id_indv=0
for indvGeno in ligneSplit[10:-1]:
if indvGeno =="AA":
genoGT="0|0"
elif indvGeno == "BB":
genoGT="1|1"
else:
genoGT = random.choice(["0|1","1|0"])
indv_nom=str(official_name_indv[id_indv])
listOfChrom[ch_id].dicOfSnp[nom].add_indv_GT(indv_nom, genoGT)
id_indv+=1
else:
listOfChrom[ch_id].remove_snp(nom)
#suprimer le SNP dont le genotype estmal defini
readFile.close()
#informations max/min/numberofSNP
numero_chr=1
for ch in listOfChrom:
listOfSnp=[]
for key,value in ch.dicOfSnp.items():
listOfSnp.append(value.pos)
sortedSnp = sorted(listOfSnp)
chminimum =str(sortedSnp[0])
chmaximum =str(sortedSnp[-1])
chlength=str(len(sortedSnp))
print "chr"+str(numero_chr), "minpos:"+chminimum+" |", "maxpos:"+chmaximum+" |","numberofSNPs:"+chlength
numero_chr+=1
#ecrire les fichiers VCF
vcfheader=open(vcfbase, 'r').read()
numero_chr=1
indvheader =""
for indv in official_name_indv:
indvheader+="{0}\t".format(indv)
indvheader=indvheader[0:-1]
indvheader+="\n"
vcftextbase=vcfheader[0:-1]+indvheader
for ch in listOfChrom:
#print key, value.chrom, value.pos, value.dicOfIndv["M03212"]
outputFile=resultsPath+"/chr"+str(numero_chr)+".vcf"
with open(outputFile,'w') as writeFile:
vcftext=vcftextbase
for key, value in ch.dicOfSnp.items():
vcftext+="{0}\t{1}\t{2}\t".format(value.chrom,value.pos,value.nom)
vcftext+="{0}\t{1}\t.\t.\t.\tGT".format(value.alleleRef,value.alleleAlt)
for indv_nom in official_name_indv:
vcftext+="\t{0}".format(value.dicOfIndv[indv_nom])
vcftext+="\n"
writeFile.write(vcftext)
writeFile.close()
numero_chr+=1
print "Done"
#for i in xrange(0,len(snps[0])):
# print snps[0][0]
probeset_id
Allele_A
Allele_B
ConversionType.y
CR.y
MinorAlleleFrequency.y
n_AA.y
n_AB.y
n_BB.y
n_NC.y
Z00043
Z02277
Z00737
Z00081
Z00084
Z00086
Z00087
Z00090
Z00103
Z00104
Z00105
Z00107
Z00112
Z04253.15
Z03471
Z00150
Z00156
Z00160
Z00162
Z00167
Z00168
Z00170.1
Z00170.10
Z00170.11
Z00170.12
Z00170.13
Z00170.14
Z00170.15
Z00170.2
Z00170.3
Z00170.4
Z00170.5
Z00170.6
Z00170.7
Z00170.8
Z00170.9
Z00175
Z00178
Z00180
Z00182
Z00186
Z00187
Z00189
Z00190
Z00195
Z00198
Z00200
Z00201
Z00208
Z00209
Z00210
Z00216
Z00217
Z00219
Z00220
Z00221
Z00223
Z00224
Z00225
Z00228
Z00230
Z00231
Z00234
Z00235
Z00236
Z00237
Z00240
Z00241
Z00242
Z00243
Z00244
Z00246
Z00247
Z00248
Z00249
Z00250
Z00251
Z00252
Z00253
Z00255
Z00256
Z00257
Z00259
Z00260
Z00261
Z00264
Z00265
Z00270
Z00285
Z00293
Z00294
Z00295
Z00296
Z00298
Z00300
Z00306
Z00307
Z00308
Z00309
Z00365.1
Z00365.10
Z00365.12
Z00365.13
Z00365.14
Z00365.15
Z00365.2
Z00365.3
Z00365.4
Z00365.5
Z00365.6
Z00365.7
Z00365.9
Z00399
Z00406
Z00409
Z00411
Z00412.CORE
Z00412.1
Z00412.10
Z00412.11
Z00412.12
Z00412.13
Z00412.14
Z00412.2
Z00412.3
Z00412.4
Z00412.5
Z00412.6
Z00412.7
Z00412.8
Z00412.9
Z00420
Z00449
Z00450
Z00456
Z00466
Z00468
Z00470
Z00474
Z00485
Z00492
Z00493
Z00494
Z00499
Z00500
Z00501
Z00502
Z00505
Z00507
Z00508
Z00511
Z00528
Z00539
Z00540
Z00541
Z00542
Z00543
Z00546
Z00547
Z00548
Z00549
Z00550
Z00552
Z00553
Z00554
Z00558
Z00559
Z04215
Z00563
Z00564
Z00565
Z00566
Z00567
Z00568
Z00570
Z00571
Z00561
Z00575
Z00581
Z00595
Z00596
Z00610
Z00631
Z00632
Z00643
Z00644
Z00645
Z00646
Z00647
Z00648
Z00649