# Andreia Miraldo, Sen Li, Michael K. Borregaard, Alexander Floréz-Rodriguéz, Shyam Gopalakrishnan, Mirneza Risvanovic, Zhiheng Wang, Carsten Rahbek, Katharine A. Marske & David Nogués-Bravo
#
# Submitted to Science, 2016
# Code in this file by Sen Li and Michael K. Borregaard
"""
A function to compute the GD (Genetic Diversity) value from a set of sequences. The algorithm compares all pairwise combinations of sequences and calculates the proportion of loci that differ within each pair. Returns a DataFrame with the identities of the two sequences, the length of each sequence, their overlap, and the computed pairwise divergence values.
**Parameters**
* 'species_seqs': A matrix where the rows are aligned genetic sequences and the columns are loci. Base pairs must be coded as 1, 2, 3 or 4, with 0 signifying that the locus is absent from the alignment.
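**Example**
A minimal sketch of the per-pair calculation described above, written for Julia 1.x. It assumes that a locus coded 0 in either sequence is excluded from the comparison; that rule and the helper name `pairwise_divergence` are illustrative assumptions, not taken from this file.

```julia
# Hypothetical helper: proportion of differing loci between two aligned sequences.
# Assumption: a locus counts only if both sequences have a base there (code != 0).
function pairwise_divergence(seq1::AbstractVector{Int}, seq2::AbstractVector{Int})
    length(seq1) == length(seq2) || error("sequences must be aligned to the same length")
    overlap = 0  # loci present in both sequences
    diffs = 0    # overlapping loci where the bases differ
    for i in 1:length(seq1)
        if seq1[i] != 0 && seq2[i] != 0
            overlap += 1
            if seq1[i] != seq2[i]
                diffs += 1
            end
        end
    end
    # With no shared loci the proportion is undefined, so return NaN
    return overlap == 0 ? NaN : diffs / overlap
end

# Three aligned sequences of five loci each (0 = locus absent)
seqs = [1 2 3 4 0;
        1 2 4 4 2;
        0 2 3 1 2]
d12 = pairwise_divergence(seqs[1, :], seqs[2, :])
```

Sequences 1 and 2 share four loci and differ at one of them, so `d12` is 0.25.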
"""
function compare_pairwise(species_seqs::Matrix{Int})
    num_seqs = size(species_seqs, 1) # count the number of sequences
    if num_seqs < 2 # return an empty state if there is only one sequence
"""
A function to create the master data matrices that are used to compute genetic diversity, assess data quality and run sensitivity analyses.
**Parameters**
* 'foldername' : The name of a folder containing the data files. There must be files of two types (file extension 'fasta' and file extension 'coords') sharing the same base filename, such as the species name (folder contents could be 'Bufo_bufo.fasta, Bufo_bufo.coords, Rana_arvalis.fasta, Rana_arvalis.coords', etc.). The .fasta files contain the sequences as an m x n integer matrix, where m is the number of sequences and n is the length of the alignment. The .coords files contain the geographic coordinates of the sequences as an m x 2 floating point matrix, with latitude in the first column and longitude in the second.
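**Example**
A minimal sketch of the expected folder layout, written for Julia 1.x. The vector `files` is a hypothetical stand-in for the result of `readdir(foldername)`, and the check for a matching .coords file is an illustrative addition rather than something this function is known to do.

```julia
# Hypothetical folder contents, standing in for readdir(foldername)
files = ["Bufo_bufo.fasta", "Bufo_bufo.coords",
         "Rana_arvalis.fasta", "Rana_arvalis.coords"]

# Species names are the .fasta file names without their extension
species = [splitext(f)[1] for f in files if splitext(f)[2] == ".fasta"]

# Every species should also have a matching .coords file
has_coords = all(string(s, ".coords") in files for s in species)
```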
"""
function create_master_matrices(foldername::AbstractString) # AbstractString replaces ASCIIString, which was removed after Julia 0.4
    cluster_name = basename(foldername)
    # Identify unique file names, ignoring the extension
    species_list = [x[1:(end-6)] for x in filter(st -> contains(st, ".fasta"), readdir(foldername))]
    num_files = size(species_list, 1)
    # Pre-initialize the DataFrame to ensure correct element types
    equalarea = latbands = gridcells = DataFrame(species = String[], cell = String[], seq1 = Int[], seq2 = Int[], length_seq1 = Int[], length_seq2 = Int[], overlap = Float64[], commons = Int[], num_per_bp = Float64[])
"""
A function to calculate the summary statistic for all sites (e.g. grid cells or biomes) for one species.
**Parameters**
* 'species': A string with the name of the species
* 'species_seqs': A matrix where the rows are aligned genetic sequences and the columns are loci. Base pairs must be coded as 1, 2, 3 or 4, with 0 signifying that the locus is absent from the alignment.
* 'sitenames': A vector of strings with the names of the sites
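**Example**
A minimal sketch of how the rows of 'species_seqs' could be grouped by site before a per-site statistic is computed, written for Julia 1.x. It assumes that 'sitenames' holds one entry per sequence (row); the data and the name `seqs_by_site` are hypothetical.

```julia
# Hypothetical input: four aligned sequences sampled from two sites
species_seqs = [1 2 3 4;
                1 2 4 4;
                1 1 3 4;
                2 2 3 4]
sitenames = ["cell_1", "cell_2", "cell_1", "cell_1"]

# Collect the sequences belonging to each site
seqs_by_site = Dict{String, Matrix{Int}}()
for site in unique(sitenames)
    seqs_by_site[site] = species_seqs[sitenames .== site, :]
end
```

Each entry of `seqs_by_site` is itself a sequence matrix, so it could be passed to a pairwise comparison such as `compare_pairwise`.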
"""
function calcspecies(species::String, species_seqs::Matrix{Int}, sitenames::Vector{String})
    tmp = DataFrame(species = fill(species, lns), cell = fill(uniq_grid[iter_grid,:][1], lns)) # Expand species and cell names to the length of the resulting DataFrame
Each row represents one cell (each cell is represented by a single row) and contains the coordinates of the cell and the mean genetic diversity (the median could also be calculated). To be added to each row: 1) the number of species in the cell, 2) the mean (and median) number of individuals per species, and 3) the standard deviation.