Analysis scripts for "Global patterns of fish genetic diversity increase with “current” temperature"
================================================

2017

This folder contains all the scripts needed to reproduce the analyses.

# Table of contents

1. [Introduction](#1-introduction)
2. [Installation](#2-installation)
    1. [Prerequisite](#21-prerequisite)
    2. [Data Files](#22-data-files)
    3. [Set up](#23-set-up)
3. [Scripts Code Source](#3-scripts-code-source)
4. [Reporting bugs](#4-reporting-bugs)
5. [Running the pipeline](#5-running-the-pipeline)
    1. [Filter raw data](#51-filter-raw-data)
    2. [Georeferenced sequences alignments by species](#52-georeferenced-sequences-alignments-by-species)
    3. [Species sequence pairwise comparison](#53-species-sequence-pairwise-comparison)
    4. [Genetic Diversity calculation](#54-genetic-diversity-calculation)
    5. [Statistical analysis](#55-statistical-analysis)

# 1. Introduction

The pipeline filters georeferenced CO1 sequences of actinopterygii, aligns them by species, compares the sequences pairwise, computes the mean genetic diversity per equal-area grid cell, and runs the statistical analysis.

# 2. Installation

## 2.1 Prerequisite

You must install the following software and packages:

- [JULIA Version 0.5.2](https://julialang.org/)
- [R Version 3.2.3](https://cran.r-project.org/)
    - [R-package] ggplot2 `install.packages("ggplot2")`
    - [R-package] rgeos `install.packages("rgeos")`
    - [R-package] rgdal `install.packages("rgdal")` (you may need to install `libgdal-dev` first)
    - [R-package] nodiv `install.packages("nodiv")`
    - [R-package] raster `install.packages("raster")`
    - [R-package] lme4 `install.packages("lme4")`
    - [R-package] sp `install.packages("sp")`
    - [R-package] sjPlot `install.packages("sjPlot")`
    - [R-package] FactoMineR `install.packages("FactoMineR")`
    - [R-package] factoextra `install.packages("factoextra")`
    - [R-package] spdep `install.packages("spdep")`
    - [R-package] countrycode `install.packages("countrycode")`
- [Python Version 2.7.12](https://www.python.org/)
- [MUSCLE Version 3.8.31](https://www.drive5.com/muscle/)

## 2.2 Data Files

The included data files are:

* `02-raw_data/seqbold_data.txt` : Georeferenced sequences of individuals from the supergroup "actinopterygii" have been
downloaded from (http://www.boldsystems.org/index.php/Public_SearchTerms?taxon=&searchMenu=records&query=actinopterygii)
* `01-infos/grid_equalarea200km` : Shapefile of the world map in equal area projection epsg:4326 with nested equal area grids (cell size of 200 km)
* `01-infos/ne_110m_land` : Shapefile of the world coastline from (http://www.naturalearthdata.com)
* `01-infos/equalarea_id_coords.tsv` : ID and left/right/top/bottom coordinates of each equal area cell in the shapefile grid_equalarea200km.
* `01-infos/marine_actinopterygii_species.txt` : List of "actinopterygii" saltwater species according to (http://www.fishbase.org/)
* `01-infos/freshwater_behrman_worldcoast_data_object.RData` : R spatial object from the "sp" package, which is an equal area grid in Behrmann projection with the world coastline shape and presence/absence of freshwater fish species from (https://www.iucn.org/)
* `01-infos/marine_behrman_worldcoast_data_object.RData` : same kind of object, for marine fish species.
* Some files are still missing from this list...

## 2.3 Set Up

Clone the project and switch to the main folder, which is your working directory:

```
git clone http://gitlab.mbb.univ-montp2.fr/reservebenefit/worldmap_fish_genetic_diversity.git
cd worldmap_fish_genetic_diversity
```

Then download the georeferenced sequences of actinopterygii individuals as a "combined TSV file" from (http://www.boldsystems.org/index.php/Public_SearchTerms?taxon=&searchMenu=records&query=actinopterygii).
Store the file in the folder `02-raw_data` and rename it `seqbold_data.txt`.

You're ready to run the analysis. Now follow the instructions at [Running the pipeline](#5-running-the-pipeline).

# 3. Scripts Code Source

## 3.1 [00-scripts/step1](00-scripts/step1) : filter raw data

- BASH scripts
    * [filter_raw_data.sh](00-scripts/step1/filter_raw_data.sh) : keeps only the CO1 sequences with lat/lon information.
    * [get_geonames_coordinates.sh](00-scripts/step1/get_geonames_coordinates.sh) : uses (http://www.geonames.org/) to find the missing coordinates of individual sequences from the textual description of their location.
- PYTHON scripts
    * [lat_long_DMS_DD_converter.py](00-scripts/step1/lat_long_DMS_DD_converter.py) : converts the given coordinates from DMS format to DD format.

## 3.2 [00-scripts/step2](00-scripts/step2) : georeferenced sequences alignments by species

- BASH scripts
    * [seq_alnt_filtered_data.sh](00-scripts/step2/seq_alnt_filtered_data.sh) : aligns sequences from the same species with MUSCLE and creates a coordinates file for each sequence.
    * [cluster_freshwater_vs_marine.sh](00-scripts/step2/cluster_freshwater_vs_marine.sh) : according to a list of marine species, moves the fasta and coords files into the marine and freshwater directories.
- PYTHON scripts
    * [fasta_coords_files_species_generator.py](00-scripts/step2/fasta_coords_files_species_generator.py) : extracts sequences and associated coordinates from the filtered data.
- R scripts
    * [equalareacoords.R](00-scripts/step2/equalareacoords.R) : assigns to each sequence, from its coordinates, the ID of a cell of the equal area world map shapefile.

## 3.3 [00-scripts/step3](00-scripts/step3) : species sequence pairwise comparison

- JULIA scripts
    * [Lib_Compare_Pairwise.jl](00-scripts/step3/Lib_Compare_Pairwise.jl) : functions to compute the genetic diversity value from a set of sequences.
    * [Lib_Create_Master_Matrices.jl](00-scripts/step3/Lib_Create_Master_Matrices.jl) : functions to create the master data matrices that are used to compute genetic diversity.
    * [master_matrices.jl](00-scripts/step3/master_matrices.jl) : generates master data matrices from species sequence alignments.
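To make the idea behind the pairwise comparison step more concrete, here is a minimal Python sketch. It assumes that the genetic diversity of a species is the mean proportion of differing sites over all pairs of aligned sequences, and that gaps and ambiguous bases are skipped; the actual definition used in `Lib_Compare_Pairwise.jl` may differ. The names `pairwise_difference` and `mean_pairwise_diversity` are hypothetical and do not exist in this repository.

```python
from itertools import combinations

# Characters treated as missing data (an assumption of this sketch).
MISSING = {"-", "N", "?"}


def pairwise_difference(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    Returns None when no site can be compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in MISSING or b in MISSING:
            continue
        compared += 1
        if a != b:
            differing += 1
    # float() keeps true division under Python 2.7 as well as Python 3.
    return float(differing) / compared if compared else None


def mean_pairwise_diversity(sequences):
    """Mean pairwise difference over all pairs; None with fewer than one usable pair."""
    values = [pairwise_difference(a, b) for a, b in combinations(sequences, 2)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None
```

For example, `mean_pairwise_diversity(["ACGT", "ACGA", "ACGT"])` averages the three pairwise values 0.25, 0 and 0.25.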
## 3.4 [00-scripts/step4](00-scripts/step4) : genetic Diversity calculation

- BASH scripts
    * [gdval_by_site.sh](00-scripts/step4/gdval_by_site.sh) : generates CSV files with 2 columns: cell ID and mean genetic diversity per species in the cell.
- JULIA scripts
    * [equalarea_numbers.jl](00-scripts/step4/equalarea_numbers.jl) : assigns a mean genetic diversity to each equal area grid cell. Genetic diversity is calculated from the master data matrices.
    * [metrics_by_area_and_species.jl](00-scripts/step4/metrics_by_area_and_species.jl) : generates the files used for the statistical analysis in the next step: mean genetic diversity per cell, genetic diversity per species per cell, number of individuals per species, number of species per cell, cell coordinates, cell ID...
    * [Lib_GD_summary_functions.jl](00-scripts/step4/Lib_GD_summary_functions.jl) : functions to calculate genetic diversity at the species level and at the cell level.

## 3.5 [00-scripts/step5](00-scripts/step5) : statistical analysis

- R scripts
    * `descripteurs.R`
    * `figures.R`

This step is a work in progress.

# 4. Reporting bugs

If you're sure you've found a bug (for example, one of my programs crashes with an obscure error message, or the resulting file is missing part of the original data), then by all means submit a bug report.

I use [GitLab's issue system](https://gitlab.com/reservebenefit/worldmap_fish_genetic_diversity/issues) as my bug database. You can submit your bug reports there. Please be as verbose as possible: include the command line you ran, etc.

# 5. Running the pipeline

## 5.1 Filter raw data

```
bash ./00-scripts/step1/filter_raw_data.sh
```

## 5.2 Georeferenced sequences alignments by species

```
bash ./00-scripts/step2/seq_alnt_filtered_data.sh
mkdir -p ./06-species_alnt_cluster/total
mkdir -p ./06-species_alnt_cluster/freshwater
mkdir -p ./06-species_alnt_cluster/marine
bash ./00-scripts/step2/cluster_freshwater_vs_marine.sh
Rscript ./00-scripts/step2/equalareacoords.R
```

## 5.3 Species sequence pairwise comparison

```
julia ./00-scripts/step3/master_matrices.jl
```

## 5.4 Genetic Diversity calculation

```
julia ./00-scripts/step4/equalarea_numbers.jl
bash ./00-scripts/step4/gdval_by_site.sh
julia ./00-scripts/step4/metrics_by_area_and_species.jl
```

## 5.5 Statistical analysis

```
Rscript ./00-scripts/step5/descripteurs.R
Rscript ./00-scripts/step5/figures.R
```
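
If you prefer to launch all the steps in one go, the commands of this section could be chained with a small driver. The following is only a sketch and is not part of the repository: the names `PIPELINE_STEPS` and `run_pipeline` are hypothetical, `mkdir` is called with `-p` so that a rerun does not fail on existing directories, and the script is assumed to be started from the main folder of the project.

```python
import subprocess

# Commands mirroring section 5, in execution order (hypothetical driver).
PIPELINE_STEPS = [
    ["bash", "./00-scripts/step1/filter_raw_data.sh"],
    ["bash", "./00-scripts/step2/seq_alnt_filtered_data.sh"],
    ["mkdir", "-p", "./06-species_alnt_cluster/total"],
    ["mkdir", "-p", "./06-species_alnt_cluster/freshwater"],
    ["mkdir", "-p", "./06-species_alnt_cluster/marine"],
    ["bash", "./00-scripts/step2/cluster_freshwater_vs_marine.sh"],
    ["Rscript", "./00-scripts/step2/equalareacoords.R"],
    ["julia", "./00-scripts/step3/master_matrices.jl"],
    ["julia", "./00-scripts/step4/equalarea_numbers.jl"],
    ["bash", "./00-scripts/step4/gdval_by_site.sh"],
    ["julia", "./00-scripts/step4/metrics_by_area_and_species.jl"],
    ["Rscript", "./00-scripts/step5/descripteurs.R"],
    ["Rscript", "./00-scripts/step5/figures.R"],
]


def run_pipeline(steps=PIPELINE_STEPS, runner=subprocess.check_call):
    """Run each command in order and stop at the first failure.

    With the default runner, subprocess.check_call raises
    CalledProcessError as soon as a command exits with a non-zero status,
    so later steps never run on incomplete output.
    """
    completed = []
    for command in steps:
        print("Running: " + " ".join(command))
        runner(command)
        completed.append(command)
    return completed
```

Calling `run_pipeline()` from the main folder runs the thirteen commands in sequence. The `runner` argument exists only so that the order of the steps can be checked without executing anything.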