Commit 545dee1c authored by Romain Feron's avatar Romain Feron
Browse files

Merge branch 'master' of github.com:INRA-LPGP/radseq_analyses_pipeline

parents 9f5c4a59 dc3a59ff
MIT License
Copyright (c) 2017 Romain Feron
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
## RADSeq_analysis
### Overview
This software is part of the RADSex pipeline, a method to detect sex-linked sequences in RADSeq data. As such, it is only designed to work with the output of Stacks using specific parameter values. This pipeline was developed for the PhyloSex project, which investigates sex determining factors in a wide range of fish species.
### Requirements
- Python 3.5 or higher
- *Optional (for visualization)* : R 3.3 or higher with the following packages:
+ readr
+ ggplot2
+ reshape2
+ ggdendro
+ grid
+ gtable
+ base
+ svglite
+ scales
- *Optional* : package `progress` to display progress bars in lengthy steps
### Installation
- Clone: `git clone git@github.com:INRA-LPGP/radseq_analysis.git`
- Download the archive and unzip it
- *Optional* : install recommended python packages with `pip3 install -r requirements.txt`
- *Optional* : install R packages for visualization with `Rscript install_packages.R`
### Usage
#### General
`python3 radseq_analysis.py <command> [options]`
**Available commands** :
Command | Description
------------- | ------------
`heatmap` | Generates a matrix of haplotypes sex distribution
`haplotypes` | Extract haplotypes present in a given number of males and females
`frequencies` | Calculate haplotypes frequencies distribution in the population
`rescue` | Find all alleles in sex-linked loci from the Stacks catalog
`visualize` | Visualize analyses results using R
<br/>
#### Heatmap
`python3 radseq_analysis.py heatmap -i input_folder -m popmap [-o output_file]`
*Generates a matrix of dimension (Number of males) x (Number of females). The value at coordinates **(i, j)** corresponds to the number of haplotypes found in precisely **i** males and **j** females.*
**Options** :
Option | Full name | Description
--- | --- | ---
`-i` | `--input-folder` | Path to a folder containing the output of denovo_map |
`-m``--popmap` | Path to a population map file |
`-o``--output-file` | Path to the output file (default: *haplotypes_matrix.tsv*)
<br/>
#### Haplotypes
`python3 radseq_analysis.py haplotypes -i input_folder -m popmap -p positions_list [-o output_file]`
*Extracts all the haplotypes found in a given number of males and females (a position in the haplotypes matrix). The output is a tabulated file with the following fields :*
- *Locus* : catalog ID of the haplotype
- *Males* : number of males in which the haplotype was found
- *Females* : number of females in which the haplotype was found
- *Sequence* : haplotype sequence
- *Male_outliers* : ID of the males in which the haplotype was not found
- *Female_outliers* : ID of the males in which the haplotype was not found
**Options** :
Option | Full name | Description
--- | --- | ---
`-i` | `--input-folder` | Path to a folder containing the output of denovo_map |
`-m``--popmap` | Path to a population map file |
`-p``--positions` | Path to a file containing the list of positions to extract |
`-o``--output-file` | Path to the output file (default: *extracted_haplotypes.tsv*)
<br/>
#### Frequencies
`python3 radseq_analysis.py frequencies -i input_folder [-o output_file]`
*Computes the distribution of haplotypes frequencies in the population. The output is a tabulated file with the following fields*
- *Frequency* : number of individuals in which a haplotype is found
- *Count* : number of haplotypes with the associated frequency in the population
**Options** :
Option | Full name | Description
--- | --- | ---
`-i` | `--input-folder` | Path to a folder containing the output of denovo_map |
`-o``--output-file` | Path to the output file (default: *haplotypes_frequencies.tsv*)
<br/>
#### Rescue
`python3 radseq_analysis.py rescue -i input_folder -s sequences_file [-c coverage_file -o output_file]`
*Find all alleles in sex-linked loci from the Stacks catalog by blasting sex-linked sequences and filtering by similarity. If a coverage file is provided, loci coverage will be corrected based on global coverage differences between individuals. The output is a tabulated file with the following fields :*
- Stack_ID: catalog ID of the sex-linked haplotype
- Haplotype_ID: catalog ID of the rescued haplotype (other allele)
- Sequence: rescued haplotype sequence
- Matches: number of matching bases between the sex-linked haplotype and the rescued haplotype
- Mismatches: number of non-matching bases between the sex-linked haplotype and the rescued haplotype
- Gaps: number of gaps between the sex-linked haplotype and the rescued haplotype
**Options** :
Option | Full name | Description
--- | --- | ---
`-i` | `--input-folder` | Path to a folder containing the output of denovo_map |
`-s``--sequences` | Path to a sequences file (result of *haplotypes*)|
`-c``--coverage-file` | Path to a coverage file (result of *coverage*) |
`-o``--output-file` | Path to the output file (default: *extracted_alleles.tsv*)
<br/>
#### Visualize
`python3 radseq_analysis.py visualize -i input_file -o output_file -m popmap`
*Generate plots to visualize output from heatmap, rescue, or frequencies commands. The input file type is automatically detected. The following plots are generated :*
- **heatmap** : a heatmap representation of the loci matrix, with number of males as abscissis and number of females as ordinates. The color of a tile at position **(i, j)** shows the number of haplotypes shared by exactly **i** males and **j** females.
- **rescue** : two heatmaps reprensentations of the results of clustering for both alleles and individuals; in the first heatmap, the values are presence/absence of loci. In the second heatmap, the values are individual coverage for each locus.
- **frequencies** : a barplot showing the distribution of haplotypes frequencies in the population.
**Options** :
Option | Full name | Description
--- | --- | ---
`-i` | `--input-file` | Path to a file generated by this pipeline |
`-m``--popmap` | Path to a population map file |
`-o``--output-file` | Path to the output file |
**Examples** :
- heatmap :
![Heatmap](./examples/plots/heatmap.png)
- clustering :
![Presence/Absence](./examples/plots/presence_clustering.png)
![Coverage](./examples/plots/coverage_clustering.png)
- frequencies:
![Coverage](./examples/plots/frequencies.png)
### LICENSE
MIT License
Copyright (c) 2017 Romain Feron
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
import os
import itertools
analysis = 'denovo_map_par'
species = ['perca_fluviatilis', 'ameirus_melas']
m_values = [5, 10]
n_values = [1, 2, 3, 4]
M_values = [1, 2, 3, 4]
root_dir = '/work/rferon/'
species_dir = os.path.join(root_dir, 'species')
batch_dir = os.path.join(root_dir, 'batch_processing')
shell_dir = os.path.join(batch_dir, 'shell', analysis, 'analyses')
python_dir = os.path.join(batch_dir, 'python', analysis)
qsub_dir = os.path.join(batch_dir, 'qsub')
qsub_all = os.path.join(qsub_dir, analysis + '_analyses.sh')
qsub_file = open(qsub_all, 'w')
for specie, m, n, M in itertools.product(species, m_values, n_values, M_values):
species_short = specie.split('_')[0][0] + '_' + specie.split('_')[1]
analysis_path = os.path.join(species_dir, specie, 'analyses', 'rad_seq', analysis)
results_path = os.path.join(analysis_path, 'results_' +
'm_' + str(m) + '_' +
'n_' + str(n) + '_' +
'M_' + str(M))
shell_path = os.path.join(shell_dir,
specie + '_' +
'm_' + str(m) + '_' +
'n_' + str(n) + '_' +
'M_' + str(M) + '.sh')
python_path = os.path.join(python_dir,
specie + '_' +
'm_' + str(m) + '_' +
'n_' + str(n) + '_' +
'M_' + str(M) + '.py')
output_dir_base = os.path.join(batch_dir, 'results', analysis, specie)
if not os.path.isdir(output_dir_base):
os.mkdir(output_dir_base)
output_dir = os.path.join(output_dir_base, 'm_' + str(m) + '_' +
'n_' + str(n) + '_' +
'M_' + str(M))
if not os.path.isdir(output_dir):
os.mkdir(output_dir)
python_file = open(python_path, 'w')
python_file.write("import sys\n\n")
python_file.write("sys.path.append('/work/rferon/tools/radseq_analyses_pipeline')\n\n")
python_file.write("from radseq_analyses_pipeline import analyse_directory\n\n")
python_file.write("analyse_directory('" + results_path + "', '" + output_dir + "')\n")
shell_file = open(shell_path, 'w')
shell_file.write('python3 ' + python_path + '\n')
qsub_file.write('qsub ' + shell_path + '\n')
import os
import sys
from radseq_analyses_pipeline import analyse_directory
from radseq_analyses_pipeline.parameters.positions import get_positions_list
species = 'ameirus_melas'
replicate = None
m_value = 3
positions_file_path = sys.argv[1]
positions_list = get_positions_list(positions_file_path)
root_dir = '/work/bimarazene/radseq_all/'
extraction_dir = os.path.join(root_dir, 'analyses', 'extraction')
if not os.path.isdir(os.path.join(extraction_dir, species)):
os.mkdir(os.path.join(extraction_dir, species))
if replicate:
if not os.path.isdir(os.path.join(extraction_dir, species, replicate)):
os.mkdir(os.path.join(extraction_dir, species, replicate))
files_dir = os.path.join(root_dir, 'results', species, replicate, 'm_' + str(m_value))
analyses_dir = os.path.join(extraction_dir, species, replicate, 'm_' + str(m_value))
else:
files_dir = os.path.join(root_dir, 'results', species, 'm_' + str(m_value))
analyses_dir = os.path.join(extraction_dir, species, 'm_' + str(m_value))
# Path to population map
popmap_path = os.path.join(root_dir, 'data', 'popmaps', species + '_popmap.csv')
analyse_directory(root_dir, files_dir, analyses_dir,
positions_list, popmap_path)
if (suppressMessages(suppressWarnings(!require(readr)))) install.packages("readr")
if (suppressMessages(suppressWarnings(!require(ggplot2)))) install.packages("ggplot2")
if (suppressMessages(suppressWarnings(!require(reshape2)))) install.packages("reshape2")
if (suppressMessages(suppressWarnings(!require(ggdendro)))) install.packages("ggdendro")
if (suppressMessages(suppressWarnings(!require(grid)))) install.packages("grid")
if (suppressMessages(suppressWarnings(!require(gtable)))) install.packages("gtable")
if (suppressMessages(suppressWarnings(!require(base)))) install.packages("base")
if (suppressMessages(suppressWarnings(!require(svglite)))) install.packages("svglite")
if (suppressMessages(suppressWarnings(!require(scales)))) install.packages("scales")
from .analyse_directory import analyse_directory
from radseq_analysis.analyse_directory import analyse_directory
from .catalog import get_info_from_catalog
from .haplotypes import get_haplotypes
from .individual_files import get_individual_sequences
from .popmap import load_popmap
from .positions import load_positions_list
from .sequences import get_sequences
from .coverage import get_coverage
from radseq_analysis.file_handler.catalog import get_info_from_catalog
from radseq_analysis.file_handler.haplotypes import get_haplotypes
from radseq_analysis.file_handler.individual_files import get_individual_sequences
from radseq_analysis.file_handler.popmap import load_popmap
from radseq_analysis.file_handler.positions import load_positions_list
from radseq_analysis.file_handler.sequences import get_sequences
from radseq_analysis.file_handler.coverage import get_coverage
from .sex_linked_haplotypes import analysis as sex_linked_haplotypes
from .loci_matrix import analysis as loci_matrix
from .stacks_privacy import analysis as stacks_privacy
from .rescue import analysis as rescue
from .analysis import analysis
from radseq_analysis.modules.sex_linked_haplotypes import analysis as sex_linked_haplotypes
from radseq_analysis.modules.loci_matrix import analysis as loci_matrix
from radseq_analysis.modules.stacks_privacy import analysis as stacks_privacy
from radseq_analysis.modules.rescue import analysis as rescue
from radseq_analysis.modules.visualize import analysis as visualization
from radseq_analysis.modules.analysis import analysis
......@@ -4,11 +4,13 @@ from radseq_analysis.modules import sex_linked_haplotypes
from radseq_analysis.modules import loci_matrix
from radseq_analysis.modules import stacks_privacy
from radseq_analysis.modules import rescue
from radseq_analysis.modules import visualization
from radseq_analysis.file_handler import load_popmap
from radseq_analysis.file_handler import load_positions_list
def analysis(input_dir=None,
input_file_path=None,
popmap_file_path=None,
output_file_path=None,
positions_file_path=None,
......@@ -19,22 +21,26 @@ def analysis(input_dir=None,
parameters = Parameters(files_dir=input_dir,
output_file_path=output_file_path)
species = None
if popmap_file_path:
load_popmap(popmap_file_path, parameters)
species = os.path.split(popmap_file_path)[1]
species = os.path.splitext(species)[0]
species = '_'.join(s for s in species.split('_')[:-1])
parameters.species = species
if len(species.split('_')) > 1:
species = '_'.join(s for s in species.split('_')[:-1])
parameters.species = species
if positions_file_path:
load_positions_list(positions_file_path, parameters)
# TODO: support more batch numbers + move this inside analyses?
haplotypes_file_path = os.path.join(input_dir, 'batch_0.haplotypes.tsv')
catalog_file_path = os.path.join(input_dir, 'batch_0.catalog.tags.tsv.gz')
individual_files_paths = [os.path.join(input_dir, f) for
f in os.listdir(input_dir) if
'tags' in f and 'catalog' not in f]
if input_dir:
haplotypes_file_path = os.path.join(input_dir, 'batch_0.haplotypes.tsv')
catalog_file_path = os.path.join(input_dir, 'batch_0.catalog.tags.tsv.gz')
individual_files_paths = [os.path.join(input_dir, f) for
f in os.listdir(input_dir) if
'tags' in f and 'catalog' not in f]
if analysis == 'heatmap':
loci_matrix(haplotypes_file_path, parameters)
......@@ -44,3 +50,5 @@ def analysis(input_dir=None,
stacks_privacy(catalog_file_path, parameters)
elif analysis == 'rescue':
rescue(sequences_file_path, catalog_file_path, individual_files_paths, coverage_file_path, parameters)
elif analysis == 'visualize':
visualization(input_file_path, popmap_file_path, output_file_path, parameters)
from radseq_analysis import visualization
def detect_plot_type(input_file_path):
input_file = open(input_file_path)
line = input_file.readline()
if line == 'Frequency\tCount\n':
return 'frequencies'
elif line.startswith('Stack_ID\tHaplotype_ID\tSequence\tMatches\tMismatches\tGaps'):
return 'clustering'
else:
temp = set()
fields = line[:-1].split('\t')
temp.add(len(fields))
loci_matrix = True
for f in fields:
try:
int(f)
except ValueError:
loci_matrix = False
for line in input_file:
if line:
fields = line[:-1].split('\t')
temp.add(len(fields))
for f in fields:
try:
int(f)
except ValueError:
loci_matrix = False
if loci_matrix and len(temp) == 1:
return 'haplotypes'
else:
return 'error'
def analysis(input_file_path, popmap_file_path, output_file_path, parameters):
plot_type = detect_plot_type(input_file_path)
if plot_type == 'haplotypes':
print('Generating plot from haplotypes matrix')
visualization.haplotypes(input_file_path,
output_file_path,
parameters.species)
elif plot_type == 'clustering':
print('Generating plot from alleles file for clustering')
visualization.clustering(input_file_path,
popmap_file_path,
output_file_path)
elif plot_type == 'frequencies':
print('Generating plot from frequencies file')
visualization.frequencies(input_file_path,
output_file_path)
elif plot_type == 'error':
print('\nError: could not detect input file type. Input file should be :\n' +
'\t- a haplotypes matrix generated by the heatmap command\n' +
'\t- an allele file generated by the rescue command\n' +
'\t- a frequencies file generated by the frequencies command\n')
from .loci_matrix import loci_matrix
from .sex_linked_haplotypes import sex_linked_haplotypes
from .stacks_privacy import stacks_privacy
from .stacks import stacks
from radseq_analysis.output.loci_matrix import loci_matrix
from radseq_analysis.output.sex_linked_haplotypes import sex_linked_haplotypes
from radseq_analysis.output.stacks_privacy import stacks_privacy
from radseq_analysis.output.stacks import stacks
......@@ -15,6 +15,7 @@ Command: heatmap\tGenerates a matrix of haplotypes sex distribution
\t haplotypes\tExtract haplotypes present in a given number of males and females
\t frequencies\tCalculate haplotypes frequencies distribution in the population
\t rescue\tRegroup stacks into alleles after analysis
\t visualize\tVisualize analyses results using R
'''
)
parser.add_argument('command', help='Command to run', nargs='?')
......@@ -134,6 +135,8 @@ Options: -i\t--input-folder\tPath to a folder containing the output of denovo_m
Options: -i\t--input-folder\tPath to a folder containing the output of denovo_map
\t -s\t--sequences\tPath to sequences file (result of haplotypes analysis)
\t -m\t--popmap\tPath to population map
\t -c\t--coverage-file\tPath to a coverage file (result of coverage analysis)
\t -o\t--output-file\tPath to output file (default: extracted_alleles.tsv)
''')
parser.add_argument('--input-folder', '-i',
......@@ -169,3 +172,40 @@ Options: -i\t--input-folder\tPath to a folder containing the output of denovo_m
output_file_path=args.output_file,
coverage_file_path=args.coverage_file,
analysis='rescue')
def visualize(self):
parser = argparse.ArgumentParser(
description='Visualize analyses results using R',
usage='''python3 radseq_analysis.py visualize -i input_file -o output_file -m popmap
Options: -i\t--input-file\tPath to a file generated by this pipeline (haplotype_matrix, frequencies, ...)
\t -o\t--output-file\tPath to output file or output directory
\t -m\t--popmap\tPath to population map
''')
parser.add_argument('--input-file', '-i',
help='Path to a file generated by this pipeline')
parser.add_argument('--output-file', '-o',
help='Path to output file / directory')
parser.add_argument('--popmap', '-m',
help='Path to a popmap file')
args = parser.parse_args(sys.argv[2:])
if not args.input_file or not os.path.isfile(args.input_file):
print('\nError: no valid input file specified\n')
parser.print_usage()
print()
exit(1)
if not args.output_file:
print('\nError: no valid output file specified\n')
parser.print_usage()
print()
exit(1)
if not args.popmap or not os.path.isfile(args.popmap):
print('\nError: no valid popmap file specified\n')
parser.print_usage()
print()
exit(1)
analysis(input_file_path=args.input_file,
popmap_file_path=args.popmap,
output_file_path=args.output_file,
analysis='visualize')
from .locus import Locus
from .commons import *
from .parameters import Parameters
from .stack import Stack
from .stack import Haplotype
from radseq_analysis.shared.locus import Locus
from radseq_analysis.shared.commons import *
from radseq_analysis.shared.parameters import Parameters
from radseq_analysis.shared.stack import Stack
from radseq_analysis.shared.stack import Haplotype
......@@ -13,7 +13,8 @@ class Parameters:
popmap={},
n_males=0,
n_females=0,
species=None
species=None,
m_value=None
):
self.files_dir = files_dir
......@@ -27,3 +28,4 @@ class Parameters:
self.n_males = n_males
self.n_females = n_females
self.species = species
self.m_value = m_value
#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)
# test if there is at least one argument: if not, return an error
if (length(args)==0) {
stop("At least one argument must be supplied (input file).n", call.=FALSE)
} else if (length(args) != 3){
stop("Usage: R haplotypes.R input_file.tsv output_dir treshold")
}
suppressMessages(library(readr))
suppressMessages(library(ggplot2))
suppressMessages(library(gridExtra))
file = args[1]
output_dir = args[2]
threshold = args[3]
png_name = paste("sex_variable_haplotypes.png", sep='')
data <- suppressMessages(read_delim(file, "\t", col_names = TRUE, escape_double = FALSE, trim_ws = TRUE))
names(data) = c('Locus', 'Male_haplotype', 'Male_haplotype_number',
'Female_haplotype', 'Female_haplotype_number',
'Males', 'Females', 'Male_outliers',
'Female_outliers', 'Consensus')
males = data.frame(table(data$Male_haplotype_number[which(data$Male_haplotype_number > threshold)], data$Female_haplotype_number[which(data$Male_haplotype_number > threshold)]))
names(males) = c('Males', 'Females', 'Count')
females = data.frame(table(data$Male_haplotype_number[which(data$Female_haplotype_number > threshold)], data$Female_haplotype_number[which(data$Female_haplotype_number > threshold)]))
names(females) = c('Males', 'Females', 'Count')
g <- ggplot(males, aes(x = Males, y = Count, fill = Females))
g = g + geom_bar(stat='identity', position=position_dodge(), colour='black')
h <- ggplot(females, aes(x = Females, y = Count, fill = Males))
h = h + geom_bar(stat='identity', position=position_dodge(), colour='black')