Commit 2783c62a authored by Romain Feron's avatar Romain Feron
Browse files

Updated doc with full usage and a quick start guide

parent 7fa84ee6
# RadSex
# RADSex
The RADSex pipeline is **currently under development** and has not been officially released yet. Missing features are been implemented, and some bugs are to be expected in this current development version. Please contact me by email or on Github, or open an issue if you encounter bugs or would like to discuss a feature !
## Overview
The RADSex pipeline implements several functions for the analysis of RAD-Sequencing data with focus on sex. This pipeline was developed for the PhyloSex project, which investigates sex determining factors in a wide range of fish species.
RADSex is a software package for the analysis of sex-determination using RAD-Sequencing data. The `process` function generates a data structure summarizing a set of demultiplexed RAD reads, and other functions use this data structure to infer information about the type of sex-determination system, identify sex-biased sequences, and map the RAD sequences to a reference genome. The results of RADSex are meant to be visualized with the `radsex-vis` R package, available here: https://github.com/INRA-LPGP/radsex-vis.
The RADSex pipeline was developed by Romain Feron and Yann Guiguen while working at INRA, Rennes, France.
This pipeline was developed for the PhyloSex project, which investigates sex determining factors in a wide range of fish species.
## Requirements
......@@ -14,11 +16,73 @@ The RADSex pipeline was developed by Romain Feron and Yann Guiguen while working
## Installation
- Clone: `git clone git@github.com:INRA-LPGP/RadSex.git`
- Alternative: Download the archive and unzip it
- *Alternative: download the archive and unzip it*
- Go to the RadSex directory (`cd RadSex`)
- Run `make`
- The compiled `radsex` binary is located in `RadSex/bin/`
## Quick start
#### Before starting
Before running the pipeline, you should prepare the following elements:
- A **set of demultiplexed reads**. The current version of RADSex does not implement demultiplexing; raw sequencing reads can be demultiplexed using [Stacks](http://catchenlab.life.illinois.edu/stacks/comp/process_radtags.php) or [pyRAD](http://nbviewer.jupyter.org/gist/dereneaton/af9548ea0e94bff99aa0/pyRAD_v.3.0.ipynb#The-seven-steps-described).
- A **population map** (popmap): a tabulated file with individual ID as the first column and sex as the second column. It is important that the individual IDs in the popmap are the same as the names of the demultiplexed reads files (see the population map section).
- If you want to map the sequences to a reference genome: a **reference genome** in fasta format. Note that when visualizing `mapping` results with `radsex-vis`, linkage groups / chromosomes are automatically inferred from scaffold names in the reference sequence if their name starts with *LG*, *chr*, or *chromosome* (case unsensitive). If chromosomes are named differently in the reference genome, you should prepare a tabulated file with reference scaffold ID in the first column and corresponding chromosome name in the second column (see the chromosomes names section)
#### Computing the coverage table
The first step of RADSex is to create a table of coverage for the dataset using the `process` command:
`radsex process -d ./samples -o coverage_table.tsv -t 16 -c 1`
In this example, demultiplexed reads are stored in `./samples` and the coverage table generated by `process` will be stored in `coverage_table.tsv`. The parameter `-t` specifies the number of threads to use.
The parameter `-c` specifies the minimum coverage value to consider a sequence present in an individual: sequences which are not present with coverage higher than this value in at least one individual will not be retained in the coverage table. It is advised to keep the minimum coverage to 1 for this step, as it can be adjusted for each analysis later.
#### Computing the distribution of sequences between sexes
After generating the coverage table, the `distrib` command is used to compute the distribution of sequences between sexes:
`radsex distrib -f coverage_table.tsv -o distribution.tsv -p popmap.tsv -c 5`
In this example, the input file `-f` is the coverage table generated in the [previous step](#computing-the-coverage-table), and the distribution of sequences between sexes will be stored in `distribution.tsv`. The sex of each individual in the population is given by `popmap.tsv` (see the popmap section for details). The minimum coverage to consider a sequence present in an individual is set to 5, meaning that sequences present with coverage (depth) lower than 5 in one individual will not be counted in this individual.
The resulting file `distribution.tsv` is a table with four columns:
- **Males** : number of males in which a sequence was present.
- **Females** : number of females in which a sequence was present.
- **Sequences** : number of sequences present in the corresponding number of males and females.
- **P** : p-value of a chi-squared test for association with sex.
This distribution can be visualized with the `plot_sex_distribution()` function of `radsex-vis`, which generates a [distribution heatmap](./examples/figures/sex_distribution.png).
#### Extracting sequences significantly associated with sex
Sequences significantly associated with sex can be obtained with the `signif` command:
`radsex signif -f coverage_table.tsv -o sequences.tsv -p popmap.tsv -c 5`
In this example, the input file `-f` is the coverage table generated in the [first step](#computing-the-coverage-table), and the sequences significantly associated with sex will be stored in `sequences.tsv`. The sex of each individual in the population is given by `popmap.tsv` (see the popmap section for details), and the minimum coverage to consider a sequence present in an individual is set to 5 (see the [previous section](#computing-the-distribution-of-sequences-between-sexes)). By default, the `signif` function exports a small coverage table; sequences can be exported to fasta using the `--fasta` parameter.
The coverage table generated by `signif` can be visualized with the `plot_coverage()` function of `radsex-vis`, which generates a [coverage heatmap](./examples/figures/coverage.png)
#### Mapping sequences to a reference genome
Sequences can be mapped to a reference genome using the `map` command:
`radsex map -f coverage_table.tsv -o mapping.tsv -p popmap.tsv -g genome.fasta -q 20 --min-frequency 0.1 -c 5`
In this example, the input file `-f` is the coverage table generated in the [first step](#computing-the-coverage-table), the mapping results will be stored in `sequences.tsv`, and the path to the reference genome file is given by `-g`. The sex of each individual in the population is given by `popmap.tsv` (see the popmap section for details), and the minimum coverage to consider a sequence present in an individual is set to 5 (see the [previous section](#computing-the-distribution-of-sequences-between-sexes)). The parameter `-q` specifies the minimum mapping quality (as defined in [BWA](http://bio-bwa.sourceforge.net/bwa.shtml)) to consider a sequence mapped (`-q`), here set to 20. The parameter `--min-frequency` specifies the minimum frequency of a sequence in at least one sex; it is set to 0.1 here, meaning that only sequences present in at least 10% of individuals of one sex are retained for mapping.
The resulting file `mapping.tsv` is a table with five columns:
- **Sequence :** ID of the mapped sequence
- **Contig :** ID of the contig where the sequence mapped
- **Position :** position of the mapped sequence on the contig
- **SexBias :** sex-bias of the mapped sequence, defined as (Males / Total males ) - (Females / Total females)
- **P :** p-value of a chi-squared test for association with sex
The mapping results generated by `map` can be visualized with the `plot_genome()` function of `radsex-vis`, which generates a [circular plot](./examples/figures/genome.png). Mapping results for a specific scaffold can be visualized with the `plot_scaffold()` function to generate a [linear plot](./examples/figures/scaffold.png).
## Usage
### General
......@@ -29,11 +93,11 @@ The RADSex pipeline was developed by Romain Feron and Yann Guiguen while working
Command | Description
------------------ | ------------
`process` | Compute a matrix of coverage from a set of demultiplexed reads files
`process` | Compute a table of coverage from a set of demultiplexed reads
`distrib` | Compute the distribution of sequences between sexes
`subset` | Extract a subset of the coverage matrix
`subset` | Extract a subset of the coverage table
`signif` | Extract sequences significantly associated with sex
`loci` | Recreate polymorphic loci from a subset of coverage matrix
`loci` | Recreate polymorphic loci from a subset of coverage table
`mapping` | Map a subset of sequences (coverage table or fasta) to a reference genome and output sex-association metrics for each mapped sequence
`freq` | Compute sequence frequencies for the population
......@@ -41,7 +105,7 @@ Command | Description
`radsex process -d input_dir_path -o output_file_path [ -t n_threads -c min_cov ]`
*Generates a matrix of coverage for all individuals and all sequences. The output is a tabulated file, where each line contains the ID, sequence and coverage for each individual of a marker.*
*Generates a table of coverage for all individuals and all sequences. The output is a tabulated file, where each line contains the ID, sequence and coverage for each individual of a sequence.*
**Options** :
......@@ -50,7 +114,7 @@ Option | Full name | Description
`-d` | `input_dir_path` | Path to a folder containing demultiplexed reads |
`-o``output_file_path` | Path to the output file |
`-t``n_threads` | Number of threads to use (default: 1) |
`-c``min_cov` | Minimum coverage to consider a marker in an individual (default: 1) |
`-c``min_cov` | Minimum coverage to consider a sequence in an individual (default: 1) |
### distrib
......@@ -61,32 +125,80 @@ Option | Full name | Description
**Options** :
Option | Full name | Description
--- | --- | ---
`-f` | `input_file_path` | Path to an coverage matrix obtained with `process` |
------ | --------- | -------------
`-f` | `input_file_path` | Path to an coverage table obtained with `process` |
`-o``output_file_path` | Path to the output file |
`-p``popmap_file_path` | Path to a popmap file indicating the sex of each individual |
`-c``min_cov` | Minimum coverage to consider a sequence present in an individual (default: 1) |
`--output-matrix``output_matrix` | If true, outputs the resutls as a matrix with males in columns and females in rows instead of a table (default: false) |
### Subset
### subset
`radsex subset -f input_file_path -o output_file_path -p popmap_file_path [ -c min_cov --min-males min_males --min-females min_females --max-males max_males --max-females max_females --min-individuals min_individuals --max-individuals max_individuals]`
*Filters the coverage matrix to only export sequences present in any combination of M males and F females, with min_males ≤ M ≤ max_males, min_females ≤ F ≤ max_females, and min_individuals ≤ M + F ≤ max_individuals*
*Filters the coverage table to only export sequences present in any combination of M males and F females, with min_males ≤ M ≤ max_males, min_females ≤ F ≤ max_females, and min_individuals ≤ M + F ≤ max_individuals*
**Options** :
Option | Full name | Description
--- | --- | ---
`-f` | `input_file_path` | Path to an coverage matrix obtained with `process` |
`-f` | `input_file_path` | Path to an coverage table obtained with `process` |
`-o``output_file_path` | Path to the output file |
`-p``popmap_file_path` | Path to a popmap file indicating the sex of each individual |
`-c``min_cov` | Minimum coverage to consider a sequence in an individual (default: 1) |
`--min-males``min_males` | Minimum number of males with the sequence |
`--min-females``min_females` | Minimum number of females with the sequence |
`--max-males``max_males` | Maximum number of males with the sequence |
`--max-females``max_females` | Maximum number of females with the sequence |
`--max-individuals``max_individuals` | Maximum number of individuals with the sequence |
`--max-individuals``max_individuals` | Maximum number of individuals with the sequence |
`-c``min_cov` | Minimum coverage to consider a sequence present in an individual (default: 1) |
`--min-males``min_males` | Minimum number of males with a retained sequence |
`--min-females``min_females` | Minimum number of females with a retained sequence |
`--max-males``max_males` | Maximum number of males with a retained sequence |
`--max-females``max_females` | Maximum number of females with a retained sequence |
`--max-individuals``max_individuals` | Maximum number of individuals with a retained sequence |
`--max-individuals``max_individuals` | Maximum number of individuals with a retained sequence |
### signif
`radsex signif -f input_file_path -o output_file_path -p popmap_file_path [ -c min_cov ]`
*Filters the coverage table to only export sequences significantly associated with sex, defined as sequences for which p < 0.05 (after Bonferroni correction), with p being the p-value of a chi-squared test on the numbers of males and females.*
**Options** :
Option | Full name | Description
--- | --- | ---
`-f` | `input_file_path` | Path to an coverage table obtained with `process` |
`-o``output_file_path` | Path to the output file |
`-p``popmap_file_path` | Path to a popmap file indicating the sex of each individual |
`-c``min_cov` | Minimum coverage to consider a sequence present in an individual (default: 1) |
### map
`radsex map -f input_file_path -o output_file_path -p popmap_file_path -g genome_file_path [ -c min_cov -q min_quality --min-frequency min_frequency ]`
*Maps the sequences from the coverage table to a reference genome and outputs mapping position, sex bias, and p-value of association with sex for each mapped sequence.*
**Options** :
Option | Full name | Description
--- | --- | ---
`-f` | `input_file_path` | Path to an coverage table obtained with `process` |
`-o``output_file_path` | Path to the output file |
`-p``popmap_file_path` | Path to a popmap file indicating the sex of each individual |
`-g``genome_file_path` | Path to a reference genome file in fasta format |
`-c``min_cov` | Minimum coverage to consider a sequence present in an individual (default: 1) |
`-q``min_quality` | Minimum mapping quality, as defined in BWA, to consider a sequence properly mapped (default: 20) |
`--min-frequency``min_frequency` | Minimum frequency in at least one sex for a sequence to be retained (default: 0.25) |
### freq
`radsex freq -f input_file_path -o output_file_path [ -c min_cov ]`
*Computes the sequences frequencies for the entire population*
**Options** :
Option | Full name | Description
--- | --- | ---
`-f` | `input_file_path` | Path to an coverage table obtained with `process` |
`-o``output_file_path` | Path to the output file |
`-c``min_cov` | Minimum coverage to consider a sequence present in an individual (default: 1) |
### LICENSE
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment