Commit e087075b authored by RomainFeron

Basic doc structure, getting started, and beginning of usage

parent cbe7e0bc
This diff is collapsed.
Getting started
===============
Installation
------------
Requirements
~~~~~~~~~~~~
* A C++11-compliant compiler (GCC >= 4.8.1, Clang >= 3.3)
* The zlib library (usually installed by default on Linux)
.. _install-release:
Installation
~~~~~~~~~~~~
RADSex can be installed from one of the release packages [TODO: link], or the latest stable development version can be installed directly from the GitHub repository.
**1. Install the latest release**
TODO
**2. Install from GitHub**
To install the latest stable version of RADSex from the GitHub repository, run the following commands:
::

    git clone https://github.com/RomainFeron/RadSex.git
    cd RadSex
    make

The compiled **radsex** binary will be located in **RadSex/bin/**.
Update RADSex
~~~~~~~~~~~~~
To update RADSex, you can download the latest stable release and install it as described in the :ref:`install-release` section.
If you installed RADSex from GitHub, run the following commands in the RadSex directory:
::

    git pull
    make rebuild

Before starting
---------------
Before running the pipeline, you should prepare the following elements:
* A **set of demultiplexed reads**. The current version of RADSex does not implement demultiplexing. Raw sequencing reads can be demultiplexed using `Stacks <http://catchenlab.life.illinois.edu/stacks/comp/process_radtags.php>`_ or `pyRAD <http://nbviewer.jupyter.org/gist/dereneaton/af9548ea0e94bff99aa0/pyRAD_v.3.0.ipynb#The-seven-steps-described>`_.
* A **population map**: a tab-separated file with the individual ID in the first column and the sex in the second column. The individual IDs in the popmap must match the names of the demultiplexed reads files (see the population map section for details).
* If you want to map the sequences to a reference genome: a **reference genome** in fasta format.
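The population map requirement above can be checked before running the pipeline. The following is a minimal sketch, not part of RADSex: it assumes that sex is coded as ``M`` / ``F`` and that reads files are named ``<individual ID>.fq.gz``, neither of which is specified here. The helper names ``read_popmap`` and ``missing_individuals`` are hypothetical.

```python
import csv
import io


def read_popmap(handle):
    """Parse a popmap: tab-separated, individual ID then sex."""
    popmap = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        popmap[row[0]] = row[1]
    return popmap


def missing_individuals(popmap, read_file_names):
    """Return popmap IDs that have no matching reads file.

    Assumption: the individual ID is the file name up to the first dot.
    """
    stems = {name.split(".", 1)[0] for name in read_file_names}
    return sorted(set(popmap) - stems)


# Hypothetical popmap content and reads file names.
popmap_text = "individual_1\tM\nindividual_2\tF\nindividual_3\tF\n"
popmap = read_popmap(io.StringIO(popmap_text))
files = ["individual_1.fq.gz", "individual_2.fq.gz"]
print(missing_individuals(popmap, files))
```

An empty list means every individual in the popmap has a reads file with a matching name.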
.. note:: When visualizing ``map`` results with ``radsex-vis``, linkage groups / chromosomes are automatically inferred from scaffold names in the reference sequence if their name starts with *LG*, *CHR*, or *NC* (case-insensitive). If chromosomes are named differently in the reference genome, you should prepare a tab-separated file with the reference scaffold ID in the first column and the corresponding chromosome name in the second column (see the chromosomes names section for details).
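The naming rule in the note above can be illustrated as follows. This is only a sketch of the rule as stated (prefix *LG*, *CHR*, or *NC*, ignoring case), not the actual ``radsex-vis`` code; the function names and the scaffold names are hypothetical.

```python
def is_chromosome(scaffold_name):
    """True if the name starts with LG, CHR, or NC, ignoring case."""
    return scaffold_name.upper().startswith(("LG", "CHR", "NC"))


def read_chromosome_names(lines):
    """Parse a chromosomes names file: scaffold ID, then chromosome name."""
    names = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        scaffold, chromosome = line.split("\t")[:2]
        names[scaffold] = chromosome
    return names


# Hypothetical scaffold names.
for name in ["lg01", "Chr3", "NC_012345.1", "scaffold_12"]:
    print(name, is_chromosome(name))

# Hypothetical chromosomes names file for a genome with other naming.
names = read_chromosome_names(["scaffold_12\tChromosome 1\n"])
print(names)
```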
Running RADSex
--------------
.. _computing-cov-table:
Computing the coverage table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The first step of RADSex is to create a table of coverage for the dataset using the ``process`` command:
::

    radsex process --input-dir ./samples --output-file coverage_table.tsv --threads 16 --min-coverage 1

In this example, demultiplexed reads are stored in **./samples** and the coverage table generated by ``process`` will be stored in **coverage_table.tsv**. The parameter ``--threads`` specifies the number of threads to use, and ``--min-coverage`` specifies the minimum coverage to consider a marker present in an individual: markers that do not reach this coverage in at least one individual are not retained in the coverage table.
It is advised to keep the minimum coverage at 1 for this step, as it can be adjusted for each analysis later.
The resulting file **coverage_table.tsv** is a table with *N* + 2 columns, where *N* is the number of individuals in the dataset:
* **ID**: marker ID.
* **Sequence**: marker sequence.
* One column per individual, giving the coverage of the marker in that individual.
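The layout described above (ID, sequence, then one coverage column per individual) can be read with a few lines of Python. This is a minimal sketch that assumes a tab-separated file with a single header line; ``read_coverage_table`` is a hypothetical helper, and the rows are illustrative.

```python
import csv
import io


def read_coverage_table(handle):
    """Return (individuals, markers) from a tab-separated coverage table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    individuals = header[2:]  # columns after ID and Sequence
    markers = []
    for row in reader:
        coverage = dict(zip(individuals, (int(value) for value in row[2:])))
        markers.append({"id": row[0], "sequence": row[1], "coverage": coverage})
    return individuals, markers


# Illustrative table with two individuals.
table_text = (
    "ID\tSequence\tindividual_1\tindividual_2\n"
    "0\tTGCA..TATT\t0\t15\n"
    "1\tTGCA..GACC\t20\t18\n"
)
individuals, markers = read_coverage_table(io.StringIO(table_text))
print(individuals)
print(markers[0]["coverage"])
```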
Computing the distribution of sequences between sexes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After generating the coverage table, the ``distrib`` command is used to compute the distribution of sequences between sexes:
::

    radsex distrib --input-file coverage_table.tsv --output-file distribution.tsv --popmap-file popmap.tsv --min-coverage 5

In this example, the input file ``--input-file`` is the coverage table generated in the :ref:`computing-cov-table` section, and the distribution of sequences between sexes will be stored in **distribution.tsv**.
The sex of each individual in the population is given by **popmap.tsv** (see the population map section for details).
The minimum coverage to consider a sequence present in an individual is set to 5, meaning that sequences with coverage lower than 5 in an individual will not be considered present in this individual.
The resulting file **distribution.tsv** is a table with five columns:
* **Males** : number of males in which a sequence was present.
* **Females** : number of females in which a sequence was present.
* **Sequences** : number of sequences present in the corresponding number of males and females.
* **P** : p-value of a chi-squared test for association with sex.
* **Signif** : significant association with sex (True / False).
This distribution can be visualized with the ``plot_sex_distribution()`` function of `RADSex-vis <https://github.com/RomainFeron/RADSex-vis>`_, which generates a heatmap of sequences with males on the x-axis and females on the y-axis.
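The counting behind the **Males**, **Females**, and **Sequences** columns can be sketched as follows. This is an illustration of the description above, not the RADSex implementation. It assumes sex is coded ``M`` / ``F``, that a sequence is present when its coverage is at least the minimum coverage, and that sequences present in no individual are skipped.

```python
from collections import Counter


def sex_distribution(coverages, popmap, min_coverage):
    """Count sequences for each (number of males, number of females) pair.

    coverages: one dict per sequence, mapping individual ID to coverage.
    """
    distribution = Counter()
    for coverage in coverages:
        present = [ind for ind, cov in coverage.items() if cov >= min_coverage]
        males = sum(1 for ind in present if popmap.get(ind) == "M")
        females = sum(1 for ind in present if popmap.get(ind) == "F")
        if males + females > 0:
            distribution[(males, females)] += 1
    return distribution


# Hypothetical popmap and coverage values for four sequences.
popmap = {"ind_1": "M", "ind_2": "M", "ind_3": "F", "ind_4": "F", "ind_5": "F"}
coverages = [
    {"ind_1": 0, "ind_2": 15, "ind_3": 24, "ind_4": 17, "ind_5": 21},
    {"ind_1": 20, "ind_2": 18, "ind_3": 3, "ind_4": 26, "ind_5": 4},
    {"ind_1": 2, "ind_2": 1, "ind_3": 5, "ind_4": 16, "ind_5": 0},
    {"ind_1": 14, "ind_2": 29, "ind_3": 23, "ind_4": 2, "ind_5": 19},
]
distribution = sex_distribution(coverages, popmap, min_coverage=5)
for (males, females), sequences in sorted(distribution.items()):
    print(males, females, sequences)
```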
Extracting sequences significantly associated with sex
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sequences significantly associated with sex can be obtained with the ``signif`` command:
::

    radsex signif --input-file coverage_table.tsv --output-file sequences.tsv --popmap-file popmap.tsv --min-coverage 5 [ --output-format fasta ]

In this example, the input file ``--input-file`` is the coverage table generated in the :ref:`computing-cov-table` step, and the sequences significantly associated with sex are written to **sequences.tsv**. The sex of each individual in the population is given by **popmap.tsv** (see the population map section for details), and the minimum coverage to consider a sequence present in an individual is set to 5, meaning that sequences with coverage lower than 5 in an individual will not be considered present in this individual.
By default, the ``signif`` command generates an output file in the same format as the coverage table. However, sequences can be exported to fasta using the ``--output-format`` parameter.
The coverage table generated by ``signif`` can be visualized with the ``plot_coverage()`` function of `RADSex-vis <https://github.com/RomainFeron/RADSex-vis>`_, which generates a heatmap showing the coverage of each sequence in each individual.
Mapping sequences to a reference genome
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sequences can be mapped to a reference genome using the ``map`` command:
::

    radsex map --input-file coverage_table.tsv --output-file mapping.tsv --popmap-file popmap.tsv --genome-file genome.fasta --min-quality 20 --min-frequency 0.1 --min-coverage 5

In this example, the input file ``--input-file`` is the coverage table generated in the :ref:`computing-cov-table` step, the mapping results will be stored in **mapping.tsv**,
and the path to the reference genome file is given by ``--genome-file``. The sex of each individual in the population is given by **popmap.tsv** (see the population map section for details),
and the minimum coverage to consider a sequence present in an individual is set to 5, meaning that sequences with coverage lower than 5 in an individual will not be considered present in this individual. The parameter ``--min-quality`` specifies the minimum mapping quality (as defined in `BWA <http://bio-bwa.sourceforge.net/bwa.shtml>`_) to consider a sequence properly mapped, and is here set to 20. The parameter ``--min-frequency`` specifies the minimum frequency of a sequence in at least one sex; it is set to 0.1 here, meaning that only sequences present in at least 10% of individuals of one sex are retained for mapping.
The resulting file **mapping.tsv** is a table with six columns:
* **Sequence**: ID of the mapped sequence.
* **Contig**: ID of the contig where the sequence mapped.
* **Position**: position of the mapped sequence on the contig.
* **SexBias**: sex-bias of the mapped sequence, defined as (Males / Total males) - (Females / Total females), where *Males* and *Females* are the number of males and number of females in which the sequence is present, respectively, and *Total males* and *Total females* are the total number of males and females in the population, respectively.
* **P**: p-value of a chi-squared test for association with sex.
* **Signif**: significant association with sex (True / False).
The mapping results generated by ``map`` can be visualized with the ``plot_genome()`` function of `RADSex-vis <https://github.com/RomainFeron/RADSex-vis>`_, which generates a circular plot with the sex-bias and association with sex of each marker mapped on the genome.
Mapping results for a specific contig can be visualized with the ``plot_scaffold()`` function to show the same metrics for a single contig.
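The **SexBias** formula and the ``--min-frequency`` filter described in this section can be written out directly. The sketch below follows the definitions given above; the function names are hypothetical and the numbers are made up for illustration.

```python
def sex_bias(males, females, total_males, total_females):
    """(Males / Total males) - (Females / Total females)."""
    return males / total_males - females / total_females


def passes_min_frequency(males, females, total_males, total_females,
                         min_frequency):
    """True if the sequence reaches min_frequency in at least one sex."""
    return (males / total_males >= min_frequency
            or females / total_females >= min_frequency)


# A sequence found in all 10 males and no female is fully male-biased.
print(sex_bias(10, 0, 10, 10))
# A sequence found in 1 of 20 males (5%) is dropped at --min-frequency 0.1.
print(passes_min_frequency(1, 0, 20, 20, 0.1))
```

With this definition the sex-bias ranges from -1 (present in all females and no male) to 1 (present in all males and no female).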
Installation
===============
Requirements
------------
* A C++11-compliant compiler (GCC >= 4.8.1, Clang >= 3.3)
* The zlib library (usually installed by default on Linux)
Installation
------------
RADSex can be installed from one of the release packages [TODO: link], or the latest stable development version can be installed directly from the GitHub repository.
.. _install-release:
Install the latest release
~~~~~~~~~~~~~~~~~~~~~~~~~~
TODO
Install from GitHub
~~~~~~~~~~~~~~~~~~~
To install the latest stable version of RADSex from the GitHub repository, run the following commands:
::

    git clone https://github.com/RomainFeron/RadSex.git
    cd RadSex
    make

The compiled **radsex** binary will be located in **RadSex/bin/**.
Update RADSex
-------------
To update RADSex, you can download the latest stable release and install it as described in :ref:`install-release`.
If you installed RADSex from GitHub, run the following commands in the RadSex directory:
::

    git pull
    make rebuild

......@@ -3,18 +3,25 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to RADSex's documentation!
==================================
RADSex documentation
====================
.. toctree::
:maxdepth: 2
:caption: Contents:
RADSex is a software package to analyze RAD-Sequencing data. It is primarily designed to look for a sex signal, but it can be used to compare any two populations.
The core idea of RADSex is to compare presence / absence of non-polymorphic markers between individuals in two populations. In the case of a polymorphic RAD-Sequencing marker, each allele is represented as a separate marker, whereas other RAD-Sequencing analysis software usually groups alleles in a single polymorphic marker. Separating alleles from polymorphic markers enables RADSex to easily detect sex-specific alleles using a single, simple parameter.
RADSex uses a dataset of demultiplexed RAD reads to generate a table of coverage for each sequence in each individual. This table is then used to infer information about the type of sex-determination system, identify sex-biased sequences, map the RAD sequences to a reference genome, and recover potential polymorphic sex-biased markers.
Results from RADSex can be visualized with the ``radsex-vis`` R package, available here: https://github.com/INRA-LPGP/radsex-vis. The ``radsex-vis`` R package provides easy-to-use functions to generate visual representations of your data.
Indices and tables
==================
Documentation summary
---------------------
.. toctree::
:maxdepth: 2
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
getting_started
usage
input_files
output_files
license
Input file formats
==================
Population map
--------------
Chromosomes names file
----------------------
This diff is collapsed.
getting_started/installation
getting_started/before-starting
getting_started/computing-coverage-table
getting_started/computing-distrib
getting_started/extract-sex-sequences
getting_started/map-sex-sequences
.. _full-usage:
.. toctree::
:maxdepth: 2
:caption: RADSex usage details
usage/general
usage/process
usage/distrib
usage/subset
usage/signif
usage/map
usage/freq
.. _input-file-formats:
.. toctree::
:maxdepth: 2
:glob:
:caption: Input files information
input_files/population-map
input_files/chromosomes-names
.. _output-file-formats:
.. toctree::
:maxdepth: 2
:caption: Output files information
output_files/coverage-table
output_files/fasta-file
output_files/sex-distribution
output_files/mapping
.. _radsex-vis-figures:
.. toctree::
:maxdepth: 2
:caption: RADSex figures examples
figures/sex-distribution
figures/clustering
figures/mapping-genome
figures/mapping-contig
.. _example-cases:
.. toctree::
:maxdepth: 2
:caption: Examples of RADSex output and interpretation
examples/xy_simple
examples/zw_simple
examples/xy_outliers
examples/zw_outliers
examples/xy_multiple_alleles
.. _extra:
.. toctree::
:maxdepth: 2
:caption: Extra
license
Output file formats
===================
Coverage table files
--------------------
Distribution of sequences between sexes
---------------------------------------
Fasta files
-----------
Mapping results
---------------
RADSex usage details
====================
General
-------
RADSex has the following general command-line interface:
``radsex <command> [options]``
**Available commands**
======= ===========
Command Description
======= ===========
process Compute a table of coverage from a set of demultiplexed reads
distrib Compute the distribution of sequences between sexes
subset  Extract a subset of the coverage table
signif  Extract sequences significantly associated with sex
loci    Recreate polymorphic loci from a subset of coverage table
map     Map a subset of sequences (coverage table or fasta) to a reference genome and output sex-association metrics for each mapped sequence
freq    Compute sequence frequencies for the population
======= ===========
process
-------
**Command**
::

    radsex process --input-dir input_dir_path --output-file output_file_path [ --threads n_threads --min-coverage min_cov ]

The ``process`` command generates a table showing the coverage of each marker in each individual of the dataset. The output is a tab-separated file in which each line contains the ID and sequence of a marker, followed by its coverage in each individual.
**Options**
================== ===========
Option             Description
================== ===========
``--input-dir``    Path to a folder containing demultiplexed reads
``--output-file``  Path to the output file
``--threads``      Number of threads to use (default: 1)
``--min-coverage`` Minimum coverage to consider a sequence in an individual (default: 1)
================== ===========
**Sample output**
::

    ID  Sequence    individual_1  individual_2  individual_3  individual_4  individual_5
    0   TGCA..TATT  0             15            24            17            21
    1   TGCA..GACC  20            18            3             26            4
    2   TGCA..ATCG  2             1             5             16            0
    3   TGCA..CCGA  14            29            23            2             19

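Conceptually, a coverage table of this shape can be obtained by counting how many times each distinct sequence occurs in each individual. The sketch below illustrates that idea on reads held in memory. It is an assumption about the principle, not the RADSex implementation, which reads demultiplexed files and may differ in marker ordering and ID assignment.

```python
from collections import Counter


def coverage_table(reads_by_individual, min_coverage=1):
    """Build rows of (marker ID, sequence, coverage per individual).

    A sequence is kept only if it reaches min_coverage in at least one
    individual. IDs are assigned in sorted sequence order (an arbitrary
    choice made for this sketch).
    """
    individuals = sorted(reads_by_individual)
    counts = {ind: Counter(reads) for ind, reads in reads_by_individual.items()}
    sequences = sorted(set().union(*counts.values()))
    rows = []
    for sequence in sequences:
        coverage = [counts[ind][sequence] for ind in individuals]
        if max(coverage) >= min_coverage:
            rows.append((len(rows), sequence, coverage))
    return individuals, rows


# Hypothetical reads for two individuals.
reads = {
    "individual_1": ["ACGT", "ACGT", "TTTT"],
    "individual_2": ["ACGT"],
}
individuals, rows = coverage_table(reads)
print("ID", "Sequence", *individuals, sep="\t")
for marker_id, sequence, coverage in rows:
    print(marker_id, sequence, *coverage, sep="\t")
```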
distrib
-------
**Command**
::

    radsex distrib --input-file input_file_path --output-file output_file_path --popmap-file popmap_file_path [ --min-coverage min_cov --output-matrix ]

The ``distrib`` command generates a table containing the number of sequences present with coverage of at least ``min_cov`` in *M* males and *F* females, for every combination of number of males *M* and number of females *F*.
**Options**
=================== ===========
Option              Description
=================== ===========
``--input-file``    Path to a coverage table obtained with ``process``
``--output-file``   Path to the output file
``--popmap-file``   Path to a popmap file indicating the sex of each individual
``--min-coverage``  Minimum coverage to consider a sequence in an individual (default: 1)
``--output-matrix`` If true, outputs the results as a matrix with males in columns and females in rows instead of a table (default: 0)
=================== ===========
**Sample output**
::

    Males  Females  Sequences  P     Signif
    0      1        7          1     False
    0      2        3          0.39  False
    0      3        1          0.10  False
    1      0        6          1     False
    1      1        5          1     False
    1      2        1          1     False
    1      3        2          0.39  False
    2      0        3          0.39  False
    2      1        8          1     False
    2      2        4          1     False
    2      3        2          1     False
    3      0        4          0.10  False
    3      1        7          0.39  False
    3      2        6          1     False
    3      3        9          1     False

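The **P** column can be approximated with a chi-squared test on the 2 x 2 table of presence / absence by sex. The sketch below assumes Yates' continuity correction, which is not stated in this documentation; under that assumption, and with 3 males and 3 females in total, it gives the same rounded p-values as the sample output above (for example 0.10 for 0 males and 3 females). The function name is hypothetical.

```python
import math


def chi_squared_p(males, females, total_males, total_females):
    """P-value of a Yates-corrected chi-squared test (1 degree of freedom)."""
    a, b = males, total_males - males          # males: present, absent
    c, d = females, total_females - females    # females: present, absent
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 1.0  # a margin is empty: no evidence of association
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    statistic = n * corrected ** 2 / denominator
    # Survival function of the chi-squared distribution with 1 d.o.f.
    return math.erfc(math.sqrt(statistic / 2))


for males, females in [(0, 1), (0, 2), (0, 3), (3, 3)]:
    print(males, females, round(chi_squared_p(males, females, 3, 3), 2))
```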
subset
------
**Command**
::

    radsex subset --input-file input_file_path --output-file output_file_path --popmap-file popmap_file_path [ --output-format output_format --min-coverage min_cov --min-males min_males --min-females min_females --max-males max_males --max-females max_females --min-individuals min_individuals --max-individuals max_individuals ]

The ``subset`` command filters the coverage table to only export sequences present in any combination of M males and F females, with min_males ≤ M ≤ max_males, min_females ≤ F ≤ max_females, and min_individuals ≤ M + F ≤ max_individuals.
**Options**
===================== ===========
Option                Description
===================== ===========
``--input-file``      Path to a coverage table obtained with ``process``
``--output-file``     Path to the output file
``--popmap-file``     Path to a popmap file indicating the sex of each individual
``--output-format``   Output format, either "table" or "fasta" (default: "table")
``--min-coverage``    Minimum coverage to consider a sequence present in an individual (default: 1)
``--min-males``       Minimum number of males with a retained sequence (default: 0)
``--min-females``     Minimum number of females with a retained sequence (default: 0)
``--max-males``       Maximum number of males with a retained sequence (default: all)
``--max-females``     Maximum number of females with a retained sequence (default: all)
``--min-individuals`` Minimum number of individuals with a retained sequence (default: 1)
``--max-individuals`` Maximum number of individuals with a retained sequence (default: all)
===================== ===========
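The filter applied by ``subset`` can be expressed as a single predicate on the number of males *M* and females *F* in which a sequence is present. The sketch below follows the inequalities and defaults listed above; ``keep_sequence`` is a hypothetical name, and "all" is interpreted as the total number of males, females, or individuals in the popmap.

```python
def keep_sequence(males, females, total_males, total_females,
                  min_males=0, min_females=0,
                  max_males=None, max_females=None,
                  min_individuals=1, max_individuals=None):
    """True if a sequence present in `males` and `females` is retained."""
    if max_males is None:
        max_males = total_males
    if max_females is None:
        max_females = total_females
    if max_individuals is None:
        max_individuals = total_males + total_females
    return (min_males <= males <= max_males
            and min_females <= females <= max_females
            and min_individuals <= males + females <= max_individuals)


# Keep only sequences found in all 3 males and in no female.
print(keep_sequence(3, 0, 3, 3, min_males=3, max_females=0))
print(keep_sequence(2, 0, 3, 3, min_males=3, max_females=0))
```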
**Sample output**
* Table format:
::

    ID  Sequence    individual_1  individual_2  individual_3  individual_4  individual_5
    15  TGCA..TATT  0             15            24            17            21
    27  TGCA..GACC  20            18            3             26            4
    43  TGCA..ATCG  2             1             5             16            0
    86  TGCA..CCGA  14            29            23            2             19

* FASTA format:
::

    >15_5_0_cov:5
    TGCA..TATT
    >15_5_1_cov:5
    TGCA..GACC
    >15_5_1_cov:5
    TGCA..ATCG
    >15_5_0_cov:5
    TGCA..CCGA

signif
------
map
----
freq
----