Commit 5d0bb623 authored by Romain Feron's avatar Romain Feron
Browse files

Updated process and distrib steps of the example, small fixes to other doc files to match

parent 4e450e4f
Example: running RADSex on a public *Oryzias latipes* RAD-Sequencing dataset
============================================================================
Detailed example
================
In this example, we will run RADSex on a public *Oryzias latipes* RAD-Sequencing dataset. We will detail each step of the process, highlight important details, and show how to use the R package ``radsex-vis`` to generate plots from the output of ``radsex``. This guide assumes that ``radsex`` and the ``radsex-vis`` package have already been installed. For specific instruction about installing ``radsex`` and ``radsex-vis``, check the :ref:`install-release` section. All reported times and resources usage were measured on a desktop computer with an Intel i7-8700K 4.7 GHz processor, 32 Gb of memory, and a standard 7200 RPM Hard Disk Drive.
Preparing the data
------------------
The RAD-Sequencing datasets used in this example are available on the Sequence Read Archive on NCBI. The reads were demultiplexed before being deposited on NCBI, and samples were grouped in two projects, males and females. The accession number for **female** samples is **SRS662264**, and the accession number for **male** samples is **SRS662265**. For convenience, females samples can be accessed `here <https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=2912074>`_ and male samples `here <https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=2912073>`_. This dataset was analyzed in `Wilson et al 2014 <http://www.genetics.org/content/early/2014/09/18/genetics.114.169284>`_.
The population map describing the sex of each samples for this dataset can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/population_map.tsv>`_.
The RAD-Sequencing datasets used in this example are available on the Sequence Read Archive on NCBI. The reads were demultiplexed before being deposited on NCBI, and samples were grouped in two projects, males and females. The accession number for **female** samples is **SRS662264**, and the accession number for **male** samples is **SRS662265**. For convenience, simple scripts to download male and female samples from the EBI ftp can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/download_female_samples.sh>`__ for female samples and `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/download_male_samples.sh>`_ for male samples. This dataset was published in `Wilson et al 2014 <http://www.genetics.org/content/early/2014/09/18/genetics.114.169284>`__.
The genome used for mapping markers with `radsex map` was that of a HSOK strain, NCBI accession number **GCA_002234695.1** (`link <https://www.ncbi.nlm.nih.gov/assembly/GCA_002234695.1>`_).
The population map describing the sex of each samples for this dataset can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/population_map.tsv>`__. The genome used for mapping markers with `radsex map` was that of a HSOK strain, NCBI accession number **GCA_002234695.1** (`link <https://www.ncbi.nlm.nih.gov/assembly/GCA_002234695.1>`_). The chromosomes names file used to display chromosomes with nicer names in the genome mapping plot can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/chromosomes_names.tsv>`__.
The chromosomes names file used to display chromosomes with nicer names in the genome mapping plot can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/chromosomes_names.tsv>`_.
.. note:: RADSex uses file names to generate individual IDs. Therefore, individual names in the population map should correspond to the file names without their extensions. Check the file names and population map provided above for an example of how to build the population map from file names. More details about the population map can be found in the :ref:`population-map` section.
From there, we will assume the following folder structure:
From now on, we will assume the following directory structure:
::
......@@ -25,62 +25,61 @@ From there, we will assume the following folder structure:
└─── popmap.tsv
Computing the coverage table
--------------
Generating a coverage table for the entire dataset
--------------------------------------------------
The first step of RADSex is to create a table containing the coverage of each marker in the dataset in each individual, with a marker defined as a non-polymorphic sequence (no mismatches or SNPs). To do so, we use the ``process`` command as follows :
The first step of RADSex is to create a table containing the coverage of each marker in each individual for the entire dataset; a marker is defined as a non-polymorphic sequence (no mismatches or SNPs). This step is performed with the ``process`` command :
::
radsex process --input-dir ./samples --output-file coverage_table.tsv --threads 16 --min-coverage 1
In this command, the ``--input-dir`` specifies the location of the samples folder, which is *samples* in our case. The ``--output-file`` parameter specifies the location of the output file (*i.e.* the table of coverage), which was set to *coverage_table.tsv* here.
radsex process --input-dir samples --output-file coverage_table.tsv --threads 8 --min-cov 1
The ``--input-dir`` parameter specifies the location of the demultiplexed reads directory, which is *samples* in our case. The ``--output-file`` parameter specifies the location of the output file (*i.e.* the table of coverage), which is *coverage_table.tsv* here. This step can be parallelized using the ``--threads`` parameters, which we set to *8* in our example; you should adjust this value based on your computer's specifications. Finally, the ``--min-cov`` parameter specifies the minimum coverage for a marker to be considered present in an individual; if a marker has a coverage lower than the value of ``--min-cov`` in every individual, it will not be retained in the coverage table.
In this example, demultiplexed reads are stored in **./samples** and the coverage table generated by ``process`` will be stored in **coverage_table.tsv**. The parameter ``--threads`` specifies the number of threads to use, and ``--min-coverage`` specifies the minimum coverage to consider a marker present in an individual: markers which are not present with coverage higher than this value in at least one individual will not be retained in the coverage table.
It is advised to keep the minimum coverage to 1 for this step, as it can be adjusted for each analysis later.
The resulting file *coverage_table.tsv* will be used as a base for all analyses implemented in ``radsex``, but it is not used for any ``radsex-vis`` plots. For more information about this file, check the :ref:`coverage-table-file` section.
The resulting file **coverage_table.tsv** is a table with N + 2 columns, where *N* is the number of individuals in the dataset :
.. note:: In most cases, we advise to keep the value of ``--min-cov`` to 1 in order to retain all the information from the dataset in this step. Filtering for minimum coverage should be done in the following analysis steps, and it will be easier to try several minimum coverage values this way. If you are certain that all individuals in your dataset have high coverage, and you do not plan to run analyses with a minimum coverage of 1, you can increase this threshold.
* **ID** : marker ID.
* **Sequence** : marker sequence.
* For each individual, the coverage of this marker.
With our setup, using 8 cores, this step completed in **9 min 25 seconds** with a peak memory usage of **10.3 GB**. The resulting coverage table used 5.1 GB of disk space.
Computing the distribution of markers between sexes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
---------------------------------------------------
After generating the coverage table, the ``distrib`` command is used to compute the distribution of markers between sexes:
The main analysis implemented in ``radsex`` computes a table summarizing the distribution of all markers between males and females. This analysis is performed with the ``distrib`` command:
::
radsex distrib --input-file coverage_table.tsv --output-file distribution.tsv --popmap-file popmap.tsv --min-coverage 5``
radsex distrib --input-file coverage_table.tsv --output-file distribution.tsv --popmap-file popmap.tsv --min-cov 5``
In this example, the input file ``--input-file`` is the coverage table generated in the :ref:`computing-cov-table` section, and the distribution of markers between sexes will be stored in **distribution.tsv**.
The sex of each individual in the population is given by **popmap.tsv** (see the [popmap section](#population-map) for details).
The minimum coverage to consider a marker present in an individual is set to 5, meaning that markers with coverage lower than 5 in an individual will not be considered present in this individual.
The ``--input-file`` parameter specifies the location of the coverage table generated in the previous step, which was *coverage_table.tsv* in our case. The ``--output-file`` parameter specifies the location of the output file, *i.e* the table of distribution of markers between males and females, which is *distribution.tsv* here. The ``--popmap`` parameter specifies the location of the population map (see the xx section for details), which we named *popmap.tsv* in this example. Finally, the ``--min-cov`` parameter specifies the minimum coverage to consider a marker present in an individual, and was set to *5* here.
The resulting file **distribution.tsv** is a table with five columns:
With our setup, this step completed in **36 seconds** with a peak memory usage of **4 Mb**.
* **Males** : number of males in which a marker was present.
* **Females** : number of females in which a marker was present.
* **Markers** : number of markers present in the corresponding number of males and females.
* **P** : p-value of a chi-squared test for association with sex.
* **Signif** : significant association with sex (True / False).
The resulting file *distribution.tsv* is a tabulated file described in the xxx section. This file can be visualized with ``radsex-vis`` using the ``plot_sex_distribution`` function:
::
radsex-vis::plot_sex_distribution(input_file_path = "distribution.tsv", output_file_path = "distribution.png")
To generate a basic plot, the only required parameter is ``input_file_path``, which should contain the full path to the distribution table (simplified in this example). The ``output_file_path`` parameters specifies the path to an output file where the figure will be saved; if this parameter is not specified, the plot will be generated in the default R graphic device. For a full description of the ``plot_sex_distribution`` function, including additional parameters, check the TODO_SECTION.
The resulting figure is displayed below:
.. image:: ../example/figures/distribution.png
More details about the distribution file can be found in the [TODO SECTION].
This figure is a tile plot with number of males on the x-axis and number of females on the y-axis. The color of a tile at coordinates (**x**, **y**) indicates the number of markers that were present in any **x** males and any **y** females. For instance, in this figure, there were between 25 and 99 markers found in 29 males (not necessarily always the same 29 males) and in 0 females. Tiles for which association with sex is significant (chi-squared test, using Bonferroni correction) are highlighted in red. Many markers found predominantly in males are significantly associated with sex, indicating that an XX/XY system determines sex in this species. Interestingly, there are no markers found in all males and absent from all females, *i.e* no markers found at positions
This distribution can be visualized with the ``plot_sex_distribution()`` function of `RADSex-vis <https://github.com/RomainFeron/RADSex-vis>`_, which generates a heatmap of markers with males on the x-axis and females on the y-axis.
Extracting markers significantly associated with sex
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
----------------------------------------------------
Markers significantly associated with sex can be obtained with the ``signif`` command:
::
radsex signif --input-file coverage_table.tsv --output-file markers.tsv --popmap-file popmap.tsv --min-coverage 5 [ --output-format fasta ]
radsex signif --input-file coverage_table.tsv --output-file markers.tsv --popmap-file popmap.tsv --min-cov 5 [ --output-format fasta ]
In this example, the input file ``--input-file`` is the coverage table generated in the :ref:`computing-cov-table` step, and the markers significantly associated with sex are outputed in **markers.tsv**. The sex of each individual in the population is given by **popmap.tsv** (see the [popmap section](#population-map) for details), and the minimum coverage to consider a sequence present in an individual is set to 5, meaning that markers with coverage lower than 5 in an individual will not be considered present in this individual.
......@@ -90,13 +89,13 @@ The coverage table generated by ``signif`` can be visualized with the ``plot_cov
Mapping markers to a reference genome
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-------------------------------------
The markers can be mapped to a reference genome using the ``map`` command:
::
radsex map --input-file coverage_table.tsv --output-file mapping.tsv --popmap-file popmap.tsv --genome-file genome.fasta --min-quality 20 --min-frequency 0.1 --min-coverage 5
radsex map --input-file coverage_table.tsv --output-file mapping.tsv --popmap-file popmap.tsv --genome-file genome.fasta --min-quality 20 --min-frequency 0.1 --min-cov 5
In this example, the input file ``--input-file`` is the coverage table generated in the :ref:`computing-cov-table` step, the mapping results will be stored in **sequences.tsv**, and the path to the reference genome file is given by ``--genome-file``. The sex of each individual in the population is given by **popmap.tsv** (see the [popmap section](#population-map) for details), and the minimum coverage to consider a marker present in an individual is set to 5, meaning that markers with coverage lower than 5 in an individual will not be considered present in this individual. The parameter ``--min-quality`` specifies the minimum mapping quality (as defined in `BWA <http://bio-bwa.sourceforge.net/bwa.shtml>`_) to consider a marker properly mapped, and is set to 20 in this example. The parameter ``--min-frequency`` specifies the minimum frequency of a marker in at least one sex; it is set to 0.1 here, meaning that only sequences present in at least 10% of individuals of one sex are retained for mapping.
......
......@@ -21,6 +21,7 @@ Documentation summary
:maxdepth: 2
getting_started
example
usage
input_files
output_files
......
......@@ -6,6 +6,8 @@ Reads files
RADSex accepts demultiplexed reads files as first input. RADSex should work with any demultiplexed RAD-sequencing reads files regardless of technology (single / double digest) or enzyme. Input files can be in fasta or fastq formats, and can be compressed. RADSex uses file extensions to detect input files, and supports the following extensions: **.fa**, **.fa.gz**, **.fq**, **.fq.gz**, **.fasta**, **.fasta.gz**, **.fastq**, **.fastq.gz**, **.fna**, and **.fna.gz**.
.. _population-map:
Population map
--------------
......@@ -19,8 +21,8 @@ A population map file is a tabulated file (TSV, tab as a separator) without head
individual_4 N
individual_5 F
Individual IDs can be anything, but it is important that they correspond to the name of the demultiplexed files.
For instance, the reads file for *individual_1* should be named `individual_1.fastq.gz` (in any format supported by your demultiplexer).
Individual IDs can be anything, but it is important that they correspond to the name of the demultiplexed files.
For instance, the reads file for *individual_1* should be named `individual_1.fastq.gz` (or any fasta/fastq format supported by your demultiplexer).
If you are using Stacks with a barcodes file for demultiplexing, just make sure that individual IDs in the barcodes file and in the population map are the same.
......@@ -28,8 +30,8 @@ If you are using Stacks with a barcodes file for demultiplexing, just make sure
Chromosomes names file
----------------------
Genome-wide results from the ``map`` command are visualized using the ``plot_genome()`` function of ``radsex-vis``.
This function can automatically detect chromosomes in the reference file if their name starts with 'LG' or 'chr' (case unsensitive). If this is not the case, you should provide a chromosomes names file to ``plot_genome()``.
Genome-wide results from the ``map`` command are visualized using the ``plot_genome()`` function of ``radsex-vis``.
This function can automatically detect chromosomes in the reference file if their name starts with 'LG' or 'chr' (case unsensitive). If this is not the case, you should provide a chromosomes names file to ``plot_genome()``.
This file should be a tabulated file without header, with scaffold ID in the reference in the first column and corresponding chromosome name in the second column.
An example of chromosomes names file is given below for the `Northern Pike <https://www.ncbi.nlm.nih.gov/genome/?term=esox%20lucius>`_ genome.
......@@ -62,4 +64,4 @@ An example of chromosomes names file is given below for the `Northern Pike <http
NC_025991.3 LG24
NC_025992.3 LG25
.. note:: Any scaffold included in the chromosomes names file will be considered a chromosome to be plotted as a sector. In most NCBI genomes, mitochondria are also named NC_XXX. As mitochondria are usually too small to be included as a sector in the circos plot, you should not add them to the chromosomes names file.
\ No newline at end of file
.. note:: Any scaffold included in the chromosomes names file will be considered a chromosome to be plotted as a sector. In most NCBI genomes, mitochondria are also named NC_XXX. As mitochondria are usually too small to be included as a sector in the circos plot, you should not add them to the chromosomes names file.
Output file formats
===================
.. _coverage-table-file:
Coverage table files
--------------------
Coverage tables are tabulated files with header generated by the ``process`` command for the entire dataset, and by the ``subset`` and ``signif`` commands for a subset of sequences. The first column contains the marker ID, and the second column contains the sequence itself. Each other column contains the coverage of the corresponding marker in a given individual.
An example of coverage table is given below (the sequence was shortened for visual reasons):
Coverage tables are tabulated files with header generated by the ``process`` command for the entire dataset, and by the ``subset`` and ``signif`` commands for a subset of sequences. The first column contains the marker ID, and the second column contains the sequence itself. Each other column contains the coverage of the corresponding marker in a given individual.
An example of coverage table is given below (sequences were shortened for practical reasons):
::
......@@ -16,13 +18,15 @@ An example of coverage table is given below (the sequence was shortened for visu
3 TGCA..CCGA 14 29 23 2 19
.. _sex-distribution-file:
Distribution of markers between sexes
---------------------------------------
**Table format**
A table of distribution of markers between sexes is a tabulated file with header generated by the ``distrib`` command.
The first and second columns indicate the number of males and females in which a marker is present, the third column contains the number of markers found in the corresponding number of males and females, the fourth column contains the p-value of a chi-squared test for association with sex on the number of males and females, and the fifth column indicates whether this p-value is significant after Bonferroni correction.
A table of distribution of markers between sexes is a tabulated file with header generated by the ``distrib`` command.
The first and second columns indicate the number of males and females in which a marker is present, the third column contains the number of markers found in the corresponding number of males and females, the fourth column contains the p-value of a chi-squared test for association with sex on the number of males and females, and the fifth column indicates whether this p-value is significant after Bonferroni correction.
An example of sex distribution table is given below for 3 males and 3 females:
......@@ -50,7 +54,7 @@ In this example, there are 68 sequences in total, therefore sequences are signif
**Matrix format**
The distribution of sequences between sexes can also be output as a matrix, which is a tabulated file without header, with number of females as rows and number of males as rows.
The distribution of sequences between sexes can also be output as a matrix, which is a tabulated file without header, with number of females as rows and number of males as rows.
The sex distribution matrix for the example described above is given below:
::
......@@ -61,6 +65,8 @@ The sex distribution matrix for the example described above is given below:
1 2 2 9
.. _fasta-file:
Fasta files
-----------
......@@ -75,13 +81,15 @@ In the ``signif`` analysis, another field containing the p-value of association
``<ID>_<number of males>M_<number of females>F_cov:<minimum coverage>_p:<p-value>``
.. _mapping-results-file:
Mapping results
---------------
Results from the ``map`` command are output as a tabulated file with header.
The first column contains the sequence ID, the second column contains the contig to which the sequence mapped in the reference genome, and the third columns contains the position where the sequence mapped on the contig.
The fourth column contains a sex-bias value, defined as `(number of males with the sequence) / (total number of males) - (number of females with the sequence) / (total number of females)`.
The fifth column contains the p-value of a chi-squared test for association with sex, and the sixth column indicates whether this p-value is significant after Bonferroni correction.
Results from the ``map`` command are output as a tabulated file with header.
The first column contains the sequence ID, the second column contains the contig to which the sequence mapped in the reference genome, and the third columns contains the position where the sequence mapped on the contig.
The fourth column contains a sex-bias value, defined as `(number of males with the sequence) / (total number of males) - (number of females with the sequence) / (total number of females)`.
The fifth column contains the p-value of a chi-squared test for association with sex, and the sixth column indicates whether this p-value is significant after Bonferroni correction.
An example of mapping results is given below:
::
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment