Commit d43e742b authored by RomainFeron's avatar RomainFeron
Browse files

Started writing example doc

parent 51ab418e
Example: running RADSex on a public *Oryzias latipes* RAD-Sequencing dataset
============================================================================
Preparing the data
------------------
The RAD-Sequencing datasets used in this example are available on the Sequence Read Archive on NCBI. The reads were demultiplexed before being deposited on NCBI, and samples were grouped in two projects, males and females. The accession number for **female** samples is **SRS662264**, and the accession number for **male** samples is **SRS662265**. For convenience, females samples can be accessed `here <https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=2912074>`_ and male samples `here <https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=2912073>`_. This dataset was analyzed in `Wilson et al 2014 <http://www.genetics.org/content/early/2014/09/18/genetics.114.169284>`_.
The population map describing the sex of each samples for this dataset can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/population_map.tsv>`_.
The genome used for mapping markers with `radsex map` was that of a HSOK strain, NCBI accession number **GCA_002234695.1** (`link <https://www.ncbi.nlm.nih.gov/assembly/GCA_002234695.1>`_).
The chromosomes names file used to display chromosomes with nicer names in the genome mapping plot can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/chromosomes_names.tsv>`_.
From there, we will assume the following folder structure:
::
.
├─── samples
| ├────── xxx.fastq.gz
| ├────── xxx.fastq.gz
| ├────── ...
| └────── xxx.fastq.gz
└─── popmap.tsv
Computing the coverage table
--------------
The first step of RADSex is to create a table containing the coverage of each marker in the dataset in each individual, with a marker defined as a non-polymorphic sequence (no mismatches or SNPs). To do so, we use the ``process`` command as follows :
::
radsex process --input-dir ./samples --output-file coverage_table.tsv --threads 16 --min-coverage 1
In this command, the ``--input-dir`` specifies the location of the samples folder, which is *samples* in our case. The ``--output-file`` parameter specifies the location of the output file (*i.e.* the table of coverage), which was set to *coverage_table.tsv* here.
In this example, demultiplexed reads are stored in **./samples** and the coverage table generated by ``process`` will be stored in **coverage_table.tsv**. The parameter ``--threads`` specifies the number of threads to use, and ``--min-coverage`` specifies the minimum coverage to consider a marker present in an individual: markers which are not present with coverage higher than this value in at least one individual will not be retained in the coverage table.
It is advised to keep the minimum coverage to 1 for this step, as it can be adjusted for each analysis later.
The resulting file **coverage_table.tsv** is a table with N + 2 columns, where *N* is the number of individuals in the dataset :
* **ID** : marker ID.
* **Sequence** : marker sequence.
* For each individual, the coverage of this marker.
Computing the distribution of markers between sexes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After generating the coverage table, the ``distrib`` command is used to compute the distribution of markers between sexes:
::
radsex distrib --input-file coverage_table.tsv --output-file distribution.tsv --popmap-file popmap.tsv --min-coverage 5``
In this example, the input file ``--input-file`` is the coverage table generated in the :ref:`computing-cov-table` section, and the distribution of markers between sexes will be stored in **distribution.tsv**.
The sex of each individual in the population is given by **popmap.tsv** (see the [popmap section](#population-map) for details).
The minimum coverage to consider a marker present in an individual is set to 5, meaning that markers with coverage lower than 5 in an individual will not be considered present in this individual.
The resulting file **distribution.tsv** is a table with five columns:
* **Males** : number of males in which a marker was present.
* **Females** : number of females in which a marker was present.
* **Markers** : number of markers present in the corresponding number of males and females.
* **P** : p-value of a chi-squared test for association with sex.
* **Signif** : significant association with sex (True / False).
More details about the distribution file can be found in the [TODO SECTION].
This distribution can be visualized with the ``plot_sex_distribution()`` function of `RADSex-vis <https://github.com/RomainFeron/RADSex-vis>`_, which generates a heatmap of markers with males on the x-axis and females on the y-axis.
Extracting markers significantly associated with sex
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Markers significantly associated with sex can be obtained with the ``signif`` command:
::
radsex signif --input-file coverage_table.tsv --output-file markers.tsv --popmap-file popmap.tsv --min-coverage 5 [ --output-format fasta ]
In this example, the input file ``--input-file`` is the coverage table generated in the :ref:`computing-cov-table` step, and the markers significantly associated with sex are outputed in **markers.tsv**. The sex of each individual in the population is given by **popmap.tsv** (see the [popmap section](#population-map) for details), and the minimum coverage to consider a sequence present in an individual is set to 5, meaning that markers with coverage lower than 5 in an individual will not be considered present in this individual.
By default, the ``signif`` function generates an output file in the same format as the coverage table. However, the sequences can be exported to fasta using the ``--output-format`` parameter (see TODO SECTION).
The coverage table generated by ``signif`` can be visualized with the ``plot_coverage()`` function of `RADSex-vis <https://github.com/RomainFeron/RADSex-vis>`_, which generates a heatmap showing the coverage of each sequence in each individual.
Mapping markers to a reference genome
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The markers can be mapped to a reference genome using the ``map`` command:
::
radsex map --input-file coverage_table.tsv --output-file mapping.tsv --popmap-file popmap.tsv --genome-file genome.fasta --min-quality 20 --min-frequency 0.1 --min-coverage 5
In this example, the input file ``--input-file`` is the coverage table generated in the :ref:`computing-cov-table` step, the mapping results will be stored in **sequences.tsv**, and the path to the reference genome file is given by ``--genome-file``. The sex of each individual in the population is given by **popmap.tsv** (see the [popmap section](#population-map) for details), and the minimum coverage to consider a marker present in an individual is set to 5, meaning that markers with coverage lower than 5 in an individual will not be considered present in this individual. The parameter ``--min-quality`` specifies the minimum mapping quality (as defined in `BWA <http://bio-bwa.sourceforge.net/bwa.shtml>`_) to consider a marker properly mapped, and is set to 20 in this example. The parameter ``--min-frequency`` specifies the minimum frequency of a marker in at least one sex; it is set to 0.1 here, meaning that only sequences present in at least 10% of individuals of one sex are retained for mapping.
The resulting file ``mapping.tsv`` is a table with five columns:
* **Sequence :** ID of the mapped sequence.
* **Contig :** ID of the contig where the sequence mapped.
* **Position :** position of the mapped sequence on the contig.
* **SexBias :** sex-bias of the mapped sequence, defined as (Males / Total males ) - (Females / Total females), where *Males* and *Females* are the number of males and number of females in which the sequence is present, respectively, and *Total males* and *Total females* are the total number of males and females in the population, respectively.
* **P :** p-value of a chi-squared test for association with sex.
* **Signif** : significant association with sex (True / False).
The mapping results generated by ``map`` can be visualized with the ``plot_genome()`` function of `RADSex-vis <https://github.com/RomainFeron/RADSex-vis>`_, which generates a circular plot with the sex-bias and association with sex of each marker mapped on the genome.
Mapping results for a specific contig can be visualized with the ``plot_scaffold()`` function to show the same metrics for a single contig.
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment