Commit 6d0d5c93 authored by Romain Feron

Removed sphinx html output

parent 27689d14
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 9c713b4ae7090151a66ad61a77657cf4
tags: 645f666f9bcd5a90fca523b33c5a78b7
Example walkthrough
===================
In this example, we will run RADSex on a public *Oryzias latipes* RAD-Sequencing dataset. We will detail each step of the process, highlight important details, and show how to use the R package ``radsex-vis`` to generate plots from the output of ``radsex``. This guide assumes that ``radsex`` and the ``radsex-vis`` package have already been installed. For specific instructions on installing ``radsex`` and ``radsex-vis``, check the :ref:`install-release` section. All reported times and resource usage were measured on a desktop computer with an Intel i7-8700K 4.7 GHz processor, 32 GB of memory, and a standard 7200 RPM hard disk drive. The input data, results (except for ``process``), and figures can be found in the *example* directory.
Preparing the data
------------------
The RAD-Sequencing datasets used in this example are available on the Sequence Read Archive on NCBI. The reads were demultiplexed before being deposited on NCBI, and samples were grouped in two projects, males and females. The accession number for **female** samples is **SRS662264**, and the accession number for **male** samples is **SRS662265**. For convenience, simple scripts to download male and female samples from the EBI FTP can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/download_female_samples.sh>`__ for female samples and `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/download_male_samples.sh>`__ for male samples. This dataset was published in `Wilson et al. 2014 <http://www.genetics.org/content/early/2014/09/18/genetics.114.169284>`__.
The population map describing the sex of each sample in this dataset can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/population_map.tsv>`__. The genome used for mapping markers with ``radsex map`` was that of an HSOK strain, NCBI accession number **GCA_002234695.1** (`link <https://www.ncbi.nlm.nih.gov/assembly/GCA_002234695.1>`__). The chromosomes names file used to display chromosomes with nicer names in the genome mapping plot can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/chromosomes_names.tsv>`__.
.. note:: RADSex uses file names to generate individual IDs. Therefore, individual names in the population map should correspond to the file names without their extensions. Check the file names and population map provided above for an example of how to build the population map from file names. More details about the population map can be found in the :ref:`population-map` section.
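The correspondence between file names and population map IDs described in the note above can be checked before running ``radsex``. The following is a minimal Python sketch, not part of RADSex: the function names ``individual_id`` and ``check_popmap`` are hypothetical, and the extension list is copied from the reads files section of this documentation.

```python
# Extensions listed as supported in the "Reads files" section of this documentation.
SUPPORTED_EXTENSIONS = (
    ".fasta.gz", ".fastq.gz", ".fna.gz", ".fa.gz", ".fq.gz",
    ".fasta", ".fastq", ".fna", ".fa", ".fq",
)


def individual_id(file_name):
    """Return the file name without its reads extension, or None if unsupported."""
    for extension in SUPPORTED_EXTENSIONS:
        if file_name.endswith(extension):
            return file_name[: -len(extension)]
    return None


def check_popmap(file_names, popmap_ids):
    """Return (IDs found only in files, IDs found only in the popmap)."""
    file_ids = {individual_id(name) for name in file_names} - {None}
    popmap_ids = set(popmap_ids)
    return sorted(file_ids - popmap_ids), sorted(popmap_ids - file_ids)


# Hypothetical example: one reads file has no popmap entry, and vice versa.
only_in_files, only_in_popmap = check_popmap(["a.fq", "b.fa.gz"], ["a", "c"])
print(only_in_files, only_in_popmap)
```

Both returned lists should be empty when the population map and the reads directory agree.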
From now on, we will assume the following directory structure:
::
.
├─── samples
| ├────── xxx.fastq.gz
| ├────── xxx.fastq.gz
| ├────── ...
| └────── xxx.fastq.gz
├─── chromosomes_names.tsv
├─── genome.fasta
└─── popmap.tsv
Generating a coverage table for the entire dataset
--------------------------------------------------
The first step of RADSex is to create a table containing the coverage of each marker in each individual for the entire dataset; a marker is defined as a non-polymorphic sequence (no mismatches or SNPs). This step is performed with the ``process`` command:
::
radsex process --input-dir samples --output-file coverage_table.tsv --threads 8 --min-cov 1
The ``--input-dir`` parameter specifies the location of the demultiplexed reads directory, which is *samples* in our case. The ``--output-file`` parameter specifies the location of the output file (*i.e.* the table of coverage), which is *coverage_table.tsv* here. This step can be parallelized using the ``--threads`` parameter, which we set to *8* in our example; you should adjust this value based on your computer's specifications. Finally, the ``--min-cov`` parameter specifies the minimum coverage for a marker to be considered present in an individual; if a marker has a coverage lower than the value of ``--min-cov`` in every individual, it will not be retained in the coverage table.
The resulting file *coverage_table.tsv* will be used as a base for all analyses implemented in ``radsex``, but it is not used for any ``radsex-vis`` plots. For more information about this file, check the :ref:`coverage-table-file` section.
.. note:: In most cases, we advise keeping ``--min-cov`` at 1 in order to retain all the information from the dataset in this step. Filtering for minimum coverage should be done in the following analysis steps, which makes it easier to try several minimum coverage values. If you are certain that all individuals in your dataset have high coverage, and you do not plan to run analyses with a minimum coverage of 1, you can increase this threshold.
With our setup, using 8 cores, this step completed in **9 min 25 seconds** with a peak memory usage of **10.3 GB**. The resulting coverage table used 5.1 GB of disk space.
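The retention rule applied by ``process`` (a marker is dropped only if its coverage is below ``--min-cov`` in every individual) can be illustrated with a short Python sketch. This is an illustration of the rule as described above, not the actual implementation; the function name ``is_retained`` and the example coverages are hypothetical.

```python
def is_retained(coverages, min_cov=1):
    """True if at least one individual reaches min_cov for this marker."""
    return any(coverage >= min_cov for coverage in coverages)


# Hypothetical coverages of two markers in three individuals.
markers = {"marker_a": [0, 0, 1], "marker_b": [2, 1, 4]}

for threshold in (1, 5):
    kept = [name for name, cov in markers.items() if is_retained(cov, threshold)]
    print(threshold, kept)
```

With a threshold of 1 both markers are kept, whereas a threshold of 5 would discard both at this stage, which is why a low value is advised for ``process``.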
Computing the distribution of markers between sexes
---------------------------------------------------
The main analysis implemented in ``radsex`` computes a table summarizing the distribution of all markers between males and females. This analysis is performed with the ``distrib`` command:
::
radsex distrib --input-file coverage_table.tsv --output-file distribution.tsv --popmap-file popmap.tsv --min-cov 5
The ``--input-file`` parameter specifies the location of the coverage table generated in the previous step, which was *coverage_table.tsv* in our case. The ``--output-file`` parameter specifies the location of the output file, *i.e.* the table of distribution of markers between males and females, which is *distribution.tsv* here. The ``--popmap-file`` parameter specifies the location of the population map (see the :ref:`population-map` section for details), which we named *popmap.tsv* in this example. Finally, the ``--min-cov`` parameter specifies the minimum coverage to consider a marker present in an individual, and was set to *5* here.
With our setup, this step completed in **36 seconds** with a peak memory usage of **4 MB**.
The resulting file *distribution.tsv* is a tabulated file described in the :ref:`sex-distribution-file` section. This file can be visualized with ``radsex-vis`` using the ``plot_sex_distribution()`` function:
::
radsexvis::plot_sex_distribution("distribution.tsv", output_file_path = "distribution.png")
To generate a basic plot, the only required parameter is the full path to a distribution table (simplified as "distribution.tsv" in this example). The ``output_file_path`` parameter specifies the path to an output file where the figure will be saved; if this parameter is not specified, the plot will be generated in the default R graphic device. For a full description of the ``plot_sex_distribution()`` function, including additional parameters, check the TODO_RADSEXVIS_SECTION.
The resulting figure is displayed below:
.. image:: ../../example/figures/distribution.png
This figure is a tile plot with number of males on the x-axis and number of females on the y-axis. The color of a tile at coordinates (**x**, **y**) indicates the number of markers that were present in any **x** males and any **y** females. For instance, in this figure, there were between 25 and 99 markers found in 29 males (not necessarily always the same 29 males) and in 0 females. Tiles for which association with sex is significant (chi-squared test, using Bonferroni correction) are highlighted in red. Many markers found predominantly in males are significantly associated with sex, indicating that an XX/XY system determines sex in this species. Interestingly, there are no markers found in all males or all but one male and absent from all females, *i.e.* no markers found at positions (30, 0) and (31, 0).
Extracting markers significantly associated with sex
----------------------------------------------------
The ``signif`` command of RADSex extracts all markers for which association with sex is significant. In this case, these markers are the ones represented by the tiles highlighted in red in the previous figure. To extract all significant markers from our dataset, run the following command:
::
radsex signif --input-file coverage_table.tsv --output-file significant_markers.tsv --popmap-file popmap.tsv --min-cov 5
The ``--input-file`` parameter specifies the location of the coverage table generated in the ``process`` step, which was *coverage_table.tsv* in our case. The ``--output-file`` parameter specifies the location of the output file, in this case a subset of the table of coverage, which we named *significant_markers.tsv* here. The ``--popmap-file`` parameter specifies the location of the population map (see the :ref:`population-map` section for details), which we named *popmap.tsv* in this example. Finally, the ``--min-cov`` parameter specifies the minimum coverage to consider a marker present in an individual, and was set to *5* to match the value used in the previous analysis.
The subset of the coverage table generated by ``signif`` can be visualized with ``radsex-vis`` using the ``plot_coverage()`` function:
::
radsexvis::plot_coverage("significant_markers.tsv", output_file_path = "significant_markers.png", popmap_file_path = "popmap.tsv")
To generate a basic plot, the only required parameter is the full path to the subset of the coverage table (simplified as "significant_markers.tsv" in this example). The ``output_file_path`` parameter specifies the path to an output file where the figure will be saved; if this parameter is not specified, the plot will be generated in the default R graphic device. The ``popmap_file_path`` parameter can be specified to color male and female IDs in the resulting figure. For a full description of the ``plot_coverage()`` function, including additional parameters, check the TODO_RADSEXVIS_SECTION.
The resulting figure is displayed below:
.. image:: ../../example/figures/significant_markers.png
This figure is a heatmap with individuals on the x-axis and markers on the y-axis. The color of a tile at coordinates (**x**, **y**) indicates the coverage of marker **y** in individual **x**. Both individuals and markers can be clustered based on this coverage, and clustering dendrograms are displayed by default. If a popmap is specified, male and female IDs are colored differently. In this example, two males cluster with the females, in agreement with the results from ``distrib`` where male-specific markers were always missing from two males. These two males are actually genetic females whose sex was mis-assigned.
.. note:: For convenience, significant markers can be exported in FASTA format using the parameter ``--output-format fasta``. Headers contain information about the sex distribution of each marker, as described in the :ref:`fasta-file` section.
With our setup, this step completed in **37 seconds** with a peak memory usage of **6 MB**.
Mapping markers to a reference genome
-------------------------------------
When a reference genome is available, markers can be aligned to it in order to locate sex-differentiated regions. This is done using the ``map`` command:
::
radsex map --input-file coverage_table.tsv --output-file mapping_results.tsv --popmap-file popmap.tsv --genome-file genome.fasta --min-cov 5
The ``--input-file`` parameter specifies the location of the coverage table generated in the ``process`` step, which was *coverage_table.tsv* in our case. The ``--output-file`` parameter specifies the location of the output file, in this case a table with mapping information, which we named *mapping_results.tsv* here. The ``--popmap-file`` parameter specifies the location of the population map (see the :ref:`population-map` section for details), which we named *popmap.tsv* in this example. The ``--genome-file`` parameter specifies the location of the reference genome file in FASTA format, which we named *genome.fasta* in this example. Finally, the ``--min-cov`` parameter specifies the minimum coverage to consider a marker present in an individual, and was set to *5* to match the value used in the previous analysis.
The resulting file *mapping_results.tsv* is a tabulated file described in the :ref:`mapping-results-file` section. This file can be visualized with ``radsex-vis`` using the ``plot_genome()`` function:
::
radsexvis::plot_genome("mapping_results.tsv", "genome.fasta.lengths", chromosomes_names_file_path = "chromosomes_names.tsv", output_file_path = "mapping_genome.png")
To generate a basic plot, the only required parameters are the full path to the mapping results table (simplified as "mapping_results.tsv" in this example), and the full path to the genome contig lengths generated by ``map`` ("genome.fasta.lengths" here). The ``output_file_path`` parameter specifies the path to an output file where the figure will be saved; if this parameter is not specified, the plot will be generated in the default R graphic device. The ``chromosomes_names_file_path`` parameter can be specified to rename the chromosomes with chosen IDs specified in the file. For a full description of the ``plot_genome()`` function, including additional parameters, check the TODO_RADSEXVIS_SECTION.
The resulting figure is displayed below:
.. image:: ../../example/figures/mapping_genome.png
This figure is a circos plot in which each sector corresponds to a chromosome, with all unplaced scaffolds grouped in an additional sector (not shown in this example as there are no unplaced scaffolds in this genome). The top track gives the sex-bias of each marker: 1 if the marker is present in all males and no females, and -1 if the marker is present in all females and no males. The bottom track shows the probability of association with sex (chi-squared test, using Bonferroni correction).
Results for a specific region can be visualized with ``radsex-vis`` using the ``plot_contig()`` function:
::
radsexvis::plot_contig("mapping_results.tsv", "genome.fasta.lengths", "Chr01", chromosomes_names_file_path = "chromosomes_names.tsv", output_file_path = "mapping_contig.png")
This function uses the same parameters as ``plot_genome()``, with the addition of a parameter giving the contig to be plotted, *Chr01* here. For a full description of the ``plot_contig()`` function, including additional parameters, check the TODO_RADSEXVIS_SECTION.
The resulting figure is displayed below:
.. image:: ../../example/figures/mapping_contig.png
In this figure, both sex-bias and probability of association with sex, as defined in the genome plot, are plotted against position on the specified contig.
With our setup, this step completed in **9 min 36 seconds** with a peak memory usage of **1.3 GB**, most of the time being spent indexing the genome. If the genome is already indexed with BWA, this step completes in **55 seconds**.
Going further
-------------
In this example, we showed the most commonly used functions of ``radsex`` and ``radsex-vis``, mostly using default parameters. In general, it is recommended to run ``distrib`` with multiple values of coverage (for instance 1, 2, 5, and 10), to better understand the dataset.
To get the full information on each function of ``radsex``, check the :ref:`full-usage` section.
Getting started
===============
Installation
------------
Requirements
~~~~~~~~~~~~
* A C++11 compliant compiler (GCC >= 4.8.1, Clang >= 3.3)
* The zlib library (usually installed on Linux by default)
.. _install-release:
Installation
~~~~~~~~~~~~
RADSex can be installed from one of the `release packages <https://github.com/RomainFeron/RadSex/releases>`_.
The latest stable development version can be installed directly from the GitHub repository.
**1. Install the latest release**
* Download the latest release from `GitHub <https://github.com/RomainFeron/RadSex/releases>`_
* Unzip the archive
* Navigate to the ``RADSex`` directory
* Run ``make``
**2. Install the latest stable development version**
To install the latest stable version of RADSex directly from the GitHub repository, run the following commands:
::
git clone https://github.com/RomainFeron/RADSex.git
cd RADSex
make
The compiled ``radsex`` binary will be located in **RADSex/bin/**.
Update RADSex
~~~~~~~~~~~~~
To update RADSex, you can download the latest stable release and install it as described in the :ref:`install-release` section.
If you installed RADSex directly from the GitHub repository, update RADSex by running the following commands from the **RADSex** directory:
::
git pull
make rebuild
Before starting
---------------
Before running the pipeline, you should prepare the following files:
* A **set of demultiplexed reads**. The current version of RADSex does not implement demultiplexing. Raw sequencing reads can be demultiplexed using `Stacks <http://catchenlab.life.illinois.edu/stacks/comp/process_radtags.php>`_ or `pyRAD <http://nbviewer.jupyter.org/gist/dereneaton/af9548ea0e94bff99aa0/pyRAD_v.3.0.ipynb#The-seven-steps-described>`_.
* A **group information file (popmap)**: a tabulated file with individual ID as the first column and sex as the second column. It is important that the individual IDs in the popmap are the same as the names of the demultiplexed reads files (see the :ref:`population-map` section).
* To align the sequences to a genome: the **genome file** in fasta format.
.. note:: When visualizing ``map`` results with ``radsex-vis``, linkage groups / chromosomes are automatically inferred from scaffold names in the reference sequence if their name starts with *LG*, *CHR*, or *NC* (case-insensitive). If chromosomes are named differently in the reference genome, you should prepare a tabulated file with reference contig ID in the first column and corresponding chromosome name in the second column (see the *Chromosomes names file* section).
Running RADSex
--------------
.. _computing-depth-table:
Computing the markers depth table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The first step of RADSex is to create a table of marker depths for the dataset using the ``process`` command:
::
radsex process --input-dir ./samples --output-file markers_table.tsv --threads 16 --min-depth 1
In this example, demultiplexed reads are stored in **./samples** and the markers table generated by ``process`` will be stored in **markers_table.tsv**. The parameter ``--threads`` specifies the number of threads to use, and ``--min-depth`` specifies the minimum depth to consider a marker present in an individual: markers that do not reach this depth in at least one individual are not retained in the markers table.
It is advised to keep the minimum depth at 1 (the default value) for this step, as it can be adjusted for each analysis later.
The resulting file **markers_table.tsv** is a table with N + 2 columns, where *N* is the number of individuals in the dataset:
* **ID** : marker ID.
* **Sequence** : marker sequence.
* For each individual, the depth of this marker in this individual.
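The layout described above (ID, sequence, then one depth column per individual) can be read with a few lines of Python. This is a minimal sketch that assumes a tab-separated file whose first line is the header; a file with additional comment lines would need extra handling. The function name ``read_markers_table`` is hypothetical.

```python
import csv
import io


def read_markers_table(handle):
    """Yield (marker_id, sequence, {individual: depth}) for each row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    individuals = header[2:]  # columns after ID and Sequence
    for row in reader:
        depths = dict(zip(individuals, (int(value) for value in row[2:])))
        yield row[0], row[1], depths


# Hypothetical two-individual table used only for illustration.
example = "ID\tSequence\tindividual_1\tindividual_2\n0\tTGCA\t0\t15\n"
for marker_id, sequence, depths in read_markers_table(io.StringIO(example)):
    print(marker_id, sequence, depths)
```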
Computing the distribution of markers between sexes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After generating the markers depth table, the ``distrib`` command computes the distribution of markers between groups:
::
radsex distrib --markers-table markers_table.tsv --output-file distribution.tsv --popmap popmap.tsv --min-depth 5
In this example, ``--markers-table`` is the table generated in the :ref:`computing-depth-table` section, and the distribution of markers between groups will be stored in **distribution.tsv**.
The group (here, the sex) of each individual in the population is given by **popmap.tsv** (see the :ref:`population-map` section).
The minimum depth to consider a marker present in an individual is set to 5, meaning that markers with depth lower than 5 in an individual will not be considered present in this individual.
The resulting file **distribution.tsv** is a table with six columns:
* **Males** : number of males in which a marker was present.
* **Females** : number of females in which a marker was present.
* **Markers** : number of markers present in the corresponding number of males and females.
* **P** : p-value of a chi-squared test for association with sex.
* **Signif** : significant association with sex (True / False).
* **Bias** : sex-bias of a marker [-1, 1].
More details about the distribution file can be found in the :ref:`sex-distribution-file` section.
This distribution can be visualized with the ``plot_sex_distribution()`` function of `RADSex-vis <https://github.com/RomainFeron/RADSex-vis>`_, which generates a tile plot of marker counts with number of males on the x-axis and number of females on the y-axis.
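To make the content of the distribution table concrete, the following Python sketch tallies markers by the number of males and females in which they are present. It is an illustration of the counting described above, not the ``distrib`` implementation. The function name ``sex_distribution`` is hypothetical, and two behaviors are assumptions: individuals whose sex is not "M" or "F" are ignored, and markers present in no individual are not counted.

```python
from collections import Counter


def sex_distribution(marker_depths, popmap, min_depth=5):
    """Count markers for each (number of males, number of females) pair.

    marker_depths: iterable of {individual: depth} dicts, one per marker.
    popmap: {individual: sex}, with sex encoded as "M", "F", or "N".
    """
    counts = Counter()
    for depths in marker_depths:
        males = sum(1 for ind, depth in depths.items()
                    if depth >= min_depth and popmap.get(ind) == "M")
        females = sum(1 for ind, depth in depths.items()
                      if depth >= min_depth and popmap.get(ind) == "F")
        if males + females > 0:  # assumption: skip markers absent everywhere
            counts[(males, females)] += 1
    return counts


# Depths and sexes taken from the example tables in this documentation.
popmap = {"individual_1": "M", "individual_2": "M", "individual_3": "F",
          "individual_4": "N", "individual_5": "F"}
individuals = list(popmap)
depth_rows = [[0, 15, 24, 17, 21], [20, 18, 3, 26, 4],
              [2, 1, 5, 16, 0], [14, 29, 23, 2, 19]]
markers = [dict(zip(individuals, row)) for row in depth_rows]
print(sex_distribution(markers, popmap, min_depth=5))
```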
Extracting markers significantly associated with sex
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Markers significantly associated with sex can be obtained with the ``signif`` command:
::
radsex signif --markers-table markers_table.tsv --output-file markers.tsv --popmap popmap.tsv --min-depth 5 [ --output-fasta ]
In this example, ``--markers-table`` is the table generated in the :ref:`computing-depth-table` section, and the markers significantly associated with sex are output to **markers.tsv**. The sex of each individual in the population is given by **popmap.tsv** (see the :ref:`population-map` section).
The minimum depth to consider a marker present in an individual is set to 5, meaning that markers with depth lower than 5 in an individual will not be considered present in this individual.
By default, the ``signif`` function generates an output file in the same format as the markers depth table. Markers can also be exported to a fasta file using the ``--output-fasta`` parameter (see the :ref:`fasta-file` section).
The markers table generated by ``signif`` can be visualized with the ``plot_depth()`` function of `RADSex-vis <https://github.com/RomainFeron/RADSex-vis>`_, which generates a heatmap showing the depth of each marker in each individual.
Aligning markers to a genome
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Markers can be aligned to a reference genome using the ``map`` command:
::
radsex map --markers-file markers_table.tsv --output-file mapping.tsv --popmap popmap.tsv --genome-file genome.fasta --min-quality 20 --min-frequency 0.1 --min-depth 5
In this example, ``--markers-file`` is the markers depth table generated in the :ref:`computing-depth-table` step, and the path to the reference genome file is given by ``--genome-file``; results will be stored in **mapping.tsv**. The sex of each individual in the population is given by **popmap.tsv** (see the :ref:`population-map` section), and the minimum depth to consider a marker present in an individual is set to 5, meaning that markers with depth lower than 5 in an individual will not be considered present in this individual.
The parameter ``--min-quality`` specifies the minimum mapping quality (as defined in `BWA <http://bio-bwa.sourceforge.net/bwa.shtml>`_) to consider a marker properly aligned and is set to 20 in this example. The parameter ``--min-frequency`` specifies the minimum frequency of a marker in the population to retain this marker and is set to 0.1 here, meaning that only sequences present in at least 10% of individuals of the population are aligned to the genome.
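The effect of ``--min-frequency`` can be sketched in Python as follows. This is an illustration only: the function name ``passes_min_frequency`` is hypothetical, and it assumes that presence is decided with the same minimum depth as in the other analyses, which the text above does not state explicitly.

```python
def passes_min_frequency(depths, min_depth=5, min_frequency=0.1):
    """True if the marker is present in at least min_frequency of individuals."""
    present = sum(1 for depth in depths if depth >= min_depth)
    return present / len(depths) >= min_frequency


# Hypothetical markers: present in 1 of 10 individuals, then 1 of 11.
print(passes_min_frequency([5] + [0] * 9))
print(passes_min_frequency([5] + [0] * 10))
```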
The resulting file ``mapping.tsv`` is a table with seven columns:
* **Contig :** name of the contig to which the marker was aligned.
* **Position :** position where the marker was aligned on the contig.
* **Length :** length of the contig to which the marker was aligned.
* **Marker_ID :** ID of the marker in the markers depth table.
* **Bias :** bias of the marker (see below).
* **P :** p-value of a chi-squared test for association with sex.
* **Signif** : *True* if the marker is significantly associated with sex, *False* otherwise.
The **bias** of a marker is defined as (Males / Total males ) - (Females / Total females), where *Males* and *Females* are the number of males and number of females in which the marker is present, and *Total males* and *Total females* are the total number of males and females in the population.
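The bias formula above translates directly into code. The function name ``sex_bias`` in this short Python sketch is hypothetical.

```python
def sex_bias(males, females, total_males, total_females):
    """(Males / Total males) - (Females / Total females)."""
    return males / total_males - females / total_females


print(sex_bias(3, 0, 3, 3))  # present in all males and no females
print(sex_bias(0, 3, 3, 3))  # present in all females and no males
```

The value ranges from -1 (present in all females and no males) to 1 (present in all males and no females).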
The results generated by ``map`` can be visualized with the ``plot_genome()`` function of `RADSex-vis <https://github.com/RomainFeron/RADSex-vis>`_, which generates a circular plot showing bias and association with sex for each marker aligned to the genome.
Mapping results for a specific contig can be visualized with the ``plot_contig()`` function to show the same metrics for a single contig.
.. RADSex documentation master file, created by
sphinx-quickstart on Thu Sep 13 15:17:16 2018.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
RADSex documentation
====================
RADSex is a software package to analyze RAD-Sequencing data. It is primarily designed to look for sex signal, but it can be used to compare any two populations.
The core idea of RADSex is to compare presence / absence of non-polymorphic markers between individuals in two populations. RADSex does not allow mismatches when grouping reads into markers. This means that each allele in a polyallelic gene is represented as a separate marker, whereas other RAD-Sequencing analysis software would usually group these alleles in a single polymorphic marker. Separating alleles from polymorphic markers enables RADSex to easily detect sex-specific alleles, using only minimum coverage of a marker as a parameter.
The basic input file of RADSex is a dataset of demultiplexed RAD reads. From this dataset, RADSex generates a table of coverage for each sequence in each individual. This table is then used to infer information about the type of sex-determination system, identify sex-biased sequences, map the RAD sequences to a reference genome, and recover potential polymorphic sex-biased markers.
Results from RADSex can be visualized with the ``radsex-vis`` R package, available here: https://github.com/INRA-LPGP/radsex-vis. The ``radsex-vis`` R package provides easy-to-use functions to generate visual representations of your data.
RADSex's API documentation generated with Doxygen is available `here <../../../doxygen/html/index.html>`_
Documentation summary
---------------------
.. toctree::
:maxdepth: 2
getting_started
example
usage
input_files
output_files
license
Input file formats
==================
Reads files
-----------
RADSex accepts demultiplexed reads files as first input. RADSex should work with any demultiplexed RAD-sequencing reads files regardless of technology (single / double digest) or enzyme. Input files can be in fasta or fastq formats, and can be compressed. RADSex uses file extensions to detect input files, and supports the following extensions: **.fa**, **.fa.gz**, **.fq**, **.fq.gz**, **.fasta**, **.fasta.gz**, **.fastq**, **.fastq.gz**, **.fna**, and **.fna.gz**.
.. _population-map:
Population map
--------------
A population map file is a tabulated file (TSV, tab as a separator) without header, with individual ID in the first column and sex in the second column. Sex is encoded as **"M"** for males, **"F"** for females, and **"N"** for undetermined. An example of population map is given below:
::
individual_1 M
individual_2 M
individual_3 F
individual_4 N
individual_5 F
Individual IDs can be anything, but it is important that they correspond to the name of the demultiplexed files.
For instance, the reads file for *individual_1* should be named ``individual_1.fastq.gz`` (or any fasta/fastq format supported by your demultiplexer).
If you are using Stacks with a barcodes file for demultiplexing, just make sure that individual IDs in the barcodes file and in the population map are the same.
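A population map in the format described above can be loaded and validated with a short Python sketch. The function name ``read_popmap`` is hypothetical, and rejecting any sex code other than "M", "F", or "N" is a choice made for this example rather than documented RADSex behavior.

```python
import csv
import io

VALID_SEXES = {"M", "F", "N"}


def read_popmap(handle):
    """Return {individual_id: sex} from a headerless two-column TSV."""
    popmap = {}
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # tolerate blank lines
        if len(row) != 2 or row[1] not in VALID_SEXES:
            raise ValueError(f"line {line_number}: expected '<ID><tab><M|F|N>'")
        popmap[row[0]] = row[1]
    return popmap


example = "individual_1\tM\nindividual_3\tF\nindividual_4\tN\n"
print(read_popmap(io.StringIO(example)))
```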
Chromosomes names file
----------------------
Genome-wide results from the ``map`` command are visualized using the ``plot_genome()`` function of ``radsex-vis``.
This function can automatically detect chromosomes in the reference file if their name starts with 'LG' or 'chr' (case-insensitive). If this is not the case, you should provide a chromosomes names file to ``plot_genome()``.
This file should be a tabulated file without header, with scaffold ID in the reference in the first column and corresponding chromosome name in the second column.
An example of chromosomes names file is given below for the `Northern Pike <https://www.ncbi.nlm.nih.gov/genome/?term=esox%20lucius>`_ genome.
::
NC_025968.3 LG01
NC_025969.3 LG02
NC_025970.3 LG03
NC_025971.3 LG04
NC_025972.3 LG05
NC_025973.3 LG06
NC_025974.3 LG07
NC_025975.3 LG08
NC_025976.3 LG09
NC_025977.3 LG10
NC_025978.3 LG11
NC_025979.3 LG12
NC_025980.3 LG13
NC_025981.3 LG14
NC_025982.3 LG15
NC_025983.3 LG16
NC_025984.3 LG17
NC_025985.3 LG18
NC_025986.3 LG19
NC_025987.3 LG20
NC_025988.3 LG21
NC_025989.3 LG22
NC_025990.3 LG23
NC_025991.3 LG24
NC_025992.3 LG25
.. note:: Any scaffold included in the chromosomes names file will be considered a chromosome to be plotted as a sector. In most NCBI genomes, mitochondria are also named NC_XXX. As mitochondria are usually too small to be included as a sector in the circos plot, you should not add them to the chromosomes names file.
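The naming logic described in this section can be sketched in Python. This is not the ``radsex-vis`` implementation: the function names are hypothetical, the prefixes are those quoted in this section, and giving the names file precedence over automatic detection is an assumption.

```python
import csv
import io


def read_chromosome_names(handle):
    """Return {scaffold_id: chromosome_name} from a headerless two-column TSV."""
    return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t") if row}


def chromosome_label(scaffold_id, names, prefixes=("LG", "CHR")):
    """Return the chromosome name for a scaffold, or None if it is unplaced."""
    if scaffold_id in names:  # assumption: the names file takes precedence
        return names[scaffold_id]
    if scaffold_id.upper().startswith(prefixes):  # case-insensitive prefix match
        return scaffold_id
    return None


names = read_chromosome_names(io.StringIO("NC_025968.3\tLG01\nNC_025969.3\tLG02\n"))
for scaffold in ("NC_025968.3", "chr5", "scaffold_12"):
    print(scaffold, chromosome_label(scaffold, names))
```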
Output file formats
===================
.. _markers-depths-table-file:
Markers depth table
-------------------
Coverage tables are tabulated files with header generated by the ``process`` command for the entire dataset, and by the ``subset`` and ``signif`` commands for a subset of sequences. The first column contains the marker ID, and the second column contains the sequence itself. Each other column contains the coverage of the corresponding marker in a given individual.
An example of coverage table is given below (sequences were shortened for practical reasons):
::
ID Sequence individual_1 individual_2 individual_3 individual_4 individual_5
0 TGCA..TATT 0 15 24 17 21
1 TGCA..GACC 20 18 3 26 4
2 TGCA..ATCG 2 1 5 16 0
3 TGCA..CCGA 14 29 23 2 19
.. _sex-distribution-file:
Distribution of markers between groups
--------------------------------------
**Table format**
A table of distribution of markers between sexes is a tabulated file with header generated by the ``distrib`` command.
The first and second columns indicate the number of males and females in which a marker is present, the third column contains the number of markers found in the corresponding number of males and females, the fourth column contains the p-value of a chi-squared test for association with sex on the number of males and females, and the fifth column indicates whether this p-value is significant after Bonferroni correction.
An example of sex distribution table is given below for 3 males and 3 females:
::
Males Females Sequences P Signif
0 1 7 1 False
0 2 3 0.39 False
0 3 1 0.10 False
1 0 6 1 False
1 1 5 1 False
1 2 1 1 False
1 3 2 0.39 False
2 0 3 0.39 False
2 1 8 1 False
2 2 4 1 False
2 3 2 1 False
3 0 4 0.10 False
3 1 7 0.39 False
3 2 6 1 False
3 3 9 1 False
In this example, there are 68 sequences in total, therefore sequences are significantly associated with sex if the p-value of a chi-squared test on the number of males and females is lower than 0.05 / 68 = 0.00074 (Bonferroni correction).
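The threshold arithmetic and the p-values in the example table can be reproduced with the Python sketch below. The function names are hypothetical. The use of Yates' continuity correction is an inference: it reproduces the p-values of the example (0.10 and 0.39), but this documentation does not state which variant of the chi-squared test RADSex uses.

```python
import math


def chi2_yates_p(males, females, total_males, total_females):
    """P-value of a Yates-corrected chi-squared test on a 2x2 presence table."""
    a, b = males, total_males - males        # males with / without the marker
    c, d = females, total_females - females  # females with / without the marker
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:  # marker present in all or in no individuals
        return 1.0
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * corrected ** 2 / denominator
    return math.erfc(math.sqrt(chi2 / 2))  # survival function for 1 degree of freedom


def bonferroni_threshold(n_tests, alpha=0.05):
    """Significance threshold after Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(68)
p_value = chi2_yates_p(0, 3, 3, 3)
print(round(threshold, 5), round(p_value, 2), p_value < threshold)
```

With only 3 males and 3 females, even a marker found in all individuals of one sex and none of the other cannot reach the corrected threshold, which is why every row of the example table has ``Signif`` set to False.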
**Matrix format**
The distribution of sequences between sexes can also be output as a matrix, which is a tabulated file without header, with number of females as rows and number of males as columns.
The sex distribution matrix for the example described above is given below:
::
0 6 3 4
7 5 8 7
3 1 4 6
1 2 2 9
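The relation between the table format and the matrix format can be sketched in Python, with females indexing rows and males indexing columns as in the example matrix. The function name ``distribution_matrix`` is hypothetical; the counts are copied from the example table.

```python
def distribution_matrix(counts, total_males, total_females):
    """Return rows indexed by number of females, columns by number of males."""
    return [[counts.get((males, females), 0) for males in range(total_males + 1)]
            for females in range(total_females + 1)]


# (males, females) -> number of sequences, from the example table.
EXAMPLE_COUNTS = {
    (0, 1): 7, (0, 2): 3, (0, 3): 1,
    (1, 0): 6, (1, 1): 5, (1, 2): 1, (1, 3): 2,
    (2, 0): 3, (2, 1): 8, (2, 2): 4, (2, 3): 2,
    (3, 0): 4, (3, 1): 7, (3, 2): 6, (3, 3): 9,
}

for row in distribution_matrix(EXAMPLE_COUNTS, 3, 3):
    print("\t".join(str(value) for value in row))
```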
.. _fasta-file:
Fasta files
-----------
FASTA files are generated by the ``subset`` and ``signif`` commands for a subset of sequences, if the ``--output-format`` parameter is set to *fasta*.
In the ``subset`` analysis, FASTA headers are generated with the following pattern:
``<ID>_<number of males>M_<number of females>F_cov:<minimum coverage>``
In the ``signif`` analysis, another field containing the p-value of association with sex is added:
``<ID>_<number of males>M_<number of females>F_cov:<minimum coverage>_p:<p-value>``
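The two header patterns above can be illustrated with a small Python sketch. The function name ``fasta_header`` is hypothetical, and the way the p-value is formatted as text is an assumption, since the patterns do not specify it.

```python
def fasta_header(marker_id, males, females, min_cov, p_value=None):
    """Build a header following the subset pattern, or signif if p_value is given."""
    header = f"{marker_id}_{males}M_{females}F_cov:{min_cov}"
    if p_value is not None:
        header += f"_p:{p_value}"  # assumption: default float formatting
    return header


print(fasta_header(13, 29, 0, 5))
print(fasta_header(13, 29, 0, 5, p_value=0.001))
```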
.. _mapping-results-file:
Alignment results
-----------------
Results from the ``map`` command are output as a tabulated file with header.
The first column contains the sequence ID, the second column contains the contig to which the sequence mapped in the reference genome, and the third column contains the position where the sequence mapped on the contig.
The fourth column contains a sex-bias value, defined as ``(number of males with the sequence) / (total number of males) - (number of females with the sequence) / (total number of females)``.
The fifth column contains the p-value of a chi-squared test for association with sex, and the sixth column indicates whether this p-value is significant after Bonferroni correction.
An example of mapping results is given below:
::
Marker Contig Position SexBias P Signif
0 LG09 10052920 0 1 False
1 LG45 4008419 0 1 False
2 LG06 20521435 0 1 False
3 LG24 7643946 0.13 0.44 False
4 LG06 16975491 0 1 False
5 LG27 16580048 0 1 False
6 LG49 7206356 0.03 1 False
7 LG30 5571989 0 1 False
8 LG05 20094761 0 1 False
9 LG14 20088495 0 1 False
10 LG34 11566459 -0.04 1 False
11 LG21 17338149 0 1 False
12 LG05 14652417 0.13 0.55 False
13 LG25 23851527 0.75 0.001 True
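Mapping results in this format can be filtered with a short Python sketch, for instance to list the markers significantly associated with sex. It assumes a tab-separated file with the header shown in the example; the function name ``significant_markers`` is hypothetical.

```python
import csv
import io


def significant_markers(handle):
    """Return the rows (as dicts) whose Signif column is 'True'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row["Signif"] == "True"]


# Three rows copied from the example above.
example = (
    "Marker\tContig\tPosition\tSexBias\tP\tSignif\n"
    "3\tLG24\t7643946\t0.13\t0.44\tFalse\n"
    "12\tLG05\t14652417\t0.13\t0.55\tFalse\n"
    "13\tLG25\t23851527\t0.75\t0.001\tTrue\n"
)
for row in significant_markers(io.StringIO(example)):
    print(row["Marker"], row["Contig"], row["Position"], row["SexBias"])
```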