Commit 8968b5fb authored by RomainFeron's avatar RomainFeron
Browse files

Merged branch dev for 1.0.0

parents bb588f77 e5cade64
......@@ -23,7 +23,6 @@ tags
.DS_Store
.directory
*.debug
Makefile*
*.prl
*.app
moc_*.cpp
......@@ -72,11 +71,10 @@ Thumbs.db
*.exe
# Personal
test/
tests*
bin/
build/
*.pro
*.pro.user
/include/*/*.o
/include/*/*.a
*/_build/*
callgrind*
# Compiler options
CC = g++
OPTCFLAGS = -O2
CFLAGS = -Wall -Wno-maybe-uninitialized -std=c++11 $(OPTCFLAGS)
LDFLAGS = -pthread -lstdc++ -lz -L$(INCLUDE)/bwa -lbwa
# Directory organisation
BASEDIR = .
BIN = $(BASEDIR)/bin
BUILD = $(BASEDIR)/build
INCLUDE = $(BASEDIR)/include
SRC = $(BASEDIR)/src
CPP = $(wildcard $(SRC)/*.cpp)
# Get number of parallel jobs
MAKE_PID := $(shell echo $$PPID)
JOBS := $(shell ps T | sed -n 's/.*$(MAKE_PID).*$(MAKE).* \(-j\|--jobs\) *\([0-9][0-9]*\).*/\2/p')
ifeq ($(JOBS),)
JOBS := 1
endif
# Object files inferred from cpp files
OBJS = $(addprefix $(BUILD)/, $(notdir $(CPP:.cpp=.o)))
# Target
TARGETS = $(BIN)/radsex
# Declare phony targets (i.e. targets which are not files)
.PHONY: all clean clean-bwa clean-kfun clean-all rebuild rebuild-all docs
# Main rule
all: $(BUILD) $(BIN) $(TARGETS)
# Build directory
$(BUILD):
if [ ! -e $@ ]; then \
mkdir $@; \
fi;
# Bin directory
$(BIN):
if [ ! -e $@ ]; then \
mkdir $@; \
fi;
# Build BWA
$(INCLUDE)/bwa/libbwa.a:
(cd $(INCLUDE)/bwa && $(MAKE) -j $(JOBS))
# Clean BWA
clean-bwa:
(cd $(INCLUDE)/bwa && $(MAKE) clean)
# Build kfun
$(INCLUDE)/kfun/kfun.o: $(INCLUDE)/kfun/kfun.cpp
$(CC) $(CFLAGS) -I $(INCLUDE) -c -o $@ $^
# Clean BWA
clean-kfun:
(rm $(INCLUDE)/kfun/kfun.o)
# Clean RADSex files
clean:
rm -rf $(BUILD)/*.o
rm -rf $(BIN)/*
# Clean all files
clean-all: clean clean-bwa clean-kfun
# Rebuild RADSex only
rebuild:
$(MAKE) clean
$(MAKE) -j $(JOBS)
# Rebuild RADSex and dependencies
rebuild-all:
$(MAKE) clean-all
$(MAKE) -j $(JOBS)
# Linking
$(BIN)/radsex: $(OBJS) $(INCLUDE)/kfun/kfun.o
$(CC) $(CFLAGS) -I $(INCLUDE) -o $(BIN)/radsex $^ $(LDFLAGS)
# Build a single object file. Added libs as dependency so they are built before object files
$(BUILD)/%.o: $(SRC)/%.cpp $(INCLUDE)/bwa/libbwa.a
$(CC) $(CFLAGS) -I $(INCLUDE) -c -o $@ $<
# Build doc with sphinx and doxygen
docs:
(cd docs && make)
clean-docs:
(cd docs && make clean)
rebuild-docs:
(cd docs && make rebuild)
[![DOI](https://zenodo.org/badge/86720601.svg)](https://zenodo.org/badge/latestdoi/86720601)
# RADSex
Current pre-release : 0.2.0
The RADSex pipeline is **currently under development** and has not been officially released yet.
Missing features are been implemented, some bugs are to be expected in this current development version, and parameters are subject to change.
Please contact me by email or on Github, or open an issue if you encounter bugs or would like to discuss a feature !
......@@ -7,103 +11,135 @@ Please contact me by email or on Github, or open an issue if you encounter bugs
## Overview
RADSex is a software package for the analysis of sex-determination using RAD-Sequencing data.
The `process` function generates a data structure summarizing a set of demultiplexed RAD reads,
and other functions use this data structure to infer information about the type of sex-determination system, identify sex-biased sequences, and map the RAD sequences to a reference genome.
The results of RADSex are meant to be visualized with the `radsex-vis` R package, available here: https://github.com/RomainFeron/RADSex-vis.
The `process` command generates a data structure summarizing a set of demultiplexed RAD reads,
and other commands use this data structure to:
- infer the type of sex-determination system
- identify sex-biased markers
- align markers to a genome and identify genomic regions differentiated between sexes
- compute marker depth statistics
RADSex results can be visualized with the `radsex-vis` R package available here: https://github.com/RomainFeron/RADSex-vis.
Although RADSex has been developed specifically to study sex-determination, it was designed to be flexibla and can be used to compare any two populations.
This pipeline was developed in the [LPGP](https://www6.rennes.inra.fr/lpgp/) lab from INRA, Rennes, France for the PhyloSex project, which investigates sex determining factors in a wide range of fish species.
## Documentation
This README file contains a simple documentation, including a basic installation guide as well as a quick start section.
The full documentation for RADSex (still under construction) is available [there](https://radsex.readthedocs.io/en/latest).
It contains a complete Getting Started section, a detailed usage for all functions, and real-life datasets examples covering many situations.
This README file includes a basic installation guide and a quick start section. The full documentation for RADSex, including a complete example walktrhough, is available [here](https://romainferon.github.io/RADSex/).
## Installation
## Requirements
### Requirements
- A C++11 compliant compiler (GCC >= 4.8.1, Clang >= 3.3)
- The zlib library (which should be installed on linux by default)
- The zlib library (usually installed on linux by default)
## Installation
### Install the latest official release
- Clone: `git clone https://github.com/RomainFeron/RadSex.git`
- *Alternative: download the archive and unzip it*
- Go to the RadSex directory (`cd RadSex`)
- Download the [latest release](https://github.com/RomainFeron/RadSex/releases)
- Unzip the archive
- Navigate to the `RADSex` directory
- Run `make`
- The compiled `radsex` binary is located in `RadSex/bin/`
The compiled `radsex` binary will be located in **RADSex/bin/**.
### Install the latest stable development version
```bash
git clone https://github.com/RomainFeron/RADSex.git
cd RADSex
make
```
The compiled `radsex` binary will be located in **RADSex/bin/**.
### Install RADSex with Conda
RADSex is available in [Bioconda](https://bioconda.github.io/recipes/radsex/README.html?#recipe-Recipe%20&#x27;radsex&#x27;). To install RADSex with Conda, run the following command:
```bash
conda install -c bioconda radsex
```
## Quick start
#### Before starting
### Preparing the data
Before running the pipeline, you should prepare the following elements:
- A **set of demultiplexed reads**. The current version of RADSex does not implement demultiplexing;
raw sequencing reads can be demultiplexed using [Stacks](http://catchenlab.life.illinois.edu/stacks/comp/process_radtags.php)
or [pyRAD](http://nbviewer.jupyter.org/gist/dereneaton/af9548ea0e94bff99aa0/pyRAD_v.3.0.ipynb#The-seven-steps-described).
- A **population map** (popmap): a tabulated file with individual ID as the first column and sex as the second column.
It is important that the individual IDs in the popmap are the same as the names of the demultiplexed reads files (see the doc for details).
- If you want to map the sequences to a reference genome: a **reference genome** in fasta format.
Note that when visualizing `mapping` results with `radsex-vis`, linkage groups / chromosomes are automatically inferred from scaffold names in the reference sequence
if their name starts with *LG*, *chr*, or *chromosome* (case unsensitive).
If chromosomes are named differently in the reference genome, you should prepare a tabulated file with reference scaffold ID in the first column and corresponding chromosome name in the second column (see the doc for details)
#### Computing the coverage table
- A **population map** (popmap): a tabulated file with individual IDs in the first column and sex (or group) in the second column. Individual IDs in the popmap must be the same as the names of the demultiplexed reads files (*e.g.* 'individual1' for the reads file 'individual1.fq.gz')
- To align the markers to a genome: the **genome** sequence in a FASTA file.
Note that when visualizing `map` results with `radsex-vis`, linkage groups / chromosomes are automatically inferred from scaffold names in the reference sequence
if their name starts with *LG*, *CHR*, or *NC* (case unsensitive).
If chromosomes are named differently in the reference genome, you can use a tabulated file with contig ID in the first column and corresponding chromosome name in the second column (see the doc for details)
### Computing the marker depths table
The first step of RADSex is to create a table of marker depths for the entire dataset using the `process` command:
```bash
radsex process --input-dir ./samples --output-file markers_table.tsv --threads 16 --min-depth 1
```
In this example, demultiplexed reads are located in **./samples** and the markers table generated by `process` will be saved to **markers_table.tsv**. The parameter `--threads` specifies the number of threads to use, and `--min-depth` specifies the minimum depth to consider a marker present in an individual: markers which are not present with depth higher than this value in at least one individual will not be retained in the markers table.
It is advised to keep the minimum depth to the default value of 1 for this step, as it can be adjusted for each analysis later.
### Computing the distribution of markers between sexes
The `distrib` command computes the distribution of markers between males and females from a marker depths table:
The first step of RADSex is to create a table of coverage for the dataset using the `process` command:
```bash
radsex distrib --markers-table markers_table.tsv --output-file distribution.tsv --popmap popmap.tsv --min-depth 5 --groups M,F
```
`radsex process --input-dir ./samples --output-file coverage_table.tsv --threads 16 --min-coverage 1`
In this example, `--markers-table` is the table generated with `process` and the distribution of markers between males and females will be saved to **distribution.tsv**. The sex of each individual in the population is given by **popmap.tsv**. Groups of individuals to compare (as defined in the popmap) are specified manually with the parameter `--groups`. The minimum depth to consider a marker present in an individual is set to 5, meaning that markers with depth lower than 5 in an individual will not be considered present in this individual.
In this example, demultiplexed reads are stored in `./samples` and the coverage table generated by `process` will be stored in `coverage_table.tsv`. The parameter `--threads` specifies the number of threads to use.
The resulting distribution can be visualized with the `plot_sex_distribution()` function of [RADSex-vis](https://github.com/RomainFeron/RADSex-vis), which generates a tile plot of marker counts with number of males on the x-axis and number of females on the y-axis.
The parameter `--min-coverage` specifies the minimum coverage value to consider a sequence present in an individual:
sequences which are not present with coverage higher than this value in at least one individual will not be retained in the coverage table.
It is advised to keep the minimum coverage to 1 for this step, as it can be adjusted for each analysis later.
#### Computing the distribution of sequences between sexes
### Extracting markers significantly associated with sex
After generating the coverage table, the `distrib` command is used to compute the distribution of sequences between sexes:
Markers significantly associated with sex are obtained with the `signif` command:
`radsex distrib --input-file coverage_table.tsv --output-file distribution.tsv --popmap-file popmap.tsv --min-coverage 5`
```bash
radsex signif --markers-table markers_table.tsv --output-file markers.tsv --popmap popmap.tsv --min-depth 5 --groups M,F [ --output-fasta ]
```
In this example, the input file `--input-file` is the coverage table generated in the [previous step](#computing-the-coverage-table), and the distribution of sequences between sexes will be stored in `distribution.tsv`.
The sex of each individual in the population is given by `popmap.tsv` (see the [popmap section](#population-map) for details).
The minimum coverage to consider a sequence present in an individual is set to 5, meaning that sequences present with coverage (depth) lower than 5 in one individual will not be counted in this individual.
In this example, `--markers-table` is the table generated with `process` and markers significantly associated with sex are saved to **markers.tsv**. The sex of each individual in the population is given by **popmap.tsv** (see the :ref:`population-map` section). Groups of individuals to compare (as defined in popmap) are specified manually with the parameter `--groups`. The minimum depth to consider a marker present in an individual is set to 5, meaning that markers with depth lower than 5 in an individual will not be considered present in this individual.
This distribution can be visualized with the `plot_sex_distribution()` function of `radsex-vis`, which generates a tile plot.
By default, the `signif` function generates an output file in the same format as the markers depth table. Markers can also be exported to a fasta file using the parameter `--output-fasta`.
#### Extracting sequences significantly associated with sex
The markers table generated by `signif` can be visualized with the `plot_depth()` function of [RADSex-vis](https://github.com/RomainFeron/RADSex-vis), which generates a heatmap showing the depth of each marker in each individual.
Sequences significantly associated with sex can be obtained with the `signif` command:
`radsex signif --input-file coverage_table.tsv --output-file sequences.tsv --popmap-file popmap.tsv --min-coverage 5 [ --output-format fasta ]`
### Aligning markers to a genome
In this example, the input file `--input-file` is the coverage table generated in the [first step](#computing-the-coverage-table), and the sequences significantly associated with sex will be stored in `sequences.tsv`.
The sex of each individual in the population is given by `popmap.tsv` (see the [popmap section](#population-map) for details),
and the minimum coverage to consider a sequence present in an individual is set to 5 (see the [previous section](#computing-the-distribution-of-sequences-between-sexes)).
By default, the `signif` function exports a small coverage table; sequences can be exported to fasta using the `--output-format` parameter.
Markers can be aligned to a genome using the `map` command:
The coverage table generated by `signif` can be visualized with the `plot_coverage()` function of `radsex-vis`, which generates a heatmap of coverage.
```bash
radsex map --markers-file markers_table.tsv --output-file alignment_results.tsv --popmap popmap.tsv --genome-file genome.fasta --min-quality 20 --min-frequency 0.1 --min-depth 5 --groups M,F
```
#### Mapping sequences to a reference genome
In this example, `--markers-file` is the markers depth table generated with `process` and the path to the reference genome file is given by `--genome-file`; results will are saved to **alignment_results.tsv**. The sex of each individual in the population is given by **popmap.tsv** and the minimum depth to consider a marker present in an individual is set to 5, meaning that markers with depth lower than 5 in an individual will not be considered present in this individual. Groups of individuals to compare (as defined in the popmap) are specified manually with the parameter `--groups`
Sequences can be mapped to a reference genome using the `map` command:
The parameter `--min-quality` specifies the minimum mapping quality (as defined in [BWA](http://bio-bwa.sourceforge.net/bwa.shtml)) to consider a marker properly aligned and is set to 20 in this example. The parameter `--min-frequency` specifies the minimum frequency of a marker in the population to retain this marker and is set to 0.1 here, meaning that only sequences present in at least 10% of individuals of the population are aligned to the genome.
`radsex map --input-file coverage_table.tsv --output-file mapping.tsv --popmap-file popmap.tsv --genome-file genome.fasta --min-quality 20 --min-frequency 0.1 --min-coverage 5`
Alignment results from `map` can be visualized with the `plot_genome()` function of [RADSex-vis](https://github.com/RomainFeron/RADSex-vis), which generates a circular plot showing bias and association with sex for each marker aligned to the genome.
In this example, the input file `--input-file` is the coverage table generated in the [first step](#computing-the-coverage-table), the mapping results will be stored in `sequences.tsv`,
and the path to the reference genome file is given by `--genome-file`. The sex of each individual in the population is given by `popmap.tsv` (see the [popmap section](#population-map) for details),
and the minimum coverage to consider a sequence present in an individual is set to 5 (see the [previous section](#computing-the-distribution-of-sequences-between-sexes)).
The parameter `--min-quality` specifies the minimum mapping quality (as defined in [BWA](http://bio-bwa.sourceforge.net/bwa.shtml)) to consider a sequence mapped (`--min-quality`), here set to 20.
The parameter `--min-frequency` specifies the minimum frequency of a sequence in at least one sex; it is set to 0.1 here, meaning that only sequences present in at least 10% of individuals of one sex are retained for mapping.
Alignment results for a specific contig can be visualized with the `plot_contig()` function to show the same metrics for a single contig.
The mapping results generated by `map` can be visualized with the `plot_genome()` function of `radsex-vis`, which generates a circos plot for the entire genome.
Mapping results for a specific scaffold can be visualized with the `plot_contig()` function to generate a linear plot for the specified scaffold.
## LICENSE
Copyright (C) 2018 Romain Feron
Copyright (C) 2018-2020 Romain Feron
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
......
......@@ -3,83 +3,42 @@ CONFIG += console c++11
CONFIG -= app_bundle
CONFIG -= qt
SOURCES += main.cpp \
include/edlib/edlib.cpp \
include/kfun/kfun.cpp \
src/barcodes_file.cpp \
src/demultiplexing.cpp \
src/frequencies.cpp \
src/group_loci.cpp \
src/main.cpp \
src/mapping.cpp \
src/output.cpp \
src/parameters.cpp \
src/popmap_file.cpp \
src/process_reads.cpp \
src/radsex.cpp \
src/scaffold_lengths.cpp \
src/sequence_file.cpp \
src/sex_distribution.cpp \
src/significant_sequences.cpp \
src/stats.cpp \
src/subset.cpp \
src/utils.cpp \
include/bwa/bntseq.c \
include/bwa/bwa.c \
include/bwa/bwamem.c \
include/bwa/bwamem_extra.c \
include/bwa/bwamem_pair.c \
include/bwa/bwt.c \
include/bwa/bwt_gen.c \
include/bwa/bwtindex.c \
include/bwa/is.c \
include/bwa/kstring.c \
include/bwa/ksw.c \
include/bwa/kthread.c \
include/bwa/malloc_wrap.c \
include/bwa/QSufSort.c \
include/bwa/rle.c \
include/bwa/rope.c \
include/bwa/utils.c \
src/depth.cpp
LIBS += -L$$PWD/include/bwa/ -lbwa -pthread -static-libstdc++ -lz
INCLUDEPATH += $$PWD/include/bwa $$PWD/include
DEPENDPATH += $$PWD/include/bwa
PRE_TARGETDEPS += $$PWD/include/bwa/libbwa.a
HEADERS += \
include/bwa/bntseq.h \
include/bwa/bwa.h \
include/bwa/bwamem.h \
include/bwa/bwt.h \
include/bwa/kbtree.h \
include/bwa/khash.h \
include/bwa/kseq.h \
include/bwa/ksort.h \
include/bwa/kstring.h \
include/bwa/ksw.h \
include/bwa/kvec.h \
include/bwa/malloc_wrap.h \
include/bwa/QSufSort.h \
include/bwa/rle.h \
include/bwa/rope.h \
include/bwa/utils.h \
include/edlib/edlib.h \
include/kfun/kfun.h \
include/kseq/kseq.h \
src/analysis.h \
src/barcodes_file.h \
src/demultiplexing.h \
src/frequencies.h \
src/group_loci.h \
src/mapping.h \
src/output.h \
src/parameter.h \
src/arg_parser.h \
src/depth.h \
src/distrib.h \
src/freq.h \
src/map.h \
src/marker.h \
src/markers_table.h \
src/parameters.h \
src/popmap_file.h \
src/process_reads.h \
src/radsex.h \
src/scaffold_lengths.h \
src/sequence_file.h \
src/sex_distribution.h \
src/significant_sequences.h \
src/popmap.h \
src/process.h \
src/signif.h \
src/stats.h \
src/subset.h \
src/utils.h \
src/depth.h
src/utils.h
SOURCES += \
src/analysis.cpp \
src/depth.cpp \
src/distrib.cpp \
src/freq.cpp \
src/main.cpp \
src/map.cpp \
src/marker.cpp \
src/markers_table.cpp \
src/popmap.cpp \
src/process.cpp \
src/signif.cpp \
src/stats.cpp \
include/kfun/kfun.cpp \
src/subset.cpp
This diff is collapsed.
# Minimal makefile for Sphinx documentation
#
SPHINXOPTS = -b html -c .
SPHINXBUILD = sphinx-build
SOURCEDIR = src
BUILDDIR = html
DOXYGEN_BUILDDIR = html/doxygen
DOXYGEN_HTMLDIR = html/doxygen
DOXYGEN_SOURCEDIR = ../src
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SOURCEDIR = .
BUILDDIR = _build
.PHONY: all clean
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
all: $(BUILDDIR) $(DOXYGEN_BUILDDIR)
.PHONY: help Makefile
clean:
rm -rf $(BUILDDIR)
rm -rf $(DOXYGEN_BUILDDIR)
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
\ No newline at end of file
rebuild: clean $(BUILDDIR) $(DOXYGEN_BUILDDIR)
$(BUILDDIR): $(SOURCEDIR)/*.rst
@$(SPHINXBUILD) "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS)
$(DOXYGEN_BUILDDIR): Doxyfile $(DOXYGEN_SOURCEDIR)/*.h $(DOXYGEN_SOURCEDIR)/*.cpp
doxygen $^
mv $(DOXYGEN_BUILDDIR)/html/* $(DOXYGEN_HTMLDIR)
rm -rf $(DOXYGEN_BUILDDIR)/html
......@@ -16,13 +16,12 @@
# import sys
# sys.path.insert(0, os.path.abspath('.'))
import os
# -- Project information -----------------------------------------------------
source_suffix = ['.rst']
project = 'RADSex'
copyright = '2018, Romain Feron'
copyright = '2018-2020, Romain Feron'
author = 'Romain Feron'
# The short X.Y version
......
Example walkthrough
===================
In this example, we will run RADSex on a public *Oryzias latipes* RAD-Sequencing dataset. We will detail each step of the process, highlight important details, and show how to use the R package ``radsex-vis`` to generate plots from the output of ``radsex``. This guide assumes that ``radsex`` and the ``radsex-vis`` package have already been installed. For specific instruction about installing ``radsex`` and ``radsex-vis``, check the :ref:`install-release` section. All reported times and resources usage were measured on a desktop computer with an Intel i7-8700K 4.7 GHz processor, 32 Gb of memory, and a standard 7200 RPM Hard Disk Drive. The input data, results (except for ``process``), and figures can be found in the *example* directory.
Preparing the data
------------------
The RAD-Sequencing datasets used in this example are available on the Sequence Read Archive on NCBI. The reads were demultiplexed before being deposited on NCBI, and samples were grouped in two projects, males and females. The accession number for **female** samples is **SRS662264**, and the accession number for **male** samples is **SRS662265**. For convenience, simple scripts to download male and female samples from the EBI ftp can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/download_female_samples.sh>`__ for female samples and `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/download_male_samples.sh>`_ for male samples. This dataset was published in `Wilson et al 2014 <http://www.genetics.org/content/early/2014/09/18/genetics.114.169284>`__.
The population map describing the sex of each samples for this dataset can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/population_map.tsv>`__. The genome used for mapping markers with `radsex map` was that of a HSOK strain, NCBI accession number **GCA_002234695.1** (`link <https://www.ncbi.nlm.nih.gov/assembly/GCA_002234695.1>`_). The chromosomes names file used to display chromosomes with nicer names in the genome mapping plot can be found `here <https://github.com/RomainFeron/RadSex/tree/master/example/oryzias_latipes/data/chromosomes_names.tsv>`__.
.. note:: RADSex uses file names to generate individual IDs. Therefore, individual names in the population map should correspond to the file names without their extensions. Check the file names and population map provided above for an example of how to build the population map from file names. More details about the population map can be found in the :ref:`population-map` section.
From now on, we will assume the following directory structure:
::
.
├─── samples
| ├────── xxx.fastq.gz
| ├────── xxx.fastq.gz
| ├────── ...
| └────── xxx.fastq.gz
├─── chromosomes_names.tsv
├─── genome.fasta
└─── popmap.tsv
Generating a coverage table for the entire dataset
--------------------------------------------------
The first step of RADSex is to create a table containing the coverage of each marker in each individual for the entire dataset; a marker is defined as a non-polymorphic sequence (no mismatches or SNPs). This step is performed with the ``process`` command :
::
radsex process --input-dir samples --output-file coverage_table.tsv --threads 8 --min-cov 1
The ``--input-dir`` parameter specifies the location of the demultiplexed reads directory, which is *samples* in our case. The ``--output-file`` parameter specifies the location of the output file (*i.e.* the table of coverage), which is *coverage_table.tsv* here. This step can be parallelized using the ``--threads`` parameters, which we set to *8* in our example; you should adjust this value based on your computer's specifications. Finally, the ``--min-cov`` parameter specifies the minimum coverage for a marker to be considered present in an individual; if a marker has a coverage lower than the value of ``--min-cov`` in every individual, it will not be retained in the coverage table.
The resulting file *coverage_table.tsv* will be used as a base for all analyses implemented in ``radsex``, but it is not used for any ``radsex-vis`` plots. For more information about this file, check the :ref:`coverage-table-file` section.
.. note:: In most cases, we advise to keep the value of ``--min-cov`` to 1 in order to retain all the information from the dataset in this step. Filtering for minimum coverage should be done in the following analysis steps, and it will be easier to try several minimum coverage values this way. If you are certain that all individuals in your dataset have high coverage, and you do not plan to run analyses with a minimum coverage of 1, you can increase this threshold.
With our setup, using 8 cores, this step completed in **9 min 25 seconds** with a peak memory usage of **10.3 GB**. The resulting coverage table used 5.1 GB of disk space.
Computing the distribution of markers between sexes
---------------------------------------------------
The main analysis implemented in ``radsex`` computes a table summarizing the distribution of all markers between males and females. This analysis is performed with the ``distrib`` command:
::
radsex distrib --input-file coverage_table.tsv --output-file distribution.tsv --popmap-file popmap.tsv --min-cov 5``
The ``--input-file`` parameter specifies the location of the coverage table generated in the previous step, which was *coverage_table.tsv* in our case. The ``--output-file`` parameter specifies the location of the output file, *i.e* the table of distribution of markers between males and females, which is *distribution.tsv* here. The ``--popmap-file`` parameter specifies the location of the population map (see the :ref:`population-map` section for details), which we named *popmap.tsv* in this example. Finally, the ``--min-cov`` parameter specifies the minimum coverage to consider a marker present in an individual, and was set to *5* here.
With our setup, this step completed in **36 seconds** with a peak memory usage of **4 Mb**.
The resulting file *distribution.tsv* is a tabulated file described in the :ref:`population-map` section. This file can be visualized with ``radsex-vis`` using the ``plot_sex_distribution`` function:
::
radsexvis::plot_sex_distribution("distribution.tsv", output_file_path = "distribution.png")
To generate a basic plot, the only required parameter is the full path to a distribution table (simplified as "distribution.tsv" in this example). The ``output_file_path`` parameters specifies the path to an output file where the figure will be saved; if this parameter is not specified, the plot will be generated in the default R graphic device. For a full description of the ``plot_sex_distribution()`` function, including additional parameters, check the TODO_RADSEXVIS_SECTION.
The resulting figure is displayed below:
.. image:: ../example/figures/distribution.png
This figure is a tile plot with number of males on the x-axis and number of females on the y-axis. The color of a tile at coordinates (**x**, **y**) indicates the number of markers that were present in any **x** males and any **y** females. For instance, in this figure, there were between 25 and 99 markers found in 29 males (not necessarily always the same 29 males) and in 0 females. Tiles for which association with sex is significant (chi-squared test, using Bonferroni correction) are highlighted in red. Many markers found predominantly in males are significantly associated with sex, indicating that an XX/XY system determines sex in this species. Interestingly, there are no markers found in all males or all but one males and absent from all females, *i.e* no markers found at position (30, 0) and (31, 0).
With our setup, this step completed in **36 seconds** with a peak memory usage of **4 MB**.
Extracting markers significantly associated with sex
----------------------------------------------------
The ``signif`` command of RADSex extracts all markers for which association with sex is significant. In this case, these markers are the ones represented by the tiles highlighted in red in the previous figure. To extract all significant markers from our dataset, run the following command :
::
radsex signif --input-file coverage_table.tsv --output-file significant_markers.tsv --popmap-file popmap.tsv --min-cov 5
The ``--input-file`` parameter specifies the location of the coverage table generated in the ``process`` step, which was *coverage_table.tsv* in our case. The ``--output-file`` parameter specifies the location of the output file, in this case a subset of the table of coverage, which we named *significant_markers.tsv* here. The ``--popmap-file`` parameter specifies the location of the population map (see the xx section for details), which we named *popmap.tsv* in this example. Finally, the ``--min-cov`` parameter specifies the minimum coverage to consider a marker present in an individual, and was set to *5* to match the value used in the previous analysis.
The subset of coverage table generated by ``signif`` can be visualized with ``radsex-vis`` the ``plot_coverage()`` function :
::
radsexvis::plot_coverage("significant_markers.tsv", output_file_path = "significant_markers.png", popmap_file_path = "popmap.tsv")
To generate a basic plot, the only required parameter is the full path to the subset of coverage table (simplified as "significant_markers.tsv" in this example). The ``output_file_path`` parameters specifies the path to an output file where the figure will be saved; if this parameter is not specified, the plot will be generated in the default R graphic device. The ``popmap_file_path`` parameter can be specified to color males and females IDs in the resulting figure. For a full description of the ``plot_coverage()`` function, including additional parameters, check the TODO_RADSEXVIS_SECTION.
The resulting figure is displayed below:
.. image:: ../example/figures/significant_markers.png
This figure is a heatmap with individuals on the x-axis and markers on the y-axis. The color of a tile at coordinates (**x**, **y**) indicates the coverage of a marker **y** in individual **x**. Both individuals and markers can be clustered based on this coverage, and clustering dendrograms are displayed by default. If a popmap is specified, males and females IDs are colored differently. In this example, two males cluster with the females, in agreement with the results from ``distrib`` where male-specific markers were always missing from two males. These two males are actually genetic females whose sex was mis-assigned.
.. note:: For convenience, significant markers can be exported in FASTA format, using the parameter --output-format fasta. Headers contain information about the sex distribution of each marker, as described in the :ref:`fasta-file` section.
With our setup, this step completed in **37 seconds** with a peak memory usage of **6 MB**.
Mapping markers to a reference genome
-------------------------------------
When a reference genome is available, markers can be aligned to it in order to locate sex-differentiated regions. This is done using the ``map`` command:
::
radsex map --input-file coverage_table.tsv --output-file mapping_results.tsv --popmap-file popmap.tsv --genome-file genome.fasta --min-cov 5
The ``--input-file`` parameter specifies the location of the coverage table generated in the ``process`` step, which was *coverage_table.tsv* in our case. The ``--output-file`` parameter specifies the location of the output file, in this case a table with mapping information, which we named *mapping_results.tsv* here. The ``--popmap-file`` parameter specifies the location of the population map (see the xx section for details), which we named *popmap.tsv* in this example. The ``--genome-file`` parameter specifies the location of reference genome file in FASTA format, which we named *genome.fasta* in this example. Finally, the ``--min-cov`` parameter specifies the minimum coverage to consider a marker present in an individual, and was set to *5* to match the value used in the previous analysis.
The resulting file *mapping_results.tsv* is a tabulated file described in the :ref:`mapping-results-file` section. This file can be visualized with ``radsex-vis`` using the ``plot_genome()`` function:
::
radsexvis::plot_genome("mapping_results.tsv", "genome.fasta.lengths", chromosomes_names_file_path = "chromosomes_names.tsv", output_file_path = "mapping_genome.png")
To generate a basic plot, the only required parameters are the full path to the mapping results table (simplified as "mapping_results.tsv" in this example), and the full path to the genome contig lengths generated by ``map`` ("genome.fasta.lengths" here). The ``output_file_path`` parameters specifies the path to an output file where the figure will be saved; if this parameter is not specified, the plot will be generated in the default R graphic device. The ``chromosomes_names_file_path`` parameter can be specified to rename the chromosomes with chosen IDs specified in the file. For a full description of the ``plot_genome()`` function, including additional parameters, check the TODO_RADSEXVIS_SECTION.
The resulting figure is displayed below: