Commit 1a94b14a authored by psimion's avatar psimion
Browse files

readme major updates & fasta output bug corrected

parent dc23b0ce
......@@ -2,7 +2,7 @@
<img src="utils/misc/Logo.png" alt="Drawing" width="250"/>
</p>
CroCo : a program to detect and remove cross contaminations in assembled transcriptomes
CroCo : a program to detect and remove cross contamination in assembled transcriptomes
=================
#### **Paul Simion, Khalid Belkhir, Clémentine François, Julien Veyssier, Jochen Rink, Michaël Manuel, Hervé Philippe, Max Telford**
......@@ -19,9 +19,9 @@ CroCo : a program to detect and remove cross contaminations in assembled transcr
# Introduction
CroCo is a program to detect cross contaminations in assembled transcriptomes using sequencing reads to determine the true origin of every transcripts.
Such cross contaminations can be expected if several RNA-Seq experiments were prepared during the same period at the same lab, or by the same people, or if they were processed or sequenced by the same sequencing service company.
Our approach first determines a subset of transcripts that are suspiciously similar across samples using a BLAST procedure. CroCo then combine all transcriptomes into a metatranscriptome and quantifies the "expression level" of all transcripts successively using every sample read data (e.g. several species sequenced by the same lab for a particular study) while allowing read multi-mappings.
CroCo is a program to detect cross contamination events in assembled transcriptomes, using sequencing reads to determine the true origin of every transcript.
Such cross contaminations can be expected if several RNA-Seq experiments were prepared during the same period at the same lab, or by the same people, or if they were processed or sequenced by the same sequencing service facility.
Our approach first determines a subset of transcripts that are suspiciously similar across samples using a pairwise BLAST procedure. CroCo then combines all transcriptomes into a metatranscriptome and successively quantifies the "expression level" of all transcripts using each sample's read data (e.g. several species sequenced by the same lab for a particular study) while allowing read multi-mapping.
Several mapping tools implemented in CroCo can be used to estimate expression levels (the default is RapMap).
This information is then used to categorize each transcript into one of the following 5 categories:
......@@ -31,7 +31,7 @@ This information is then used to categorize each transcript in the following 5 c
- **low coverage**: expression levels are too low in all samples, preventing our procedure (which relies on differential expression) from confidently assigning it to any category.
- **over expressed**: expression levels are very high in at least 3 samples and CroCo will not try to categorize it. Such a pattern does not match expectations for cross contaminations, but often reflects highly conserved genes such as ribosomal genes, or external contaminations shared by several samples (e.g. *Escherichia coli* contaminations).
CroCo outputs cross contamination statistics, quantifications of suspect transcripts in all samples, optionally up to the five categorized transcriptome fasta files per sample, and optionally several graphical networks of ascertained cross contaminations and dubious cases.
CroCo outputs cross contamination statistics, transcript quantifications in all samples, optionally up to five categorized transcriptome fasta files per sample, and optionally several graphical networks of ascertained cross contamination events as well as of dubious cases.
<p align="center">
<img src="utils/misc/CroCo_categories_300p.png" alt="Drawing" width="400"/>
......@@ -40,42 +40,47 @@ CroCo outputs cross contamination statistics, quantifications of suspect transcr
---
**Important Note - NGS data type and quality**
**Important Note I - NGS data type and quality**
CroCo has been designed with RNA-Seq data in mind as our approach uses differential expression to detect cross contaminations and can only work with sequencing data that bear information on expression level.
Illumina data are therefore currently the best-suited type of RNA-Seq data for CroCo, but other types of transcriptomic data (such as 454 or PacBio) can also be used
in which case the user should adjust some of the parameters (see [Detailed options](#detailed-options))
Fundamentaly, the only requirement for CroCo to be effective is that sequence coverage must be correlated to the quantity of molecules in the original sample. In a near future, we will work on the use of our approach to detect cross contamination in genomic data.
CroCo has been designed with RNA-Seq data in mind, as our approach uses differential expression to detect cross contaminations and can only work with sequencing data that bear information regarding the quantity of molecules present in a sample. This is typically the case for transcriptomic data used to estimate transcript expression levels (but see below for the potential use of CroCo on genomic data). Illumina data are therefore currently the best-suited type of RNA-Seq data for CroCo, but other types of transcriptomic data (such as 454 or PacBio) can also be used, in which case the user might want to adjust some of the parameters used (see [Detailed options](#detailed-options)).
Additionally, and regardless of the sequencing technology used, the quality of the assembled transcriptomes used as input for CroCo is important to maximize the accuracy of our approach. For example, transcriptome redundancy such as natural transcript variants could create false positives, since the variants are likely to display different expression levels.
---
**Important Note - closely-related samples**
**Important Note II - closely-related samples**
CroCo's procedure make use of sequence similarity to detect potential cross contaminations events and to quantify their abundance.
As a result, our approach will loose power if too closely-related samples are analysed,
leading to a potential over-estimation of the number of cross contaminations, of dubious cases and of over-expressed cases.
We therefore recommend not to analyse samples with very low sequence divergence levels (e.g. 1% divergence), such as in the context of population studies or for tumoral vs. normal cell comparisons.
CroCo's procedure makes use of sequence similarity to detect potential cross contamination events and to quantify their abundance. As a result, our approach will lose power if too closely-related samples are analysed, leading to a potential over-estimation of the number of cross contamination events, of dubious cases and of over-expressed cases. We therefore recommend not analysing samples with very low sequence divergence levels (e.g. 1% divergence), such as in the context of population studies or of tumoral vs. normal cell comparisons.
Note, however, that even in such cases, the categorization of transcripts as *clean* will always be correct: you would only be discarding more transcripts than necessary.
---
**Important Note III - CroCo and genomic data**
Fundamentally, the only requirement for CroCo to be effective is that sequence coverage must be correlated with the quantity of molecules present in a given sample. Although CroCo's accuracy and efficiency have only been demonstrated on transcriptomic data in its original publication, some configurations clearly allow the use of CroCo on genomic data. Recently, some of our colleagues (ISEM, University of Montpellier, France) successfully used CroCo to detect cross contamination events in genomic coding sequences (CDS) of various lepidopteran species (unpublished results). Treating only genomic CDS makes their experimental setting somewhat similar to that of an analysis of transcriptomic data. They detected a single significant cross contamination event, which perfectly matched their prior expectation given preliminary results with that particular sample. These results confirmed that using CroCo on genomic data is at the very least sometimes possible and sound.
While more analyses will be needed in order to fully understand the limitations of CroCo when working with genomic data, here are some preliminary guidelines for interested users:
- For each scaffold, CroCo will compute a single "quantification" estimate (i.e. the average coverage normalized by scaffold length and by the total quantity of data mapped to the complete set of sequences). The scaffolds used should therefore correspond to genomic regions with relatively homogeneous coverage. This is why restricting the analysis to short genomic segments only (e.g. genes) might be necessary.
- As cross contamination events might have occurred prior to the PCR amplification of the various samples under study, PCR duplicates should likely be removed before using CroCo.
---
**Citation**
If you use CroCo in your work, please cite:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Simion et al. BMC Biology (2018) 16:28 DOI 10.1186/s12915-018-0486-7
<div class="page-break"></div>
# Table of content
1. [Quick installation guide](#quick-installation-guide)
2. [Installation & Requirements](#installation-&-requirements)
* [Install CroCo dependencies](#Install-CroCo-dependencies)
* [Using CroCo through Docker](#Using-CroCo-through-Docker)
2. [Installation and Requirements](#installation-and-requirements)
* [Install CroCo dependencies](#install-croco-dependencies)
* [Install CroCo through Docker](#install-croco-through-docker)
* [Optional requirements](#optional-requirements)
* [Installation tests](#installation-tests)
3. [Usage](#usage)
......@@ -152,6 +157,9 @@ As CroCo is a bash script, and therefore do not require to be installed. However
- Kallisto-0.43.0
- Rapmap-0.1.0
**Optional dependencies** :
- R-3.1.0 (see [Optional requirements](#optional-requirements) below)
The install script (`install_dependencies.sh`) lets you install all these dependencies automatically.
It can be used to install a single tool at a time (e.g. `--tool B`), or all of them at once (i.e. `--tool all`).
They will then be found automatically by CroCo.
......@@ -178,7 +186,7 @@ For more information on this necessary step, please see the following links :
**You are also free to install CroCo dependencies manually**, without using our install script. The only important thing is to make sure they are in the PATH when you launch CroCo. CroCo will first look for the tools within its own directory in `utils/bin`, then in the PATH.
## Using CroCo through Docker
## Install CroCo through Docker
Using Docker allows the creation of a CroCo image that contains all of its dependencies.
It is therefore unnecessary to install any dependency, but you are required to download and install Docker on your system (see [here](https://docs.docker.com)).
......@@ -197,6 +205,26 @@ You will then be able to directly launch the CroCo image you just built through
## Optional requirements
**Using R to generate a graphical network of cross contaminations**
Provided additional software is installed, CroCo can automatically output a dynamic visualization of the cross contamination network connecting the samples if the option `--graph` is set to `yes`.
This option will only work if the following tools are installed on your system:
- **R**
- R library package **visNetwork**
- R library package **igraph**
Within R, the two libraries can be installed as follows:
```R
install.packages("visNetwork")
install.packages("igraph")
```
In order to successfully install the *igraph* package, R version 3.1.0 or more recent is needed.
If R and these packages are not installed, no worries! You will find among CroCo's outputs a folder named `network_info` that contains all the files you need (called *NODES* and *LINKS*) to build your own graphical network using other tools such as DiagrammeR or Gephi.
---
**Adding CroCo in your PATH**
It is good practice to add CroCo's location to your PATH, which allows you to use CroCo from any location on your system. This is done with the following command:
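For example, a minimal sketch of the PATH addition (the directory `$HOME/tools/CroCo` is a hypothetical install location; replace it with the actual path where you placed CroCo):

```bash
# Hypothetical install location; adjust to your actual CroCo directory
CROCO_DIR="$HOME/tools/CroCo"
export PATH="$PATH:$CROCO_DIR"
# To make the change permanent, append the export line to your shell profile, e.g.:
# echo 'export PATH="$PATH:$HOME/tools/CroCo"' >> ~/.bashrc
```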
......@@ -221,26 +249,6 @@ bash /my/complete/path/to/CroCo/directory/CroCo_v0.1.sh --mode p --tool B
---
**Using R to generate a graphical network of cross contaminations**
Provided additional softwares install, CroCo can automatically outputs a dynamic visualization of the cross contamination network connecting the samples if option `--graph` is set to `yes`.
This option will only work if the following tools are installed on your system :
- **R**
- R library package **visNetwork**
- R library package **igraph**
Within R, the two libraries can be installed as follows:
```bash
install.packages("visNetwork")
install.packages("igraph")
```
In order to successfully install the *igraph* package, R version 3.1.0 or more recent is needed.
If R and these packages are not installed, again, no worries ! You will find among CroCo's outputs a folder named `network_info` that contains every files you need (called *NODES* and *LINKS*) to build your own graphical network using other tools such as DiagrammeR or Gephi.
---
## Installation tests
You can check whether your installation of CroCo is working by starting an analysis on a provided small example dataset (`CroCo_dataset_test`) that conforms to the input prerequisites.
......@@ -296,7 +304,7 @@ CroCo will create a directory containing all results which will be placed within
```bash
Usage :
CroCo_v0.1.sh [--mode p|u] [--tool B|B2|K|R|S] [--fold-threshold INT] [--minimum-coverage FLOAT] [--threads INT] [--output-prefix STR] [--output-level 1|2|3] [--graph yes|no] [--trim5 INT] [--trim3 INT] [--frag-length FLOAT] [--frag-sd FLOAT] [--suspect-id INT] [--suspect-len INT] [--add-option STR] [--recat STR]
CroCo_v0.1.sh [--mode p|u] [--tool B|B2|K|R|S] [--fold-threshold INT] [--minimum-coverage FLOAT] [--threads INT] [--output-prefix STR] [--output-level 1|2] [--graph yes|no] [--trim5 INT] [--trim3 INT] [--frag-length FLOAT] [--frag-sd FLOAT] [--suspect-id INT] [--suspect-len INT] [--add-option STR] [--recat STR]
--mode p|u :\t\t\t'p' for paired and 'u' for unpaired (default : 'p') [short: -m]
--in STR :\t\t\tName of the directory containing the input files to be analyzed (DEFAULT : working directory) [short: -i]
......@@ -306,7 +314,7 @@ CroCo_v0.1.sh [--mode p|u] [--tool B|B2|K|R|S] [--fold-threshold INT] [--minimum
--overexp FLOAT :\t\t\tTPM value (DEFAULT : 300) [short: -d]
--threads INT :\t\t\tNumber of threads to use (DEFAULT : 1) [short: -n]
--output-prefix STR :\t\tPrefix of output directory that will be created (DEFAULT : empty) [short: -p]
--output-level 1|2|3 :\t\tSelect the fasta files to output. '1' for none, '2' for clean and lowcov, '3' for all (DEFAULT : 2) [short: -l]
--output-level 1|2 :\t\tSelect whether or not to output fasta files. '1' for none, '2' for all (DEFAULT : 2) [short: -l]
--graph yes|no :\t\tProduce graphical output using R (DEFAULT : no) [short: -g]
--add-option 'STR' :\t\tThis text string will be understood as additional options for the mapper/quantifier used (DEFAULT : empty) [short: -a]
--recat SRT :\t\t\tName of a previous CroCo output directory you wish to use to re-categorize transcripts (DEFAULT : no) [short: -r]
......@@ -324,10 +332,10 @@ Minimal working example :
CroCo_v0.1.sh --mode p 2>&1 | tee log_file
Exhaustive example :
CroCo_v0.1.sh --mode p --in data_folder_name --tool B --fold-threshold 2 --minimum-coverage 0.2 --overexp 300 --threads 8 --output-prefix test1_ --output-level 2 --graph yes --add-option '-v 0' --trim5 0 --trim3 0 --suspect-id 95 --suspect-len 40 --recat no 2>&1 | tee log_file
CroCo_v0.1.sh --mode p --in data_folder_name --tool R --fold-threshold 2 --minimum-coverage 0.2 --overexp 300 --threads 8 --output-prefix test1_ --output-level 2 --graph yes --add-option '-v 0' --trim5 0 --trim3 0 --suspect-id 95 --suspect-len 40 --recat no 2>&1 | tee log_file
Exhaustive example using shortcuts :
CroCo_v0.1.sh -m p -i data_folder_name -t B -f 2 -c 0.2 -d 300 -n 8 -p test1_ -l 2 -g yes -a '-v 0' -x 0 -y 0 -s 95 -w 40 -r no 2>&1 | tee log_file
CroCo_v0.1.sh -m p -i data_folder_name -t R -f 2 -c 0.2 -d 300 -n 8 -p test1_ -l 2 -g yes -a '-v 0' -x 0 -y 0 -s 95 -w 40 -r no 2>&1 | tee log_file
Example for re-categorizing previous CroCo results
CroCo_v0.1.sh -i data_folder_name -r previous_CroCo_results_folder_name -f 10 -c 0.5 -g yes 2>&1 | tee log_file
......@@ -381,39 +389,36 @@ All output files will be placed within an output directory created by CroCo. Its
**Detailed results**
For every sample, a `*.all` file is created. Its contains the quantification of the expression of each transcript in every sample included in the analysis, as well as the expression log2fold change between the expected transcript source and the most highly expressed unexpected source. It also contains the category attributed to every transcript.
For every sample, one `*.all` file is created. It contains the quantification of the expression of each suspect transcript in every sample included in the analysis, as well as the expression log2fold change between the expected transcript source and the most highly expressed unexpected source. It also contains the category attributed to every transcript (last column). Note that the `utility_files_CroCo/*.all_quants` files contain the quantification of expression levels for both suspect and unsuspected transcripts.
---
**Summary statistics**
Two files will also be created by CroCo:
- a `CroCo_summary` file which contains statistics on transcripts categorization for each samples
- a `CroCo_profiles` file which contains all sorted log2fold change values (computed between focal and most highly expressed alien samples), allowing for a raw visualization of the distribution and strengh of their cross contamination
- a `CroCo_summary` file, the main output file, containing various statistics on transcript categorization for each sample
- a `CroCo_profiles` file which contains all sorted log2fold change values (computed between focal and most highly expressed alien samples). This file can be used to plot the log2fold change curves and thus visualize the distribution and certainty level of cross contamination events, as well as to potentially refine threshold values.
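As an illustration, such a file of log2fold change values can be screened against a given `--fold-threshold` directly from the shell. The sketch below fabricates a toy stand-in file (the real `CroCo_profiles` is produced by CroCo, and its exact layout may differ; one numeric value per line is assumed here):

```bash
# Toy stand-in for a CroCo_profiles file: one log2fold change value per line
printf '6.4\n3.1\n1.2\n0.3\n' > CroCo_profiles_toy
# Count how many values exceed the fold threshold used in the run (here 2)
awk -v fold=2 '($1 + 0) > (fold + 0) {n++} END{print n + 0}' CroCo_profiles_toy   # prints 2
```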
---
**Categorized transcriptomes**
Up to five fasta files corresponding to the five transcript categories will be created (depending on the parameter value set for the `--output-level` option, which by default output none of these files).
By default, five fasta files corresponding to the five transcript categories will be created (this is controlled by the `--output-level` option, which can be set to `1` if you are not interested in the fasta files).
Recommendations for downstream analyses:
- ONLY use the *clean* transcripts for direct qualitative analyses, such as phylogenetic inferences or presence/absence observations. There is a slight risk of discarding some correct data, but you will not sacrifice rigour.
- use both *clean* and *low coverage* transcripts if downstream analyses can adequately detect and circumvent potential cross contaminations (e.g. with a refined orthology assignment step).
- If necessary, scavenge transcripts from *dubious* category on a case-by-case basis, always checking for their true origin (e.g. BLASTing them on NCBI non-redundant nucleotidic database is a good start). If still in doubt, discard them.
- if necessary, scavenge transcripts from the *dubious* category on a case-by-case basis, always checking their true origin (e.g. BLASTing them against the NCBI non-redundant nucleotide database is a good start). If still in doubt, discard them.
- use the *overexpressed* category on a case-by-case basis: these transcripts are strongly expressed in several samples, which means they might stem from highly conserved genes whose exact taxonomic origin may not be trivial to determine. They could also come from an external contamination shared by several samples. Users might want to evaluate these transcripts with other tools, such as [Busco](http://busco.ezlab.org/). Note that if you only analyze two samples, no transcript will ever be categorized as *overexpressed*, as this category requires the transcript to be highly expressed in at least three samples.
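Following the second recommendation above, the *clean* and *low coverage* outputs of a sample can simply be pooled for downstream analyses. A sketch with toy fasta files (the file names are illustrative, not CroCo's exact output names):

```bash
# Toy fasta files standing in for CroCo's per-sample categorized outputs
printf '>t1\nACGT\n' > sample1_clean.fasta
printf '>t2\nTTGA\n' > sample1_lowcov.fasta
# Pool both categories for analyses that can handle residual cross contamination
cat sample1_clean.fasta sample1_lowcov.fasta > sample1_filtered.fasta
grep -c '^>' sample1_filtered.fasta   # number of transcripts kept; prints 2
```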
---
**Graphical networks**
If the option `--graph` is set to `yes`, CroCo will use R to create 4 graphical networks allowing for a nice overview of cross contamination preferential patterns. Two of them provide a view of cross contamination patterns, and two of them provide the same view for dubious contamination cases. In these networks, the size of the nodes represents the number of time the transcriptome contaminates other samples, their color represents the percentage of the transcriptome that is cross contaminated, and the links represent the number of contaminant transcripts. These sizes and colors are relative, and can not be compared across CroCo analyses.
If the option `--graph` is set to `yes`, CroCo will use R to create 4 graphical networks providing a convenient overview of preferential cross contamination patterns. Two of them provide a view of cross contamination patterns, and the other two provide the same view for dubious contamination cases. In these networks, the size of a node represents the number of times this sample contaminated others, the color of a node represents the percentage of this transcriptome that is cross contaminated, and the links represent the number of cross contaminating transcripts. Note that these sizes and colors are relative, and cannot be compared across CroCo analyses!
This option requires a working installation of R and the installation of two R libraries : *visNetwork* and *igraph*.
A first network corresponds to the exhaustive cross contamination patterns (i.e. *network_complete.html*)
while a second one is an arbitrarily simplified version of the same network in which all links representing less than 2% of the biggest link are ignored (i.e. *network_simplified.html*).
This simplification is useful for visualizing large networks when their is strong discrepancies between cross contamination links.
The other two networks created are the conterparts of those described above for dubious cross contamination cases (i.e. *network_dubious_complete.html* and *network_dubious_simplified.html*).
A first network corresponds to the exhaustive cross contamination patterns (i.e. *network_complete.html*), while a second one is an arbitrarily simplified version of the same network in which all links representing less than 2% of the biggest link are ignored (i.e. *network_simplified.html*). This simplification is useful for visualizing large networks when there are strong discrepancies between cross contamination links. The other two networks are the counterparts of those described above for dubious cross contamination cases (i.e. *network_dubious_complete.html* and *network_dubious_simplified.html*).
To view these dynamic networks, simply open the corresponding html file with any internet browser (e.g. Firefox, Chromium). The network nodes can be selected (which highlights them and their associated cross contaminations) and moved around as well. Here is an example command line to open a network with Firefox:
......@@ -421,9 +426,7 @@ To view these dynamic networks, simply open the corresponding html file with any
firefox network_simplified.html &
```
Please note that if you want to move these html files elsewhere (e.g. to store CroCo results),
you'll also need to move along their associated folders,
respectively named *network_complete_files*, *network_simplified_files*, *network_dubious_complete_files* and *network_dubious_simplified_files* as the html files need them to display the networks.
Please note that if you want to move these html files (e.g. to store CroCo results elsewhere), you will also need to move their associated folders, respectively named *network_complete_files*, *network_simplified_files*, *network_dubious_complete_files* and *network_dubious_simplified_files*, as the html files need them to display the networks!
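A sketch of such a move, using toy placeholders for one of the html files and its assets folder (a real run produces these names in the output directory):

```bash
# Toy placeholders mimicking one of CroCo's graphical outputs
workdir=$(mktemp -d) && cd "$workdir"
touch network_simplified.html
mkdir -p network_simplified_files
# Always move the html file together with its assets folder
mkdir -p archive
mv network_simplified.html network_simplified_files archive/
```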
------
......@@ -520,7 +523,7 @@ For conveniency, we suggest to add an `_` at the end of the string you would lik
**--output-level** (-l)
This allows you to choose the quantity of fasta files that will be created : none (`1`); clean and low coverage transcriptomes only (`2`, default); clean, contam, dubious, low coverage and overexpressed transcriptomes (`3`).
This allows you to choose whether fasta files will be created: none (`1`), or all of them (`2`, default), i.e. the clean, contam, dubious, low coverage and overexpressed transcriptomes.
---
......@@ -528,9 +531,9 @@ This allows you to choose the quantity of fasta files that will be created : non
Setting this parameter to `yes` (default is `no`) will enable the automated rendering of a graphical cross contamination network in a dynamic html file.
IMPORTANT : this option requires a working install of *R* with the two following libraries already installed : *visNetwork* and *igraph*.
IMPORTANT : this option requires a working install of *R* with the two following R libraries already installed : *visNetwork* and *igraph*.
In case you would like to see the cross contamination network of a previous CroCo analysis you ran with the `--graph` option set to `no`, you can use the `--recat` option and re-use your previous results with the same parameters as before, this time setting `--graph` to `yes`.
In case you would like to see the cross contamination network of a previous CroCo analysis you ran without having activated the `--graph` option, no worries: you can use the `--recat` option and re-use your previous results with the same parameters as before, this time setting `--graph` to `yes`.
---
......@@ -540,12 +543,11 @@ This parameter allows experimented users to specify additional options to the ma
This makes it possible to change these tools' default parameter values according to your type of data, in order to improve mapping/quantification efficiency.
*Example*:
Assuming, you want to increase Bowtie precision by only allowing a maximum of 1 mismatch per alignment :
Assuming you want to increase Bowtie precision by only allowing a maximum of 1 mismatch per alignment:
```bash
bash CroCo_v0.1.sh --mode u --tool B --add-option '-v 1'
```
---
**--recat** (-r)
......
No preview for this file type
......@@ -107,19 +107,19 @@ if [ $RECAT == "no" ]; then
for (i in ctg) totRPK += ctg[i]/ctgsize[i];
for (i in ctg) {if (totRPK > 0) {print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000 } else {print i,"0"}} }' > $fileout
;;
B2) if [ $MODE == "u" ]
then
fastq=" -U "$INDIR"/"${fasta_array[$k]}".fastq"
else
fastq=" -1 "$INDIR"/"${fasta_array[$k]}".L.fastq -2 "$INDIR"/"${fasta_array[$k]}".R.fastq"
fi
bowtie2 -p $PROCESSORS --no-unal --no-head $ADDOPT -a --quiet --omit-sec-seq --trim5 $TRIM5 --trim3 $TRIM3 -x $out/$toolidx/$toolidx $fastq | grep -v "@" | cut -f3 | \
awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0; ctgsize[a[1]]= a[2];}; close(refseqs) } {ctg[$1]++}
END{
print "Contig", reads;
for (i in ctg) totRPK += ctg[i]/ctgsize[i];
for (i in ctg) {if (totRPK > 0) {print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000 } else {print i,"0"}} }' > $fileout
;;
# B2) if [ $MODE == "u" ]
# then
# fastq=" -U "$INDIR"/"${fasta_array[$k]}".fastq"
# else
# fastq=" -1 "$INDIR"/"${fasta_array[$k]}".L.fastq -2 "$INDIR"/"${fasta_array[$k]}".R.fastq"
# fi
# bowtie2 -p $PROCESSORS --no-unal --no-head $ADDOPT -a --quiet --omit-sec-seq --trim5 $TRIM5 --trim3 $TRIM3 -x $out/$toolidx/$toolidx $fastq | grep -v "@" | cut -f3 | \
# awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0; ctgsize[a[1]]= a[2];}; close(refseqs) } {ctg[$1]++}
# END{
# print "Contig", reads;
# for (i in ctg) totRPK += ctg[i]/ctgsize[i];
# for (i in ctg) {if (totRPK > 0) {print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000 } else {print i,"0"}} }' > $fileout
# ;;
K) if [ $MODE == "u" ]
then
if [ $FRAGLENGTH == 'none' ] || [ $FRAGSD == 'none' ]; then
......@@ -135,21 +135,21 @@ if [ $RECAT == "no" ]; then
awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0;}; close(refseqs) }
{ if(NR>1) ctg[$1] = $5 } END{print "Contig", reads; for (i in ctg) print i, ctg[i]}' $fileout.quant/abundance.tsv > $fileout
;;
S) if [ $MODE == "u" ]
then
if [ $FRAGLENGTH == 'none' ] || [ $FRAGSD == 'none' ]; then
echo -e "\nWarning : When using unpaired data with Salmon, you might need to specify mean fragment length and fragment length standard deviation (--frag-length and --frag-sd options)"
fastq=" --fldMean $FRAGLENGTH --fldSD $FRAGSD -l U -r "$INDIR"/"${fasta_array[$k]}".fastq"
else
fastq=" --fldMean $FRAGLENGTH --fldSD $FRAGSD -l U -r "$INDIR"/"${fasta_array[$k]}".fastq"
fi
else
fastq=" -l IU -1 "$INDIR"/"${fasta_array[$k]}".L.fastq -2 "$INDIR"/"${fasta_array[$k]}".R.fastq" #see http://salmon.readthedocs.org/en/latest/library_type.html#fraglibtype
fi
salmon --no-version-check quant $ADDOPT --threads $PROCESSORS -i $out/$toolidx -o $fileout.quant $fastq ;
awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0;}; close(refseqs) }
{ if(NR > 11) ctg[$1] = $3 } END{print "Contig", reads; for (i in ctg) print i, ctg[i]}' $fileout.quant/quant.sf > $fileout
;;
# S) if [ $MODE == "u" ]
# then
# if [ $FRAGLENGTH == 'none' ] || [ $FRAGSD == 'none' ]; then
# echo -e "\nWarning : When using unpaired data with Salmon, you might need to specify mean fragment length and fragment length standard deviation (--frag-length and --frag-sd options)"
# fastq=" --fldMean $FRAGLENGTH --fldSD $FRAGSD -l U -r "$INDIR"/"${fasta_array[$k]}".fastq"
# else
# fastq=" --fldMean $FRAGLENGTH --fldSD $FRAGSD -l U -r "$INDIR"/"${fasta_array[$k]}".fastq"
# fi
# else
# fastq=" -l IU -1 "$INDIR"/"${fasta_array[$k]}".L.fastq -2 "$INDIR"/"${fasta_array[$k]}".R.fastq" #see http://salmon.readthedocs.org/en/latest/library_type.html#fraglibtype
# fi
# salmon --no-version-check quant $ADDOPT --threads $PROCESSORS -i $out/$toolidx -o $fileout.quant $fastq ;
# awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0;}; close(refseqs) }
# { if(NR > 11) ctg[$1] = $3 } END{print "Contig", reads; for (i in ctg) print i, ctg[i]}' $fileout.quant/quant.sf > $fileout
# ;;
R) if [ $MODE == "u" ]
then
fastq=" -r "$INDIR"/"${fasta_array[$k]}".fastq"
......@@ -163,29 +163,26 @@ if [ $RECAT == "no" ]; then
for (i in ctg) totRPK += ctg[i]/ctgsize[i];
for (i in ctg) {if (totRPK > 0) {print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000 } else {print i,"0"}} }' > $fileout
;;
H) if [ $MODE == "u" ]
then
fastq=" -fq="$INDIR"/"${fasta_array[$k]}".fastq"
else
fastq=" --fq="$INDIR"/"${fasta_array[$k]}".L.fastq --fq2="$INDIR"/"${fasta_array[$k]}".R.fastq"
fi
mkdir $fileout.hpg;
hpg-aligner dna --report-n-best=1 $ADDOPT --cpu-threads=$PROCESSORS -i=$out/$toolidx $fastq -o $fileout.hpg
awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0; ctgsize[a[1]]= a[2];}; close(refseqs) } {ctg[$1]++}
END{
print "Contig", reads;
for (i in ctg) totRPK += ctg[i]/ctgsize[i];
for (i in ctg) {if (totRPK > 0) {print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000 } else {print i,"0"}} }' $fileout.hpg > $fileout
;;
# H) if [ $MODE == "u" ]
# then
# fastq=" -fq="$INDIR"/"${fasta_array[$k]}".fastq"
# else
# fastq=" --fq="$INDIR"/"${fasta_array[$k]}".L.fastq --fq2="$INDIR"/"${fasta_array[$k]}".R.fastq"
# fi
# mkdir $fileout.hpg;
# hpg-aligner dna --report-n-best=1 $ADDOPT --cpu-threads=$PROCESSORS -i=$out/$toolidx $fastq -o $fileout.hpg
# awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0; ctgsize[a[1]]= a[2];}; close(refseqs) } {ctg[$1]++}
# END{
# print "Contig", reads;
# for (i in ctg) totRPK += ctg[i]/ctgsize[i];
# for (i in ctg) {if (totRPK > 0) {print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000 } else {print i,"0"}} }' $fileout.hpg > $fileout
# ;;
esac
if [ -f $finalout ]
then
#join -t $'\t' --header $finalout $fileout > $finalout'.tmp'
# paste the new sample's column, then drop the duplicated "Contig" key column (field NF-1)
paste $finalout $fileout | awk 'BEGIN{FS="\t"}{for (i=1; i<=NF-1; i++) if(i!=NF-1) {printf $i FS};{print $NF}}' > $finalout'.tmp'
# => K = correct order for everyone
# => B = the *.out files are in the correct order
mv $finalout'.tmp' $finalout
else
cat $fileout > $finalout
@@ -289,7 +286,7 @@ if [ $RECAT == "no" ]; then
### RE-CATEGORIZATION ONLY
elif [ $RECAT != "no" ]; then
# preparing re-categorization
i=0
for allfile in $RECAT/*.all ; do
fasta_array[$i]=`basename $allfile .all`
@@ -301,12 +298,12 @@ elif [ $RECAT != "no" ]; then
i=$(( i + 1 ))
done
# re-categorizing transcripts (clean, contam, dubious, lowcov, overexp)
for (( j=0; j <i; j++ )); do
ref=${fasta_array[$j]};
refseqs=$out/$ref".ctgs"
finalout=$out/$ref".all"
echo -e "Re-categorizing $ref transcripts"
# re-categorizing transcripts (clean, contam, dubious, lowcov, overexp)
awk -v outDir=$out -v ref=$ref -v col=$j -v fold=$FOLD -v mincov=$MINCOV -v overexp=$OVEREXP 'BEGIN{OFS="\t";col=col+2;}
{ if (NR == 1) {
print $0, "MaxOtherSpCov", "log2FoldChange", "Status" > outDir"/"ref".tmp"
#!/bin/bash
# This tool will analyse all the .fasta files in the current directory
# The read files must be in the same directory and must follow this naming convention:
# FOR paired end: NAME.fasta, NAME.L.fastq, NAME.R.fastq
# FOR unpaired : NAME.fasta, NAME.fastq
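As a sketch of the expected layout (the sample names `Hsap` and `Mmus` are hypothetical), an input directory mixing one paired-end and one unpaired sample would look like this, with sample names derived the same way the script does below (basename of each `.fasta`):

```shell
# Hypothetical layout: one paired-end sample (Hsap) and one unpaired sample
# (Mmus); names are illustrative only.
indir=$(mktemp -d)
touch "$indir/Hsap.fasta" "$indir/Hsap.L.fastq" "$indir/Hsap.R.fastq"  # paired end
touch "$indir/Mmus.fasta" "$indir/Mmus.fastq"                          # unpaired
# Sample names = basename of each .fasta file, as in the main loop below
for fasta in "$indir"/*.fasta ; do
  basename "$fasta" .fasta
done
# prints:
# Hsap
# Mmus
rm -r "$indir"
```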
my_dir="$(dirname "$0")"
source "$my_dir/path_management.sh"
source "$my_dir/MacLin_management.sh"
############################ DEFAULT VALUES ############################
source "$my_dir/def_values.sh"
############################ CROCO USAGE ############################
source "$my_dir/parse_params.sh"
# summary of settings
source "$my_dir/print_settings.sh"
# setting output directory
if [ $RECAT == "no" ]; then
out="$INDIR/${OUTPUTPREFIX}CroCo-"$TOOL"-id"$SUSPID"-len"$SUSPLEN"-fold"$FOLD"-mincov"$MINCOV"-"`date +"%Y_%m_%d-%I_%M"`
echo "Output directory : $out"
echo
mkdir $out
elif [ $RECAT != "yes" ]; then
out=$INDIR/${OUTPUTPREFIX}CroCo"_RECAT-fold"$FOLD"-mincov"$MINCOV"-"`date +"%Y_%m_%d-%I_%M"`
echo "Output directory : $out"
echo
mkdir $out
fi
############################ PROGRAM BEGINS HERE ############################
START_TIME=$SECONDS
### run BLAST, mapping, and quantification if recat function is off (default)
if [ $RECAT == "no" ]; then
### setting sample names
i=0
for fasta in $INDIR/*.fasta ; do
fasta_array[$i]=`basename $fasta .fasta`
i=$(( i + 1 ))
done
### detecting suspect transcripts with BLAST
source "$my_dir/detect_suspect_trans_with_blast.sh"
### intermediate timing
ELAPSED_TIME=$(($SECONDS - $START_TIME))
h=$(($ELAPSED_TIME/3600))
m=$((($ELAPSED_TIME%3600)/60))
s=$(($ELAPSED_TIME%60))
echo -e "\nExecution time for detecting suspect transcripts : $h h $m m $s s"
echo
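The elapsed-time arithmetic above can be checked with a fixed value:

```shell
# Worked example of the timing arithmetic above, with a fixed elapsed time
ELAPSED_TIME=3725                  # 3725 s = 1 h 2 m 5 s
h=$(($ELAPSED_TIME/3600))
m=$((($ELAPSED_TIME%3600)/60))
s=$(($ELAPSED_TIME%60))
echo "$h h $m m $s s"              # prints: 1 h 2 m 5 s
```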
START_TIME2=$SECONDS
### index contigs for the selected tool
case "$TOOL" in
B) toolidx="ALL_transcripts_bowtie_index"; if [ ! -d $out/$toolidx ]; then mkdir $out/$toolidx; bowtie-build --offrate 3 $out/ALL_transcripts.fasta $out/$toolidx/$toolidx ;fi ;;
B2) toolidx="ALL_transcripts_bowtie2_index"; if [ ! -d $out/$toolidx ]; then mkdir $out/$toolidx; bowtie2-build --offrate 3 $out/ALL_transcripts.fasta $out/$toolidx/$toolidx ;fi ;;
K) toolidx="ALL_transcripts_kallisto_index"; if [ ! -f $out/$toolidx ]; then kallisto index -i $out/$toolidx $out/ALL_transcripts.fasta; fi ;;
S) toolidx="ALL_transcripts_salmon_index" ; if [ ! -d $out/$toolidx ]; then salmon --no-version-check index -t $out/ALL_transcripts.fasta -i $out/$toolidx; fi ;;
R) toolidx="ALL_transcripts_rapmap_index" ; if [ ! -d $out/$toolidx ]; then rapmap quasiindex -t $out/ALL_transcripts.fasta -p -x $PROCESSORS -i $out/$toolidx; fi ;;
H) toolidx="ALL_transcripts_hpg_index" ; if [ ! -d $out/$toolidx ]; then mkdir $out/$toolidx; hpg-aligner build-sa-index -g $out/ALL_transcripts.fasta -i $out/$toolidx; fi ;;
esac
echo -e "\nIndex built : $out/$toolidx\n"
for (( j=0; j <i; j++ ))
do
ref=${fasta_array[$j]};
refseqs=$out/$ref".ctgs"
echo -e "Getting lengths of $ref transcripts"
awk -v ref=$ref -v refseqs=$refseqs 'BEGIN{RS=">"; FS="\t"}
NR>1 {
sub("\n","\t",$0);
gsub("\n","",$0);
ident = split($1,a," ") # N.B. keep only the sequence identifier (drop all text after the first space)
Seqlength = length($2) ;
print a[1]"\t"Seqlength > refseqs
}' $out/$ref".fasta_mod"
done
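A condensed rendition of the length-extraction awk above, run on a two-record inline FASTA (identifiers and sequences are made up):

```shell
# With RS=">" each FASTA record becomes one awk record: the header line becomes
# field 1, the concatenated sequence field 2; only the identifier before the
# first space is kept.
printf '>ctg1 some description\nATGC\nATGC\n>ctg2\nAAA\n' | \
awk 'BEGIN{RS=">"; FS="\t"}
NR>1 {
  sub("\n","\t",$0);   # separate header line from sequence
  gsub("\n","",$0);    # join a multi-line sequence into one field
  split($1,a," ");
  print a[1]"\t"length($2)
}'
# prints:
# ctg1	8
# ctg2	3
```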
# regroup all samples
refseqALL=$out/ALL_transcripts.ctgs
cat $out/*.ctgs > $refseqALL
# successively map every sample's read set onto all transcripts
for (( k=0; k <i; k++ ))
do
reads=${fasta_array[$k]}"_reads"
fileout=$out/${fasta_array[$k]}"_vs_ALL.out"
finalout=$out/"ALL_transcripts.all"
echo -e "\nMapping ${fasta_array[$k]} reads"
# calculate expression level with selected tool
case "$TOOL" in
B) if [ $MODE == "u" ]
then
fastq=$INDIR"/"${fasta_array[$k]}".fastq"
else
fastq=" -1 "$INDIR"/"${fasta_array[$k]}".L.fastq -2 "$INDIR"/"${fasta_array[$k]}".R.fastq"
fi
bowtie -p $PROCESSORS $ADDOPT -a --trim5 $TRIM5 --trim3 $TRIM3 --chunkmbs 2000 --suppress 1,2,4,5,6,7,8 $out/$toolidx/$toolidx $fastq | \
awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0; ctgsize[a[1]]= a[2];}; close(refseqs) } {ctg[$1]++}
END{
print "Contig", reads;
for (i in ctg) totRPK += ctg[i]/ctgsize[i];
for (i in ctg) {if (totRPK > 0) {print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000 } else {print i,"0"}} }' > $fileout
;;
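The awk END block above is a TPM-style normalization: each contig's read count is divided by its length, and the resulting per-contig values are rescaled so they sum to one million. A self-contained sketch with made-up counts and lengths (contig names `ctg1`/`ctg2` are illustrative):

```shell
# 30 reads on a 1000 bp contig (RPK 0.03) and 10 reads on a 500 bp contig
# (RPK 0.02): totRPK = 0.05, so the contigs get 60% and 40% of 1e6.
# awk's for-in order is unspecified, hence the sort.
awk 'BEGIN{OFS="\t"
  ctg["ctg1"]=30; ctgsize["ctg1"]=1000
  ctg["ctg2"]=10; ctgsize["ctg2"]=500
  for (i in ctg) totRPK += ctg[i]/ctgsize[i]
  for (i in ctg) print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000
}' | sort
# prints:
# ctg1	600000
# ctg2	400000
```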
B2) if [ $MODE == "u" ]
then
fastq=" -U "$INDIR"/"${fasta_array[$k]}".fastq"
else
fastq=" -1 "$INDIR"/"${fasta_array[$k]}".L.fastq -2 "$INDIR"/"${fasta_array[$k]}".R.fastq"
fi
bowtie2 -p $PROCESSORS --no-unal --no-head $ADDOPT -a --quiet --omit-sec-seq --trim5 $TRIM5 --trim3 $TRIM3 -x $out/$toolidx/$toolidx $fastq | grep -v "@" | cut -f3 | \
awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0; ctgsize[a[1]]= a[2];}; close(refseqs) } {ctg[$1]++}
END{
print "Contig", reads;
for (i in ctg) totRPK += ctg[i]/ctgsize[i];
for (i in ctg) {if (totRPK > 0) {print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000 } else {print i,"0"}} }' > $fileout
;;
K) if [ $MODE == "u" ]
then
if [ $FRAGLENGTH == 'none' ] || [ $FRAGSD == 'none' ]; then
echo -e "\nWarning: when using unpaired data with Kallisto, you must specify the mean fragment length and its standard deviation (--frag-length and --frag-sd options)"
fastq=" --single -l $FRAGLENGTH -s $FRAGSD "$INDIR"/"${fasta_array[$k]}".fastq"
else
fastq=" --single -l $FRAGLENGTH -s $FRAGSD "$INDIR"/"${fasta_array[$k]}".fastq"
fi
else
fastq=$INDIR"/"${fasta_array[$k]}".L.fastq "$INDIR"/"${fasta_array[$k]}".R.fastq"
fi
kallisto quant $ADDOPT --threads=$PROCESSORS -i $out/$toolidx -o $fileout.quant $fastq ;
awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0;}; close(refseqs) }
{ if(NR>1) ctg[$1] = $5 } END{print "Contig", reads; for (i in ctg) print i, ctg[i]}' $fileout.quant/abundance.tsv > $fileout
;;
S) if [ $MODE == "u" ]
then
if [ $FRAGLENGTH == 'none' ] || [ $FRAGSD == 'none' ]; then
echo -e "\nWarning: when using unpaired data with Salmon, you may need to specify the mean fragment length and its standard deviation (--frag-length and --frag-sd options)"
fastq=" --fldMean $FRAGLENGTH --fldSD $FRAGSD -l U -r "$INDIR"/"${fasta_array[$k]}".fastq"
else
fastq=" --fldMean $FRAGLENGTH --fldSD $FRAGSD -l U -r "$INDIR"/"${fasta_array[$k]}".fastq"
fi
else
fastq=" -l IU -1 "$INDIR"/"${fasta_array[$k]}".L.fastq -2 "$INDIR"/"${fasta_array[$k]}".R.fastq" #see http://salmon.readthedocs.org/en/latest/library_type.html#fraglibtype
fi
salmon --no-version-check quant $ADDOPT --threads $PROCESSORS -i $out/$toolidx -o $fileout.quant $fastq ;
awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0;}; close(refseqs) }
{ if(NR > 11) ctg[$1] = $3 } END{print "Contig", reads; for (i in ctg) print i, ctg[i]}' $fileout.quant/quant.sf > $fileout
;;
R) if [ $MODE == "u" ]
then
fastq=" -r "$INDIR"/"${fasta_array[$k]}".fastq"
else
fastq=" -1 "$INDIR"/"${fasta_array[$k]}".L.fastq -2 "$INDIR"/"${fasta_array[$k]}".R.fastq"
fi
rapmap quasimap -t $PROCESSORS $ADDOPT -i $out/$toolidx $fastq | grep -v "@" | cut -f3 | \
awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0; ctgsize[a[1]]= a[2];}; close(refseqs) } {ctg[$1]++}
END{
print "Contig", reads;
for (i in ctg) totRPK += ctg[i]/ctgsize[i];
for (i in ctg) {if (totRPK > 0) {print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000 } else {print i,"0"}} }' > $fileout
;;
H) if [ $MODE == "u" ]
then
fastq=" -fq="$INDIR"/"${fasta_array[$k]}".fastq"
else
fastq=" --fq="$INDIR"/"${fasta_array[$k]}".L.fastq --fq2="$INDIR"/"${fasta_array[$k]}".R.fastq"
fi
mkdir $fileout.hpg;
hpg-aligner dna --report-n-best=1 $ADDOPT --cpu-threads=$PROCESSORS -i=$out/$toolidx $fastq -o $fileout.hpg
awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0; ctgsize[a[1]]= a[2];}; close(refseqs) } {ctg[$1]++}
END{
print "Contig", reads;
for (i in ctg) totRPK += ctg[i]/ctgsize[i];
for (i in ctg) {if (totRPK > 0) {print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000 } else {print i,"0"}} }' $fileout.hpg > $fileout
;;
esac
if [ -f $finalout ]
then
#join -t $'\t' --header $finalout $fileout > $finalout'.tmp'
# paste the new sample's column, then drop the duplicated "Contig" key column (field NF-1)
paste $finalout $fileout | awk 'BEGIN{FS="\t"}{for (i=1; i<=NF-1; i++) if(i!=NF-1) {printf $i FS};{print $NF}}' > $finalout'.tmp'
# => K = correct order for everyone
# => B = the *.out files are in the correct order
mv $finalout'.tmp' $finalout
else
cat $fileout > $finalout
fi
cp $finalout $out/All_transcripts.quants
done
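The `paste | awk` step in the loop above appends each new sample's column while dropping the duplicated Contig key column (field NF-1) that `paste` reintroduces; it assumes every `*.out` file lists contigs in the same row order. A toy demonstration with hypothetical sample columns:

```shell
# Two single-sample tables sharing the same row order (file contents are
# illustrative); the merged table keeps a single Contig column.
t1=$(mktemp); t2=$(mktemp)
printf 'Contig\tspA_reads\nctg1\t12\n' > "$t1"
printf 'Contig\tspB_reads\nctg1\t34\n' > "$t2"
paste "$t1" "$t2" | \
  awk 'BEGIN{FS="\t"}{for (i=1; i<=NF-1; i++) if(i!=NF-1) printf $i FS; print $NF}'
# prints:
# Contig	spA_reads	spB_reads
# ctg1	12	34
rm "$t1" "$t2"
```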