Commit dc23b0ce authored by khalid

With Polo changes

parent 6a48544b
......@@ -5,13 +5,12 @@
CroCo : a program to detect and remove cross contaminations in assembled transcriptomes
=================
#### **Paul Simion, Khalid Belkhir, Hervé Philippe, Clémentine François, Julien Veyssier, Jochen Rink, Michaël Manuel, Max Telford**
1) Institut des Sciences de l'Evolution (ISEM) - UMR 5554 - CNRS, EPHE, IRD - Université de Montpellier, Montpellier, France
2)
3) Institut Biologie Paris-Seine (IBPS) - UMR 7138 - CNRS - Université Pierre et Marie Curie, Paris, France
4)
5)
6)
#### **Paul Simion, Khalid Belkhir, Clémentine François, Julien Veyssier, Jochen Rink, Michaël Manuel, Hervé Philippe, Max Telford**
1) Institut des Sciences de l'Evolution (ISEM), UMR 5554, CNRS, IRD, EPHE, Université de Montpellier, Montpellier, France
2) Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstrasse 108, 01307 Dresden, Germany
3) Sorbonne Universités, UPMC Univ Paris 06, CNRS, Evolution Paris-Seine UMR7138, Institut de Biologie Paris-Seine, Case 05, 7 quai St Bernard, 75005 Paris, France
4) Centre de Théorisation et de Modélisation de la Biodiversité, Station d'Ecologie Théorique et Expérimentale, UMR CNRS 5321, Moulis, 09200, France & Département de Biochimie, Centre Robert-Cedergren, Université de Montréal, Montréal, H3C 3J7 Québec, Canada
5) University College London, Department of Genetics, Evolution and Environment, Darwin Building, Gower Street, London WC1E 6BT, UK
<br /><br />
......@@ -22,17 +21,17 @@ CroCo : a program to detect and remove cross contaminations in assembled transcr
CroCo is a program to detect cross contaminations in assembled transcriptomes, using sequencing reads to determine the true origin of every transcript.
Such cross contaminations can be expected if several RNA-Seq experiments were prepared during the same period in the same lab, or by the same people, or if they were processed or sequenced by the same sequencing service company.
Our approach first determines a subset of transcripts that are suspiciously similar across samples using a BLAST procedure, and then quantifies the expression level of these transcripts in all samples (e.g. several species sequenced by the same lab for a particular study).
This can be done with various mapping tools implemented in CroCo (we use bowtie as default since we recommend favouring mapping accuracy above run speed).
CroCo then uses that information to categorize each transcript in the following 5 categories :
Our approach first determines a subset of transcripts that are suspiciously similar across samples using a BLAST procedure. CroCo then combines all transcriptomes into a metatranscriptome and quantifies the "expression level" of all transcripts successively with the read data of every sample (e.g. several species sequenced by the same lab for a particular study) while allowing read multi-mapping.
Several mapping tools implemented in CroCo can be used to estimate expression levels (default is RapMap).
This information is then used to categorize each transcript into one of the following 5 categories :
- **clean**: the transcript origin is from the expected sample.
- **cross contamination**: the transcript origin is from an unexpected sample of the same experiment.
- **dubious**: expression levels are too close between expected and unexpected samples to determine the true origin of the transcript.
- **clean**: the transcript origin is from the focal sample.
- **cross contamination**: the transcript origin is from an alien sample of the same experiment.
- **dubious**: expression levels are too close between focal and alien samples to determine the true origin of the transcript.
- **low coverage**: expression levels are too low in all samples for our procedure (which relies on differential expression) to confidently assign the transcript to any category.
- **over expressed**: expression levels are very high in more than 3 samples and CroCo will not try to categorize it. Indeed, such a pattern does not correspond to expectations for cross contaminations, but often reflect highly conserved genes such as ribosomal gene, or external contamination shared by several samples (such as *Escherichia coli* contaminations).
- **over expressed**: expression levels are very high in at least 3 samples and CroCo will not try to categorize it. Indeed, such a pattern does not correspond to expectations for cross contaminations, but often reflects highly conserved genes such as ribosomal genes, or external contamination shared by several samples (e.g. *Escherichia coli* contaminations).
CroCo outputs cross contaminations statistics, up to the five categorized transcriptome fasta files per sample, and several graphical networks of ascertained cross contaminations as well as for dubious cases.
CroCo outputs cross contamination statistics, quantifications of suspect transcripts in all samples and, optionally, up to five categorized transcriptome fasta files per sample as well as graphical networks of both ascertained and dubious cross contaminations.
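To make the categorization more concrete, here is a minimal bash sketch of the kind of decision rule described above (not CroCo's actual implementation). It assumes that, for a given transcript, we already know its TPM in the focal sample, the highest TPM among alien samples, and in how many samples it exceeds the over-expression threshold; the variable names and the exact tie-breaking are assumptions for illustration only, and the thresholds correspond to the default `--fold-threshold` and `--minimum-coverage` values documented below :

```bash
#!/bin/bash
# Minimal illustrative sketch of the categorization rule (NOT CroCo's actual code).
# tpm_focal     : TPM of the transcript in its focal (expected) sample
# tpm_alien_max : highest TPM of the same transcript among alien samples
# n_overexp     : number of samples in which the transcript exceeds the --overexp threshold
categorize () {
  local tpm_focal=$1 tpm_alien_max=$2 n_overexp=$3
  local fold=2 mincov=0.2            # default --fold-threshold and --minimum-coverage values
  if (( n_overexp >= 3 )); then echo "over expressed"; return; fi
  if awk -v f="$tpm_focal" -v a="$tpm_alien_max" -v c="$mincov" 'BEGIN{exit !(f<c && a<c)}'; then
    echo "low coverage"; return      # too little signal in every sample
  fi
  if awk -v f="$tpm_focal" -v a="$tpm_alien_max" -v k="$fold" 'BEGIN{exit !(f>=k*a)}'; then
    echo "clean"                     # focal expression dominates
  elif awk -v f="$tpm_focal" -v a="$tpm_alien_max" -v k="$fold" 'BEGIN{exit !(a>=k*f)}'; then
    echo "cross contamination"       # an alien sample dominates
  else
    echo "dubious"                   # expression levels too close to decide
  fi
}

categorize 12.5 0.4 0   # prints "clean": focal expression clearly dominates
```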
<p align="center">
<img src="utils/misc/CroCo_categories_300p.png" alt="Drawing" width="400"/>
......@@ -41,25 +40,25 @@ CroCo outputs cross contaminations statistics, up to the five categorized transc
---
**Important Note - NGS data type**
**Important Note - NGS data type and quality**
CroCo has been designed for analyzing RNA-Seq data as our approach uses differential expression to detect cross contaminations and thus can only work with sequencing data that informs on expression level.
CroCo has been designed with RNA-Seq data in mind, as our approach uses differential expression to detect cross contaminations and can only work with sequencing data that carry information on expression level.
Illumina data are therefore currently the best-suited type of RNA-Seq data for CroCo, but other types of transcriptomic data (such as 454 or PacBio) can also be used,
in which case the user should adjust some of the parameters (see [Detailed options](#detailed-options)).
Fundamentaly, the only requirement for CroCo to be effective is that sequence coverage must be correlated to the quantity of molecules in the original sample.
Fundamentally, the only requirement for CroCo to be effective is that sequence coverage must be correlated to the quantity of molecules in the original sample. In the near future, we will work on using our approach to detect cross contamination in genomic data.
Additionally, and regardless of the sequencing technology used, the quality of the assembled transcriptomes used as input for CroCo is important to maximize the accuracy of our approach. For example, transcriptome redundancy such as natural transcript variants could potentially create false positive results, since the variants will likely display different expression levels.
---
**Important Note - closely-related samples**
CroCo's procedure make use of sequence similarity to detect potential cross contaminations events.
CroCo's procedure makes use of sequence similarity to detect potential cross contamination events and to quantify their abundance.
As a result, our approach will lose power if the samples analysed are too closely related,
leading to a potential over-estimation of the number of cross contaminations, of dubious cases and of over-expressed cases. This seems to be particularly true when using fast quantification tools (such as *Salmon* or *Kallisto*)
as opposed to slower mapping tools such as *Bowtie*.
As a rule of thumb, we suggest not to analyse samples with less than ~4% sequence divergence.
Therefore, we do not recommend using CroCo in the context of population studies.
leading to a potential over-estimation of the number of cross contaminations, of dubious cases and of over-expressed cases.
We therefore recommend not analysing samples with very low sequence divergence (e.g. 1% divergence), such as in the context of population studies or tumour vs. normal cell comparisons.
Note, however, that the categorization of transcripts as clean will always be correct : you would only be discarding more transcripts than necessary.
Note, however, that even in such cases, the categorization of transcripts as clean will always be correct : you would only be discarding more transcripts than necessary.
---
......@@ -84,7 +83,6 @@ xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
5. [Outputs](#outputs)
6. [Detailed options](#detailed-options)
7. [Troubleshooting](#troubleshooting)
8. [Future Developments](#future-developments)
<div class="page-break"></div>
......@@ -97,7 +95,7 @@ Then go to the section below that corresponds to your Operating System.
**Linux & Mac OS X**
CroCo is a BASH script written under Linux that uses BLAST+, mapping softwares (e.g. *bowtie*, *Salmon*) and do not require any installation : **You can immediately use CroCo If you already have both the BLAST+ suite and the mapping tool you want to use** (e.g. *bowtie*) **installed on your system and present in your PATH**.
CroCo is a BASH script written under Linux that uses BLAST+ and mapping tools (e.g. *bowtie*, *RapMap*) and does not require any installation : **You can immediately use CroCo if you already have both the BLAST+ suite and the mapping tool you want to use installed on your system and present in your PATH**.
If you lack one or several of these tools, we provide a bash script (`install_dependencies.sh`) that will automatically install them for you. To do so, extract CroCo, move into its `utils/` directory and install the dependencies you want.
......@@ -109,17 +107,15 @@ cd XXXXXXXXXXXX/utils
bash ./install_dependencies.sh --tool all --os ubuntu
```
Here to install only the mapper Kallisto on a MAC OS X:
Here, to install only the mapper Kallisto on Mac OS X:
```bash
unzip XXXXXXXXXXXX
cd XXXXXXXXXXXX/utils
bash ./install_dependencies.sh --tool K --os macosx
```
We recommend to install all these dependencies automatically even if some of these softwares are already installed on your computer. Indeed, they will be installed within CroCo's directory and will not affect in any way the use of the softwares already installed on your system. This will assure you that you are using versions of these softwares that are compatible with the current version of CroCo.
We recommend installing all these dependencies automatically even if some of these tools are already installed on your computer. Indeed, this will ensure that you are using versions of these tools that are compatible with the current version of CroCo. Do not worry: they will be installed only within CroCo's directory and will not affect in any way the use of the software already installed on your system.
We now recommend you to have a look at the [Optional requirements](#optional-requirements) and then to proceed to the [Installation tests](#installation-tests) to make sure that CroCo runs correctly on your system.
Now, you should have a look at the [Optional requirements](#optional-requirements) and then proceed to the [Installation tests](#installation-tests) to make sure that CroCo runs correctly on your system.
---
......@@ -153,10 +149,8 @@ As CroCo is a bash script, and therefore do not require to be installed. However
**Mandatory dependencies - at least one mapping tool in the following list** :
- Bowtie-1.1.2
- Bowtie2-2.2.9
- Kallisto-0.43.0
- Salmon-0.6.0
- Rapmap-0.3.0
- Rapmap-0.1.0
The install script (`install_dependencies.sh`) lets you install all these dependencies automatically.
It can be used to install only one tool at a time (e.g. `--tool B`), or all at once (i.e. `--tool all`).
......@@ -173,7 +167,7 @@ If you encounter problems during dependencies installation, take a look at the [
---
**Notes for Linux users**
The installation script will install various libraries (or update them if already present) and will notably install `cmake` in order to be able to install *Salmon* and *RapMap*.
The installation script will install various libraries (or update them if already present) and will notably install `cmake` in order to be able to install *RapMap*. This requires an internet connection.
**Notes for MAC OS X users**
The installation script will install (or update if already present) several libraries as well as `Brew`, which will then be used to install the necessary Linux-like versions of several utilities used in CroCo and in the install_dependencies script itself (such as getopt, cmake, g++, awk, etc.).
......@@ -181,14 +175,14 @@ For more information on this necessary step, please see the following links :
[https://www.topbug.net/blog/2013/04/14/install-and-use-gnu-command-line-tools-in-mac-os-x/](https://www.topbug.net/blog/2013/04/14/install-and-use-gnu-command-line-tools-in-mac-os-x/)
[http://apple.stackexchange.com/questions/69223/how-to-replace-mac-os-x-utilities-with-gnu-core-utilities](http://apple.stackexchange.com/questions/69223/how-to-replace-mac-os-x-utilities-with-gnu-core-utilities)
**You can also install the tools manually**, without using our install script. The only important thing is to make sure they are in the PATH when you launch CroCo. CroCo will first look for the tools within its own directory in `utils/bin`, then in the PATH.
**You are also free to install CroCo dependencies manually**, without using our install script. The only important thing is to make sure they are in the PATH when you launch CroCo. CroCo will first look for the tools within its own directory in `utils/bin`, then in the PATH.
## Using CroCo through Docker
Using Docker will allow the creation of a CroCo image that contains all its dependencies.
It is therefore unnecessary to install any dependencies, but you are required to download and install Docker on your system (see [here](https://docs.docker.com)).
This install method notably enables Windows users to use CroCo.
This install method notably enables Windows users to use CroCo, but is also available for other operating systems.
Now, move into the `utils/CroCo_dockerbuild` directory within CroCo and then use Docker to build an image of CroCo, as follows :
```bash
......@@ -205,8 +199,7 @@ You will then be able to directly launch the CroCo image you just built through
**Adding CroCo in your PATH**
It is good practice to add CroCo's location in your PATH, which allow you to use CroCo from any location in your system.
This is done by using this command:
It is good practice to add CroCo's location to your PATH, which allows you to use CroCo from any location on your system. This is done with the following command:
```bash
export PATH=$PATH:/my/complete/path/to/CroCo/directory/src
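# Optionally, to make this permanent (assuming a bash shell; adapt to your own shell),
# you could append the same line to your ~/.bashrc, e.g. :
# echo 'export PATH=$PATH:/my/complete/path/to/CroCo/directory/src' >> ~/.bashrc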
......@@ -305,17 +298,18 @@ CroCo will create a directory containing all results which will be placed within
Usage :
CroCo_v0.1.sh [--mode p|u] [--tool B|K|R] [--fold-threshold FLOAT] [--minimum-coverage FLOAT] [--overexp FLOAT] [--threads INT] [--output-prefix STR] [--output-level 1|2|3] [--graph yes|no] [--trim5 INT] [--trim3 INT] [--frag-length FLOAT] [--frag-sd FLOAT] [--suspect-id INT] [--suspect-len INT] [--add-option STR] [--recat STR]
--mode p|u :\t 'p' for paired and 'u' for unpaired (default : 'p') [short: -m]
--in STR :\t Name of the directory containing the fasta files to be analyzed (DEFAULT : working directory) [short: -i]
--tool B|B2|K|R|S|H :\t\t'B' for bowtie, 'B2' for bowtie2, 'K' for kallisto, 'S' for salmon, 'R' for rapmap (DEFAULT : 'B') [short: -t]
--mode p|u :\t\t\t'p' for paired and 'u' for unpaired (default : 'p') [short: -m]
--in STR :\t\t\tName of the directory containing the input files to be analyzed (DEFAULT : working directory) [short: -i]
--tool B|K|R :\t\t'B' for bowtie, 'K' for kallisto, 'R' for rapmap (DEFAULT : 'R') [short: -t]
--fold-threshold FLOAT :\tValue between 1 and N (DEFAULT : 2) [short: -f]
--minimum-coverage FLOAT :\tValue in TPM (DEFAULT : 0.2) [short: -c]
--minimum-coverage FLOAT :\tTPM value (DEFAULT : 0.2) [short: -c]
--overexp FLOAT :\t\t\tTPM value (DEFAULT : 300) [short: -d]
--threads INT :\t\t\tNumber of threads to use (DEFAULT : 1) [short: -n]
--output-prefix STR :\t\tPrefix of output directory that will be created (DEFAULT : empty) [short: -p]
--output-level 1|2|3 :\t\tSelect the fasta files to output. '1' for none, '2' for clean and lowcov, '3' for all (DEFAULT : 1) [short: -l]
--graph yes|no :\t\tProduce graphical output using R (DEFAULT : no) [short: -g]
--add-option STR :\t\tThis text string will be understood as additional options for the mapper/quantifier used (DEFAULT : empty) [short: -a]
--recat SRT :\t\t\tName of the previous CroCo output folder of which you wish to re-categorize transcripts (DEFAULT : no) [short: -r]
--add-option 'STR' :\t\tThis text string will be understood as additional options for the mapper/quantifier used (DEFAULT : empty) [short: -a]
--recat STR :\t\t\tName of a previous CroCo output directory you wish to use to re-categorize transcripts (DEFAULT : no) [short: -r]
--trim5 INT :\t\t\tnb bases trimmed from 5' (DEFAULT : 0) [short: -x]
--trim3 INT :\t\t\tnb bases trimmed from 3' (DEFAULT : 0) [short: -y]
--suspect-id INT :\t\tIndicate the minimum percent identity between two transcripts to suspect a cross contamination (DEFAULT : 95) [short: -s]
......@@ -324,19 +318,19 @@ CroCo_v0.1.sh [--mode p|u] [--tool B|B2|K|R|S] [--fold-threshold INT] [--minimum
--frag-sd FLOAT :\t\tEstimated standard deviation of fragment length (no default value). Only used in specific combinations of --mode and --tool [short: -v]
It is good practice to redirect information about each CroCo run into an output log file using the following structure :
'| tee log_file'
'2>&1 | tee log_file'
Minimal working example :
CroCo_v0.1.sh --mode p | tee log_file
CroCo_v0.1.sh --mode p 2>&1 | tee log_file
Exhaustive example :
CroCo_v0.1.sh --mode p --in data_folder_name --tool B --fold-threshold 2 --minimum-coverage 0.2 --threads 8 --output-prefix test1_ --output-level 2 --graph yes --add-option '-v 0' --trim5 0 --trim3 0 --suspect-id 95 --suspect-len 40 --recat no | tee log_file
CroCo_v0.1.sh --mode p --in data_folder_name --tool B --fold-threshold 2 --minimum-coverage 0.2 --overexp 300 --threads 8 --output-prefix test1_ --output-level 2 --graph yes --add-option '-v 0' --trim5 0 --trim3 0 --suspect-id 95 --suspect-len 40 --recat no 2>&1 | tee log_file
Exhaustive example using shortcuts :
CroCo_v0.1.sh -m p -i data_folder_name -t B -f 2 -c 0.2 -n 8 -p test1_ -l 2 -g yes -a '-v 0' -x 0 -y 0 -s 95 -w 40 -r no | tee log_file
CroCo_v0.1.sh -m p -i data_folder_name -t B -f 2 -c 0.2 -d 300 -n 8 -p test1_ -l 2 -g yes -a '-v 0' -x 0 -y 0 -s 95 -w 40 -r no 2>&1 | tee log_file
Example for re-categorizing previous CroCo results
CroCo_v0.1.sh -i data_folder_name -r previous_CroCo_results_folder_name -f 10 -c 0.5 -g yes | tee log_file
CroCo_v0.1.sh -i data_folder_name -r previous_CroCo_results_folder_name -f 10 -c 0.5 -g yes 2>&1 | tee log_file
```
......@@ -348,9 +342,9 @@ If you launch CroCo through Docker, you will need to add the following before th
```bash
# for Windows
docker run -v /Users/path/where/data/are:/CroCoData /bin/bash [classic usage]
docker run -v /Users/path/where/data/are:/CroCoData /bin/bash [+ classic usage]
# for Linux and Mac OS X
docker run -t -i -P -v /home/user/where/data/are:/CroCoData:rw crocodock /bin/bash [classic usage]
docker run -t -i -P -v /home/user/where/data/are:/CroCoData:rw crocodock /bin/bash [+ classic usage]
```
......@@ -359,8 +353,8 @@ docker run -t -i -P -v /home/user/where/data/are:/CroCoData:rw crocodock /bin/b
# Inputs
The transcriptomes and raw data to be analyzed must be present in a given directory indicated by the user with the option `--in` (by default CroCo will look for them in the current directory).
Also, CroCo requires file names unity between transcriptomes and raw reads as follows:
The transcriptomes and raw data to be analyzed must be present in a given directory indicated by the user with the option `--in` (by default, CroCo will look for them in the current directory).
Also, **CroCo requires consistent file naming between transcriptomes and raw reads** as follows:
**for paired-end reads** :
NAME.fasta (assembled transcriptome seqs)
......@@ -401,13 +395,13 @@ Two files will also be created by CroCo :
**Categorized transcriptomes**
Up to five fasta files corresponding to the five transcript categories will be created (depending on the parameter value set for the `--output-level` option).
Up to five fasta files corresponding to the five transcript categories will be created (depending on the parameter value set for the `--output-level` option, which by default outputs none of these files).
Recommendations for downstream analyses :
- ONLY use the *clean* transcripts for direct qualitative analyses, such as phylogenetic inferences, or presence/absence observations. You might miss some data, but you won't miss rigour.
- ONLY use the *clean* transcripts for direct qualitative analyses, such as phylogenetic inferences or presence/absence observations. There is a slight risk of missing some correct data, but you will not sacrifice rigour.
- use both *clean* and *low coverage* transcripts if downstream analyses can adequately detect and circumvent potential cross contaminations (such as a refined orthology assignment step).
- If necessary, scavenge transcripts from the *dubious* category on a case-by-case basis, always checking for their true origin (e.g. BLASTing them against the NCBI non-redundant nucleotide database is a good start; see the example command after this list). If still in doubt, discard them.
- use *overexpressed* category on a case-by-case basis, as they are strongly expressed in several samples, which means they might stem from highly conserved genes of which it might not be trivial to determine the exact taxonomical origin. They could also come from external contamination shared by several samples. Users might want to evaluate these transcripts with other tools, such as [Busco](http://busco.ezlab.org/). Note that if you only analyze two samples, no transcript will be categorized as *overexpressed* as it requires that the transcript is highly expressed in at least three samples.
- use the *overexpressed* category on a case-by-case basis, as these transcripts are strongly expressed in several samples, which means they might stem from highly conserved genes for which it might not be trivial to determine the exact taxonomic origin. They could also come from external contamination shared by several samples. Users might want to evaluate these transcripts with other tools, such as [Busco](http://busco.ezlab.org/). Note that if you only analyze two samples, no transcript will ever be categorized as *overexpressed*, as this requires the transcript to be highly expressed in at least three samples.
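As an illustration of the case-by-case check suggested for *dubious* transcripts above, here is one possible way to query the NCBI nt database remotely with BLAST+ (the fasta and output file names are hypothetical, and remote searches can be slow for large numbers of queries):

```bash
# Query hypothetical dubious transcripts against NCBI nt using BLAST+ remote mode
blastn -query SpeciesA_dubious.fasta -db nt -remote \
       -max_target_seqs 5 -outfmt "6 qseqid sseqid pident length evalue stitle" \
       -out SpeciesA_dubious.blast_nt.tsv
```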
---
......@@ -419,9 +413,9 @@ This option requires a working installation of R and the installation of two R l
A first network corresponds to the exhaustive cross contamination patterns (i.e. *network_complete.html*)
while a second one is an arbitrarily simplified version of the same network in which all links representing less than 2% of the biggest link are ignored (i.e. *network_simplified.html*).
This simplification is useful for visualizing large networks when there are strong discrepancies between cross contamination links.
The last two networks are the conterparts of those described above for dubious cross contamination cases.
The other two networks created are the counterparts of those described above for dubious cross contamination cases (i.e. *network_dubious_complete.html* and *network_dubious_simplified.html*).
To view these dynamic networks, simply open the corresponding html file with any internet browser (e.g. Firefox, Chromium). The network nodes can be selected (it highlights them and their associated cross contaminations) and moved around as well. Here is an example command line to open the network with firefox :
To view these dynamic networks, simply open the corresponding html file with any internet browser (e.g. Firefox, Chromium). The network nodes can be selected (which highlights them and their associated cross contaminations) and moved around as well. Here is an example command line to open the network with firefox :
```bash
firefox network_simplified.html &
......@@ -429,7 +423,7 @@ firefox network_simplified.html &
Please note that if you want to move these html files elsewhere (e.g. to store CroCo results),
you'll also need to move their associated folders along with them,
respectively named `network_complete_files` and `network_simplified_files`, as the html files need them to display the networks.
respectively named *network_complete_files*, *network_simplified_files*, *network_dubious_complete_files* and *network_dubious_simplified_files* as the html files need them to display the networks.
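For example, assuming you want to archive the simplified network (the destination path is illustrative):

```bash
# Move an html network together with its asset folder (destination is illustrative)
mv network_simplified.html network_simplified_files/ /path/to/archived_CroCo_results/
```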
------
......@@ -440,7 +434,7 @@ respectively named `network_complete_files` and `network_simplified_files`, as t
This parameter specifies whether the raw data are paired-end (`p`) or single-end (unpaired, `u`).
Don't forget to adjust the input file names accordingly : NAME.fastq for unpaired data, NAME.L.fastq + NAME.R.fastq for paired-end data.
IMPORTANT : if `--mode` is set to *unpaired* AND the tool used is either *Salmon* or *Kallisto*, then the options `--frag-length` and `--frag-sd` are required.
IMPORTANT : if `--mode` is set to *unpaired* AND the tool used is *Kallisto*, then the options `--frag-length` and `--frag-sd` are required.
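For example, for a hypothetical sample named `SpeciesA`, CroCo would expect the following files in the input directory:

```bash
ls data_folder_name/
# paired-end mode (--mode p) : SpeciesA.fasta  SpeciesA.L.fastq  SpeciesA.R.fastq
# unpaired mode   (--mode u) : SpeciesA.fasta  SpeciesA.fastq
```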
---
......@@ -467,12 +461,11 @@ to consider its query transcript *suspect* (default is `40` nucleotids).
**--tool** (-t)
This allows you to choose the mapper/quantifier to use from the following list : bowtie (`B`, default), bowtie2 (`B2`) , kallisto (`K`), salmon (`S`) and rapmap (`R`).
This allows you to choose the mapper/quantifier to use from the following list : bowtie (`B`), kallisto (`K`), and rapmap (`R`, default).
IMPORTANT : These tools use different approaches to map and to quantify reads, resulting in possibly very different levels of precision and speed. The accuracy of CroCo relies on the accuracy of the tool selected.
The decision to use bowtie as default is based on the observation that its computationally heavier pairwise alignment strategy leads to higher accuracy when comparing samples closely-related.
If analysis speed is a criterion of importance for you, we then recommend Salmon as the best trade-off between accuracy and speed.
The decision to use RapMap as default is based on analyses of simulated data, which suggested that using this mapping tool leads to higher accuracy when comparing closely-related samples.
---
......@@ -492,7 +485,7 @@ The side effect will be that more (possibly many) transcripts will end up catego
**--minimum-coverage** (-c)
Indicates the expression level in Transcripts Per Million (TPM) under which the transcript will be categorized neither as "clean", "contamination" nor "dubious, but will instead be categorized as a "low coverage" transcript.
Indicates an expression level in Transcripts Per Million (TPM). If a transcript has a TPM value lower than this threshold in all samples, it will be categorized as a "low coverage" transcript.
This "low coverage" category reflects CroCo using a quantitative approach. If not enough information is given to it, it will not have anough resolution power to determine with certainty the true source of a transcript.
......@@ -504,6 +497,14 @@ We recommend to stay somewhere within `0.1` and `10`, depending on you sequencin
---
**--overexp** (-d)
Indicates an expression level in Transcripts Per Million (TPM). If a transcript has a higher value than this threshold in at least three samples, it will be categorized as "overexpressed". Its default value is `300` TPM.
Overexpressed transcripts will likely correspond to highly conserved genes or to external contaminations shared by multiple samples under study.
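For example, if you suspect that the default threshold is too low for your data, you could quickly re-categorize a previous run with a stricter value through `--recat` (directory names below are illustrative):

```bash
bash CroCo_v0.1.sh --in data_folder_name --recat previous_CroCo_results_folder_name --overexp 500 2>&1 | tee log_file
```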
---
**--threads** (-n)
Allows the user to specify the number of parallel threads to be used by CroCo (default = `1`).
......@@ -538,19 +539,12 @@ In case you would like to see the cross contamination network of a previous CroC
This parameter allows experienced users to specify additional options for the mapper/quantifier tool used (default is empty).
This way, it is possible to change these tools' default parameter values according to your type of data in order to improve mapping/quantification efficiency.
*Suggestions*:
If you are analyzing 454 reads, you might want to use a less stringeant mapping to handle their more frequent sequencing errors and frameshifts. In that case you might want to use Bowtie2 with this additional option :
```bash
bash CroCo_v0.1.sh --mode u --tool B2 --add-option '--very-sensitive-local'
```
This time, if you want to increase Bowtie precision to only allow a maximum of 1 mismatch per alignment :
*Example*:
Assuming you want to increase Bowtie precision by only allowing a maximum of 1 mismatch per alignment :
```bash
bash CroCo_v0.1.sh --mode u --tool B --add-option '-v 1'
```
Lastly, if you want to increase Salmon sensitivity :
```bash
bash CroCo_v0.1.sh --mode u --tool S --add-option '--useVBOpt --extraSensitive'
```
---
......@@ -566,21 +560,21 @@ bash CroCo_v0.1.sh --in data_location --recat data_location/CroCo-B-id95-len40-f
Recategorization is very fast as neither BLAST nor mapping steps need to be computed anew.
This `--recat` switch allows the user to quickly try several categorization parameterizations.
Note that if you want to try different parameterizations for the BLAST step,
then you will be required to run a new complete CroCo analysis.
then you will be required to run a complete new CroCo analysis.
---
**--frag-length** (-u)
This optional parameter indicates the estimated mean length of the fragments that were sequenced.
This option must only be used in particular CroCo set up : using *unpaired* reads with *Kallisto* or *Salmon*.
This option must only be used in a particular CroCo setup : using *unpaired* reads with *Kallisto*.
---
**--frag-sd** (-v)
This optional parameter indicates the estimated standard deviation of the fragment length.
This option must only be used in particular CroCo set up : using *unpaired* reads with *Kallisto* or *Salmon*.
This option must only be used in a particular CroCo setup : using *unpaired* reads with *Kallisto*.
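For example, a hypothetical run on unpaired reads with Kallisto (the fragment length and standard deviation values below are illustrative and should reflect your own library preparation):

```bash
bash CroCo_v0.1.sh --mode u --tool K --frag-length 200 --frag-sd 30 2>&1 | tee log_file
```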
---
......@@ -595,24 +589,7 @@ Default values for these two parameters are `0`.
**During the installation of CroCo**
*this one should not exist anymore*
Error message : `You must install make, cmake and one of clang and g++ to compile RapMap`
This is common problem of installing RapMap on MacOSX.
As using RapMap is not mandatory for using CroCo, we might ignore this problem and continue with the installation of the other dependencies, as follows:
```bash
bash ./install_dependencies.sh --tool S --os macosx
bash ./install_dependencies.sh --tool K --os macosx
bash ./install_dependencies.sh --tool BL --os macosx
```
---
*this one should not exist anymore*
Error message : `install_dependencies.sh: line 325: wget: command not found`
---
*RapMap and Salmon installations failed*
*RapMap installation failed*
This is likely because `cmake` is not installed (or not up-to-date) on your system.
Normally, the `install_dependencies.sh` script should install or update `cmake` automatically,
but it can only work with an internet connection.
......@@ -634,98 +611,12 @@ This might simply be a problem with the CRAN mirror you used to download and ins
**During CroCo run**
*Problems with contig names in transcriptomes*
*Problems with contig names in transcriptomes (1)*
Error message : `Error: [blastdbcmd] contig_name: OID not found`
This message indicates that the `contig_name` is not recognised by BLAST+, probably because of the presence of special characters. Try to replace them with another character, such as an underscore.
---
*this one should not exist anymore*
*Problems with contig names in transcriptomes (2)*
Error message : `Warning: (1431.1) CFastaReader: Title is very long: XXXX characters (max is 1000), at line xxxxxxxx`
This message indicates that the sequence name before the first spacing character is too long (more than 1000 characters). You need to shorten it.
------
# Future Developments
**Use of closely-related samples to ignore them !!**
Allowing the user to provide a file containing pair of species of which cross contam will not be evaluated.
This would allow to use very closely-related samples in a broader analysis while ignoring the otherwise detected crosscontam between this specific pair. This would allow to decontaminate these two samples from their respective cross contam (that they do not necessarily share!)
implementation : while categorizing *.all* results, if the best unexpected source is listed as a *paired* sample, we look at the second best unexpected source, etc...
**Exporting the network in a vectorial image format (eps ? svg ? pdf ?)**
Would allow a practical use of CroCo output as a ready-to-publish graph.
**allowing using reads with no quality information**
In order to be able to use CroCo on sequencing data with no quality info. It will probably only be possible for some mapping/quantification tools.
**implementing general parameter switches**
It would be easier for users to be able to select a general option that automatically sets multiple parameters to adequate values.
Example 1 : a user might use a `--454` switch to automatically select bowtie2 and its `--very-sensitive-local` option.
Example 2 : a user might use a `--strict` switch to automatically select bowtie,
increase the value of `--fold-threshold` to 10, increase the value of `--minimum-coverage` to 2 and not allowing any mismatches when mapping with `--add-option '-v 0'`
**accelerating blast step**
It would be nice to properly parallelize blast step as I believe that, so far, it only uses 1 thread at a time.
**add an option to output the LOG file of the run directly in the output directory**
it could be done using ` > logfile`, but it would be nice that the log file also ends up within the $out directory.
**Cleaning sequencing raw data**
CroCo has been designed to clean assembled transcriptomes.
However, it would be useful to also clean-up the raw data :
this would unlock improvement for all RNA-Seq downstream analyses starting with the transcriptome assembly itself.
**Fast heuristic for CroCo**
Adding an option allowing for a partial but fast analysis in order to **anticipate risks** of cross contaminations in very large sequencing experiments.
Such a large-scale pre-screening option could allow to detect whether a full CroCo analysis is required, **before going for very heavy computational analyses**.
**Additional checking of transcriptome quality**
using the mappings to give info on various quality stats. Also comparing overexpressed transcripts on refined database.
Overall, it is about trying to help the user a bit beyond cross-contamination detection.
------
**TRIALS - TEMPORARY SECTION**
*fixing the sed problems ??*
```bash
- (brew install gnu-sed)
```
*test macosx 10.11.3 (bureau fred)*
```bash
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
export PATH="$(brew --prefix coreutils)/libexec/gnubin:/usr/local/bin:$PATH"
brew tap homebrew/dupes
brew install coreutils gnu-getopt gawk grep
brew link --force gnu-getopt
cd utils
bash install_dependencies.sh --tool all --os macosx
```
*test macosx 10.6.8 (Andrea vieux portable)*
```bash
=> brew worked but git is not installed => nothing can be installed
```
*test macosx 10.11.4 (ordi Fabien)*
```bash
=> wget is not installed => NCBI must be in "local install mode"
=> problem with Salmon ! "dyld: Library not loaded: @rpath/libtbb.dylib"
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/610990
maybe once the script is executable, it will work on its own... (hopefully)
```
This message indicates that the sequence name before the first encountered spacing character is too long (more than 1000 characters). You need to shorten it.
......@@ -56,15 +56,15 @@ if [ $RECAT == "no" ]; then
START_TIME2=$SECONDS
### index contigs for the selected tool
echo -e "\nBuilding index : $out/$toolidx\n"
case "$TOOL" in
B) toolidx="ALL_transcripts_bowtie_index"; if [ ! -d $out/$toolidx ]; then mkdir $out/$toolidx; bowtie-build --offrate 3 $out/ALL_transcripts.fasta $out/$toolidx/$toolidx ;fi ;;
B2) toolidx="ALL_transcripts_bowtie2_index"; if [ ! -d $out/$toolidx ]; then mkdir $out/$toolidx; bowtie2-build --offrate 3 $out/ALL_transcripts.fasta $out/$toolidx/$toolidx ;fi ;;
K) toolidx="ALL_transcripts_kallisto_index"; if [ ! -f $out/$toolidx ]; then kallisto index -i $out/$toolidx $out/ALL_transcripts.fasta; fi ;;
S) toolidx="ALL_transcripts_salmon_index" ; if [ ! -d $out/$toolidx ]; then salmon --no-version-check index -t $out/ALL_transcripts.fasta -i $out/$toolidx; fi ;;
R) toolidx="ALL_transcripts_rapmap_index" ; if [ ! -d $out/$toolidx ]; then rapmap quasiindex -t $out/ALL_transcripts.fasta -i $out/$toolidx; fi ;;
R) toolidx="ALL_transcripts_rapmap_index" ; if [ ! -d $out/$toolidx ]; then rapmap quasiindex -t $out/ALL_transcripts.fasta -p -x $PROCESSORS -i $out/$toolidx; fi ;;
H) toolidx="ALL_transcripts_hpg_index" ; if [ ! -d $out/$toolidx ]; then mkdir $out/$toolidx; hpg-aligner build-sa-index -g $out/ALL_transcripts.fasta -i $out/$toolidx; fi ;;
esac
echo -e "\nIndex built : $out/$toolidx\n"
for (( j=0; j <i; j++ ))
do
......@@ -100,12 +100,12 @@ if [ $RECAT == "no" ]; then
else
fastq=" -1 "$INDIR"/"${fasta_array[$k]}".L.fastq -2 "$INDIR"/"${fasta_array[$k]}".R.fastq"
fi
bowtie -p $PROCESSORS $ADDOPT -a --trim5 $TRIM5 --trim3 $TRIM3 --suppress 1,2,4,5,6,7,8 $out/$toolidx/$toolidx $fastq > $out/${fasta_array[$k]}".mapping"
bowtie -p $PROCESSORS $ADDOPT -a --trim5 $TRIM5 --trim3 $TRIM3 --chunkmbs 2000 --suppress 1,2,4,5,6,7,8 $out/$toolidx/$toolidx $fastq | \
awk -v reads=$reads -v refseqs=$refseqALL 'BEGIN{OFS="\t"; while ((getline sequ < refseqs) > 0) {split(sequ,a,"\t");ctg[a[1]] = 0; ctgsize[a[1]]= a[2];}; close(refseqs) } {ctg[$1]++}
END{
print "Contig", reads;
for (i in ctg) totRPK += ctg[i]/ctgsize[i];
for (i in ctg) {if (totRPK > 0) {print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000 } else {print i,"0"}} }' < $out/${fasta_array[$k]}".mapping" > $fileout
for (i in ctg) {if (totRPK > 0) {print i, (ctg[i]/ctgsize[i])*(1/totRPK)*1000000 } else {print i,"0"}} }' > $fileout
;;
B2) if [ $MODE == "u" ]
then
......@@ -379,7 +379,7 @@ fi
### miscellaneous
if [ $RECAT == "no" ]; then
mkdir utility_files_CroCo
mv *.outblast *.suspects *.blastdb.* *_index *.ctgs *.out *.all_quants *.fasta_mod *.fasta_suspect ALL_transcripts.* utility_files_CroCo/
mv *.outblast *.suspects *.blastdb.* *_index *.ctgs *.out *.all_quants *.fasta_mod *.fasta_suspect utility_files_CroCo/
fi
cd ../
......
......@@ -4,7 +4,7 @@ FOLD=2 # how many fold enrichment of correct source over others is required to
MINCOV=0.2 # minimum expression level (TPM); below this value in all samples a transcript is categorized as "low coverage"
PROCESSORS=1 # number of threads for running the tool
OUTPUTPREFIX="" # prefix of output directory that will be created
OUTPUTLEVEL=2 # "verbosity" level for outputs
OUTPUTLEVEL=1 # "verbosity" level for fasta outputs
GRAPH=no # whether to produce a graphical output of the cross contamination network
ADDOPT="" # string that will be added to the mapper command line
TRIM5=0 # number of bases to trim from read's 5'
......@@ -13,5 +13,5 @@ FRAGLENGTH=none # average fragment length
FRAGSD=none # standard deviation of fragment length
SUSPID=95 # blast minimum id% for being a potential suspect of cross contamination
SUSPLEN=40 # blast minimum hit length (nucleotides) for being a potential suspect of cross contamination
OVEREXP=1000 # expression level (TPM) above which a transcript might be considered "overexpressed"
OVEREXP=300 # expression level (TPM) above which a transcript might be considered "overexpressed"
RECAT=no # previous CroCo output directory to be used to re-categorize transcripts
......@@ -69,7 +69,7 @@ elif [ $OUTPUTLEVEL == "3" ]; then
cat $out/$ref".togrep" | while read line; do
ctg=`echo $line | cut -d' ' -f1`
go="_clean"
if LC_ALL=C grep -F -q -w -m1 "$ctg" $out/$ref.lowcov && LC_ALL=C grep -F -q -w -m1 "$ctg" $out/$ref.all ; then go="_lowcov" # note : $out/$ref.lowcov also contains the non-suspects !!!
if LC_ALL=C grep -F -q -w -m1 "$ctg" $out/$ref.lowcov && LC_ALL=C grep -F -q -w -m1 "$ctg" $out/$ref.all ; then go="_lowcov" # note : $out/$ref.lowcov also contains the non-suspects !!!
elif LC_ALL=C grep -F -q -w -m1 "$ctg" $out/$ref.contam && LC_ALL=C grep -F -q -w -m1 "$ctg" $out/$ref.all ; then go="_contam"
elif LC_ALL=C grep -F -q -w -m1 "$ctg" $out/$ref.dubious && LC_ALL=C grep -F -q -w -m1 "$ctg" $out/$ref.all ; then go="_dubious"
elif LC_ALL=C grep -F -q -w -m1 "$ctg" $out/$ref.overexp && LC_ALL=C grep -F -q -w -m1 "$ctg" $out/$ref.all ; then go="_overexp"
......
......@@ -13,15 +13,15 @@ for f in $out"/"*".all"
tot_ctgs=`grep -c '>' $INDIR/$ff.fasta`
tot_suspects=`grep -c '>' $out/$ff.fasta_suspect`
never_suspected=$(($tot_ctgs-$tot_suspects))
clean_ctgs=`grep -c 'clean' $f`
clean_ctgs=`grep -c -w 'clean' $f`
clean_percent=`echo "scale=2; ($clean_ctgs + ($tot_ctgs-$tot_suspects) ) * 100 / $tot_ctgs " | bc -l | awk '{printf "%.2f", $0}'`
contam_ctgs=`grep -c 'contam' $f`
contam_ctgs=`grep -c -w 'contam' $f`
contam_percent=`echo "scale=2; $contam_ctgs * 100 / $tot_ctgs " | bc -l | awk '{printf "%.2f", $0}'`
dubious_ctgs=`grep -c 'dubious' $f`
dubious_ctgs=`grep -c -w 'dubious' $f`
dubious_percent=`echo "scale=2; $dubious_ctgs * 100 / $tot_ctgs " | bc -l | awk '{printf "%.2f", $0}'`
lowcov_ctgs=`grep -c 'lowcov' $f`
lowcov_ctgs=`grep -c -w 'lowcov' $f`
lowcov_percent=`echo "scale=2; $lowcov_ctgs * 100 / $tot_ctgs " | bc -l | awk '{printf "%.2f", $0}'`
overexp_ctgs=`grep -c 'overexp' $f`
overexp_ctgs=`grep -c -w 'overexp' $f`
overexp_percent=`echo "scale=2; $overexp_ctgs * 100 / $tot_ctgs " | bc -l | awk '{printf "%.2f", $0}'`
echo -e "$ff\t"$tot_ctgs"\t"$never_suspected"\t"$tot_suspects"\t"$clean_ctgs"\t"$lowcov_ctgs"\t"$overexp_ctgs"\t"$dubious_ctgs"\t"$contam_ctgs"\t"$clean_percent"\t"$lowcov_percent"\t"$overexp_percent"\t"$dubious_percent"\t"$contam_percent >> $out"/CroCo_summary"
#sed 's/\./,/g' $out"/CroCo_summary" > $out"/CroCo_summary.tmp"
......