Commit 8dad754c authored by peguerin

add sh scripts

parent 9e4096ce
# reference_database
Collection of scripts to build a reference database.
# reference database built from EMBL taxonomy and sequences
This method is based on [OBItools](https://pythonhosted.org/OBITools/welcome.html#installing-the-obitools)'s reference database.
## Installation
To build this reference database, you will need a local copy of this repository.
* open a shell
* make a folder with a name of your choice (I named it `workdir`)
```
mkdir workdir
cd workdir
```
* clone the project and move into its main folder, which is your working directory
```
git clone http://gitlab.mbb.univ-montp2.fr/edna/reference_database.git
cd reference_database
```
## Dependencies
You will also need to have the following programs installed on your computer.
- [OBItools](https://pythonhosted.org/OBITools/welcome.html#installing-the-obitools)
- [ecoPCR](https://git.metabarcoding.org/obitools/ecopcr/wikis/home)
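
Before going further, it can help to confirm that the required commands are actually on your `PATH`. The function below is a minimal sketch and is not part of this repository: `check_deps` is a hypothetical helper name, and the command list in the example call is taken from what `build_bdr.sh` invokes.

```shell
# Hypothetical helper, not part of the repository: report which of the
# given commands cannot be found on PATH.
check_deps() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing dependency: $cmd" >&2
            missing=1
        fi
    done
    # Non-zero exit status if at least one command is missing.
    return "$missing"
}

# Example call with the commands used by build_bdr.sh:
# check_deps obiconvert ecoPCR obigrep obiuniq obiannotate wget tar gzip
```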
## Preparation
- install dependencies
- clone the project (see [Installation](#installation) section)
- fill in [config.sh](config.sh) and read the [ecoPCR](https://pythonhosted.org/OBITools/scripts/ecoPCR.html?highlight=ecopcr) documentation
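
Since the build runs for a long time, a quick check of the configuration before starting may save a failed run. The sketch below is an assumption and not part of this repository: `check_config` is a hypothetical name, and it only tests that the variables defined in `config.sh` are non-empty, not that their values make biological sense.

```shell
# Hypothetical pre-flight check, not part of the repository: verify that
# every variable build_bdr.sh reads from config.sh is set and non-empty.
check_config() {
    status=0
    for name in rd_prefix ecoPCR_e ecoPCR_l ecoPCR_L primer5 primer3; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "config.sh: $name is not set" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible use from the reference_database folder:
# . ./config.sh && check_config && bash build_bdr.sh
```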
## Build a reference database
Once you have filled in [config.sh](config.sh) with the right parameters, run the following command from the `reference_database` folder:
```
bash build_bdr.sh
```
Downloading the sequences and running the *in silico* PCR and filtering steps takes a very long time. Make sure the script can run without interruption for several days.
I recommend opening [build_bdr.sh](build_bdr.sh) and running it step by step. The script is based on the [OBItools](https://pythonhosted.org/OBITools/welcome.html#installing-the-obitools) tutorial available [here](https://pythonhosted.org/OBITools/wolves.html).
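
If you do launch the whole script in one go, one common way to keep it alive after the terminal closes is `nohup`. This is only a sketch of that option, not something the repository provides: `run_detached` is a hypothetical wrapper and the log file name in the example is arbitrary.

```shell
# Hypothetical wrapper, not part of the repository: start a script with
# nohup so it keeps running after logout, and print its process ID.
run_detached() {
    script=$1
    logfile=$2
    # Both stdout and stderr go to the log file.
    nohup bash "$script" > "$logfile" 2>&1 &
    echo "$!"
}

# Example: run_detached build_bdr.sh build_bdr.log
# Follow progress with: tail -f build_bdr.log
```

A terminal multiplexer such as `screen` or `tmux` is another option if you prefer to reattach to the session later.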
## Results
Let `{prefix}` be the prefix of the reference database name, as defined by `rd_prefix` in [config.sh](config.sh).
The project folder will contain `{prefix}_*.sdx` files and a `db_{prefix}.fasta` file.
It will also contain an `EMBL` folder holding all the downloaded sequences.
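
To confirm that a run produced these files, a small check along the following lines may be useful. It is a sketch under assumptions: `check_outputs` is a hypothetical helper that is not part of this repository, and it only tests that the files exist, not that their content is valid.

```shell
# Hypothetical helper, not part of the repository: check that the result
# files exist for a given prefix in a given folder (default: current folder).
check_outputs() {
    prefix=$1
    dir=${2:-.}
    status=0
    if [ ! -s "$dir/db_${prefix}.fasta" ]; then
        echo "missing or empty: $dir/db_${prefix}.fasta" >&2
        status=1
    fi
    # If no .sdx file matches, the pattern is left unexpanded and -e fails.
    set -- "$dir/${prefix}"_*.sdx
    if [ ! -e "$1" ]; then
        echo "no ${prefix}_*.sdx file in $dir" >&2
        status=1
    fi
    return "$status"
}

# Example with the default prefix from config.sh:
# check_outputs embl_std .
```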
## Use the reference database
Your reference database can now be used for taxonomic assignment in our pipeline, which derives species presence in the environment from raw eDNA data.
You can use the absolute path of your reference database folder as the `/path/to/baseofreference` argument in [only_obitools](http://gitlab.mbb.univ-montp2.fr/edna/only_obitools) and [snakemake_only_obitools](http://gitlab.mbb.univ-montp2.fr/edna/snakemake_only_obitools).
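
If you are unsure of the absolute path, it can be printed from the shell. The function below is a small sketch: `abs_dir` is a hypothetical name, not a command from this repository or from OBItools.

```shell
# Hypothetical helper: print the absolute path of a folder
# (defaults to the current folder).
abs_dir() {
    # The subshell keeps the caller's working directory unchanged.
    (cd "${1:-.}" && pwd -P)
}

# Example, run from the folder that contains reference_database:
# abs_dir reference_database
```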
# Build a reference database
# load the argument values from config.sh
source ./config.sh
# download the sequences
mkdir -p EMBL
cd EMBL
# quote the URL so the shell passes the wildcard to wget unchanged
wget 'ftp://ftp.ebi.ac.uk/pub/databases/ena/sequence/release/std/*'
gzip -d *.gz
cd ..
# download taxonomy
mkdir TAXO
cd TAXO
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz
cd ..
# format the data
obiconvert --skip-on-error --embl -t ./TAXO --ecopcrdb-output="${rd_prefix}" EMBL/rel_std_*.dat
# ecoPCR to simulate an in silico PCR
# 50 :: change to 20 : loss of lamprey
ecoPCR -d "${rd_prefix}" -e "${ecoPCR_e}" -l "${ecoPCR_l}" -L "${ecoPCR_L}" "${primer5}" "${primer3}" > v_"${rd_prefix}".ecopcr
# clean the database
## filter sequences so that they have a good taxonomic description at the species, genus and family levels
obigrep -d "${rd_prefix}" --require-rank=species --require-rank=genus --require-rank=family v_"${rd_prefix}".ecopcr > v_"${rd_prefix}"_clean.fasta
## remove redundant sequences
obiuniq -d "${rd_prefix}" v_"${rd_prefix}"_clean.fasta > v_"${rd_prefix}"_clean_uniq.fasta
## ensure that the dereplicated sequences have a taxid at the family level
obigrep -d "${rd_prefix}" --require-rank=family v_"${rd_prefix}"_clean_uniq.fasta > v_"${rd_prefix}"_clean_uniq_clean.fasta
## ensure that sequences each have a unique identification
obiannotate --uniq-id v_"${rd_prefix}"_clean_uniq_clean.fasta > db_"${rd_prefix}".fasta
# your reference database is built !
# add spygen taxonomy [doesn't seem to work]
#obitaxonomy -d "${rd_prefix}" -a 'Cnasus_Ctoxo_Tsouffia':'species':10000087
#obitaxonomy -d "${rd_prefix}" -a 'Cidella_Hmolitrix':'species':10000088
# argument values for building reference database
## reference database prefix
rd_prefix="embl_std"
## ecoPCR arguments
### [-e] Maximum number of errors (mismatches) allowed per primer.
ecoPCR_e=3
### [-l] Minimum length of the in silico amplified DNA fragment, excluding primers.
ecoPCR_l=50
### [-L] Maximum length of the in silico amplified DNA fragment, excluding primers.
ecoPCR_L=150
### 5' primer sequence
primer5=ACACCGCCCGTCACTCT
### 3' primer sequence
primer3=CTTCCGGTACACTTACCATG