Here, we produce a valid reference database using **MKBDR** and test it with ecotag.
## Installation
See [Installing MKBDR](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Installing-MKBDR) for installation instructions.
## Example data
Download example data with:
curl -LJO https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/raw/master/tests/data/raw.fasta
* `raw.fasta`: a FASTA file containing 4 records, the representative sequences of 4 taxon groups. More details about input files [here](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Files-definition#representative-sequences).
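Before validating, it can help to confirm that the download contains the expected number of records. The following is a minimal sketch, assuming a standard FASTA layout where every record starts with a `>` header line. `count_fasta_records` is a hypothetical helper name used for illustration and is not part of MKBDR.

```shell
# Hypothetical helper (not part of MKBDR): count the records in a FASTA file
# by counting header lines, which start with ">".
count_fasta_records() {
    # grep -c exits with status 1 when nothing matches, so "|| true"
    # keeps the function usable under "set -e" for empty files.
    grep -c '^>' "$1" || true
}

# Usage, assuming raw.fasta is in the current directory:
# count_fasta_records raw.fasta
```

For the example data, the count should match the 4 records described above.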
## 1. First validation
The module **[validate](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Validate)** produces one FASTA file of valid records and separate FASTA files of faulty records.
mkbdr validate --fasta tests/data/raw.fasta --output_prefix res_raw
This will output:
Checking arguments...done.
Validate records...
Loading local NCBI taxonomy...done.
4 processed records.
On these records, 2 are valid, 0 are faulty format and 2 are faulty taxon.
* `res_raw_faulty_format.fasta`: a FASTA file with faulty format records (empty in this example)
* `res_raw_faulty_taxon.fasta`: a FASTA file with faulty taxonomy records (2 faulty records in this example)
* `res_raw_valide.fasta`: a FASTA file with correct records that can be used as a reference database for taxonomic assignment (2 valid records in this example)
Read more details about output files [here](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Files-definition#output-files).
:seedling: Out of 4 records, `validate` flagged 2 as having a faulty taxonomy. A faulty taxonomy means that the species name is unknown in the NCBI taxonomy, so these records cannot be used in a reference database as they are. We need to curate their taxonomy before they can be validated.
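To see which records need curation, one option is to list the headers of the faulty taxon file. This is only a sketch: it assumes the `validate` step above has been run in the current directory, and `list_fasta_headers` is a hypothetical helper name, not an MKBDR command.

```shell
# Hypothetical helper (not part of MKBDR): print the header of each record
# in a FASTA file, without the leading ">".
list_fasta_headers() {
    sed -n 's/^>//p' "$1"
}

# Usage, assuming the validate step has produced this file:
# list_fasta_headers res_raw_faulty_taxon.fasta
```

Each printed line corresponds to one record whose taxonomy was not recognized.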
## 2. Curation
The module **[curegen](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Curegen)** produces a [curation CSV file](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Files-definition) from the records with a faulty taxonomy:
mkbdr curegen --database_globalnames 'Catalogue of Life' --output_prefix res_raw --fasta res_raw_faulty_taxon.fasta
... | ... | |