Update Running MKBDR authored by peguerin
......@@ -25,10 +25,9 @@ The module **[validate](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_
```
mkbdr validate --fasta tests/data/raw.fasta --output_prefix res_raw
```
This outputs:
```
Checking arguments...done.
Validate records...
Loading local NCBI taxonomy...done.
4 processed records.
```
......@@ -43,19 +42,142 @@ Read more details about output files [here](https://gitlab.mbb.univ-montp2.fr/ed
:seedling: Of the 4 records, MKBDR flagged 2 with a faulty taxonomy. A faulty taxonomy means that the species name is unknown in the NCBI taxonomy, so these records cannot be used in a reference database as they are. We need to curate their taxonomy to validate them.
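As a quick cross-check of these counts, the records in each output FASTA file can be tallied by counting header lines. The sketch below is not part of MKBDR: `count_fasta_records` is a hypothetical helper, and it only assumes the standard FASTA convention that every record starts with a `>` line.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count records in a FASTA file by counting lines that start with '>'."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For the run above, calling `count_fasta_records("res_raw_faulty_taxon.fasta")` should return 2 if that file holds the two faulty-taxonomy records.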
## 2. Curation generation
The module **[curegen](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Curegen)** produces a [curation CSV file](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Files-definition). This curation file is used later to apply curation to the faulty records.
**curegen** takes the FASTA file of faulty-taxonomy records as input and outputs a corresponding curation file.
In short, for each record **curegen** searches NCBI for synonyms of the faulty species name. If no synonym is found, it searches a global names database for the closest genus name known to the NCBI taxonomy. The default global database is 'Fishbase', but let's consult 'Catalogue of Life' instead:
```
mkbdr curegen --fasta res_raw_faulty_taxon.fasta \
--database_globalnames 'Catalogue of Life' \
--output_prefix res_raw
```
This will output:
```
Curation generation...
Loading local NCBI taxonomy...done.
Trying fuzzy search for Albula forsteri
FOUND! Albula forsteri taxid:531982 score:0 (1.0)
Trying fuzzy search for Stegastes xanthurus
   current_name         ncbi_name        genus      family         ncbi_rank  method
0  Albula forsteri      Albula argentea  Albula     Albulidae      species    NCBI synonym score=1.0
1  Stegastes xanthurus  NA               Stegastes  Pomacentridae  genus      Catalogue of Life
```
* `res_raw_curation.csv`: a CSV file with instructions to curate taxonomies (2 records)
As the output shows, of the 2 faulty-taxonomy records, an NCBI synonym was found for one. For the other, a genus name that can be used in NCBI was identified from the Catalogue of Life.
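The two-stage fallback described above (NCBI synonym first, genus from a global names database second) can be sketched as follows. This is a toy illustration under stated assumptions, not MKBDR's implementation: `propose_curation` and both dictionaries are hypothetical, and an exact dictionary lookup stands in for the fuzzy search that the log mentions.

```python
def propose_curation(species, ncbi_synonyms, global_genera):
    """Return a curation proposal for a species name unknown to NCBI, or None.

    ncbi_synonyms: dict mapping a synonym to its accepted NCBI name (toy data).
    global_genera: dict mapping a genus to its family (toy stand-in for a
    global names database such as Catalogue of Life).
    """
    parts = species.split()
    if not parts:
        return None
    # Stage 1: prefer an NCBI synonym, which resolves the name at species rank.
    if species in ncbi_synonyms:
        return {"current_name": species, "ncbi_name": ncbi_synonyms[species],
                "ncbi_rank": "species", "method": "NCBI synonym"}
    # Stage 2: fall back to the genus, taken as the first word of the binomial.
    genus = parts[0]
    if genus in global_genera:
        return {"current_name": species, "ncbi_name": None, "genus": genus,
                "family": global_genera[genus], "ncbi_rank": "genus",
                "method": "global names database"}
    # Nothing found: the record stays faulty.
    return None


# Toy data taken from the two records shown in the output above.
synonyms = {"Albula forsteri": "Albula argentea"}
genera = {"Stegastes": "Pomacentridae"}
print(propose_curation("Albula forsteri", synonyms, genera))
print(propose_curation("Stegastes xanthurus", synonyms, genera))
```

The ordering matters: a synonym keeps the record at species rank, whereas the genus fallback only places it at genus rank.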
:seedling: We generated taxonomy curation instructions for the 2 faulty-taxonomy records. The next step applies this curation to the records.
## 3. Validate and curate
Taxonomy curation can be applied with the **validate** module using the `--curate` option. The inputs are:
* `raw.fasta`
* `res_raw_curation.csv`
To apply taxonomy curation only for NCBI synonyms:
```
mkbdr validate --fasta raw.fasta \
--curate res_raw_curation.csv \
--output_prefix res_curated
```
This outputs:
```
Validate records...
Loading local NCBI taxonomy...done.
Curating records with faulty taxonomy...
curation not performed on record YCA_R0449;
On 2 faulty records, 1 records are curated.
4 processed records.
On these records, 3 are valid, 0 are faulty format and 1 are faulty taxon.
```
The faulty-taxonomy record for which an NCBI synonym was found is curated: its faulty species name has been replaced by the NCBI synonym, so 3 valid records are produced.
* `res_curated_faulty_format.fasta`: a FASTA file with faulty format records (empty in this example)
* `res_curated_faulty_taxon.fasta`: a FASTA file with faulty taxonomy records (1 faulty record in this step)
* `res_curated_valide.fasta`: a FASTA file with correct records that can be used as a reference database for taxonomic assignment (3 valid records in this step)
:seedling: Using the `--curate` option, curation can be applied to faulty-taxonomy records. However, curation is incomplete here, as one faulty record remains. Curating it requires adding nodes to the NCBI taxonomy tree itself.
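To see why only one of the two records is curated at this step, it can help to split the curation instructions by whether they carry an NCBI name. The sketch below is hedged: it assumes the curation CSV is comma-separated, has a header row with the column names printed by **curegen**, and marks a missing NCBI name as `NA` or an empty field. The actual layout is documented on the Files-definition wiki page, and `split_curation_rows` is a hypothetical helper.

```python
import csv
import io


def split_curation_rows(csv_text):
    """Split curation rows into synonym-level and genus-level instructions."""
    by_synonym, by_genus = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: a missing NCBI name is written as "NA" or left empty.
        if row["ncbi_name"] not in ("", "NA"):
            by_synonym.append(row)
        else:
            by_genus.append(row)
    return by_synonym, by_genus


# Toy CSV mirroring the table printed by curegen (file layout assumed).
example = (
    "current_name,ncbi_name,genus,family,ncbi_rank,method\n"
    "Albula forsteri,Albula argentea,Albula,Albulidae,species,NCBI synonym score=1.0\n"
    "Stegastes xanthurus,NA,Stegastes,Pomacentridae,genus,Catalogue of Life\n"
)
by_synonym, by_genus = split_curation_rows(example)
print(len(by_synonym), len(by_genus))
```

Rows in the first group can be curated by a simple name replacement. Rows in the second group name a species that NCBI does not know, which is why the taxonomy itself has to be extended.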
## 4. Load a local NCBI taxonomy
To produce a custom NCBI taxonomy tree, we first need to download and decompress the NCBI taxdump:
```
mkbdr init_ncbi_taxdump --folder_path customtaxonomy/ --decompress
```
This will output:
```
citations.dmp
delnodes.dmp
division.dmp
gencode.dmp
merged.dmp
names.dmp
nodes.dmp
gc.prt
readme.txt
NCBI taxdump initialisation successful !
Path: customtaxonomy/
```
The NCBI taxonomy files are stored in the local folder `customtaxonomy/`.
:seedling: The local NCBI taxonomy is needed for the next step, where we edit it to add a new taxon group.
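For readers who want to inspect the downloaded files, here is a minimal sketch that reads `names.dmp`. It relies on the taxdump layout, where fields are separated by tab, pipe, tab and each row ends with tab, pipe. The helper names `parse_dmp_line` and `scientific_names` are hypothetical and not part of MKBDR.

```python
def parse_dmp_line(line):
    """Split one taxdump row into its fields.

    Fields are separated by tab-pipe-tab and the row ends with tab-pipe.
    """
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def scientific_names(names_dmp_path):
    """Map tax_id to scientific name, skipping synonyms and other name classes."""
    names = {}
    with open(names_dmp_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            # names.dmp columns: tax_id, name, unique name, name class
            tax_id, name, _unique_name, name_class = parse_dmp_line(line)[:4]
            if name_class == "scientific name":
                names[int(tax_id)] = name
    return names
```

With the folder created above, `scientific_names("customtaxonomy/names.dmp")` would return one entry per taxon in the local taxonomy.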
## 5. Validate, curate and edit local taxonomy
MKBDR can edit the local taxonomy with the option `--ncbi_taxonomy_edition`.
```
mkbdr validate --fasta raw.fasta \
--curate res_raw_curation.csv \
--ncbi_taxonomy_edition customtaxonomy/ \
--output_prefix res_taxo_curated
```
This will output:
```
Validate records...
Loading local NCBI taxonomy...done.
Curating records with faulty taxonomy...
Editing ncbi_taxdump files...done.
1 new nodes have been added in customtaxonomy/nodes.dmp and customtaxonomy/names.dmp.
On 2 faulty records, 2 records are curated.
4 processed records.
On these records, 4 are valid, 0 are faulty format and 0 are faulty taxon.
```
The local taxonomy located in `customtaxonomy/` has been edited and all the faulty-taxonomy records are curated.
:seedling: Finally the following files constitute our reference database:
* `res_taxo_curated_valide.fasta`: a FASTA file with correct records that can be used as a reference database for taxonomic assignment (4 valid records in this step)
* `customtaxonomy`: custom NCBI taxonomy
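After editing a taxonomy, a simple sanity check is to confirm that every node still has a parent in `nodes.dmp` and a scientific name in `names.dmp`. The sketch below is an independent illustration, not an MKBDR feature: `split_dmp` and `check_taxdump` are hypothetical helpers that rely only on the taxdump row layout (fields separated by tab, pipe, tab).

```python
import os


def split_dmp(line):
    """Split a taxdump row: fields are separated by tab-pipe-tab, rows end with tab-pipe."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def check_taxdump(folder):
    """Return (orphans, unnamed) as sorted lists of tax_id strings.

    orphans: nodes whose parent tax_id is absent from nodes.dmp.
    unnamed: nodes without a scientific name in names.dmp.
    """
    parents = {}
    with open(os.path.join(folder, "nodes.dmp"), encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            fields = split_dmp(line)
            parents[fields[0]] = fields[1]  # tax_id -> parent tax_id
    named = set()
    with open(os.path.join(folder, "names.dmp"), encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            fields = split_dmp(line)
            if fields[3] == "scientific name":
                named.add(fields[0])
    orphans = sorted(t for t, parent in parents.items() if parent not in parents)
    unnamed = sorted(t for t in parents if t not in named)
    return orphans, unnamed
```

Calling `check_taxdump("customtaxonomy/")` should return two empty lists if the edited tree is consistent under these two criteria.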
## 6. Test the custom reference database
We will use