... | ... | @@ -25,17 +25,16 @@ The module **[validate](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_ |
mkbdr validate --fasta tests/data/raw.fasta --output_prefix res_raw
```

This outputs:

```
Checking arguments...done.
Validate records...
Loading local NCBI taxonomy...done.
4 processed records.
On these records, 2 are valid, 0 are faulty format and 2 are faulty taxon.
```

* `res_raw_faulty_format.fasta`: a FASTA file with faulty format records (empty in this example)
* `res_raw_faulty_taxon.fasta`: a FASTA file with faulty taxonomy records (2 faulty records in this example)
* `res_raw_valide.fasta`: a FASTA file with correct records that can be used as a reference database for taxonomic assignment (2 valid records in this example)
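
To double-check the summary printed by **validate**, one can count the records in each output file. The following is a minimal Python sketch and not part of MKBDR: the helper names `count_fasta_records` and `summarize_outputs` are hypothetical, and the code only assumes that every record starts with a `>` header line, as in standard FASTA.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count records in a FASTA file by counting '>' header lines."""
    fasta = Path(path)
    if not fasta.exists():
        return 0  # assumption: a missing file is treated like an empty one
    with fasta.open() as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize_outputs(prefix):
    """Return record counts for the three files written by `mkbdr validate`.

    The suffixes are the ones listed above for the `res_raw` prefix.
    """
    suffixes = ("valide", "faulty_format", "faulty_taxon")
    return {s: count_fasta_records(f"{prefix}_{s}.fasta") for s in suffixes}
```

For the example above, `summarize_outputs("res_raw")` should agree with the log: 2 valid, 0 faulty format and 2 faulty taxon records.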
... | ... | @@ -43,19 +42,142 @@ Read more details about output files [here](https://gitlab.mbb.univ-montp2.fr/ed |

:seedling: Of the 4 records, MKBDR flagged 2 as having a faulty taxonomy. A faulty taxonomy means that the species name is unknown in the NCBI taxonomy, so these records cannot be used in a reference database. We need to curate their taxonomy to validate these records.

## 2. Curation generation

The module **[curegen](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Curegen)** produces a [curation CSV file](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Files-definition). This curation file is used later to apply curation to faulty records.

**curegen** takes the FASTA file of faulty taxonomy records as input and outputs a corresponding curation file.

In short, **curegen** searches NCBI for synonyms of the faulty species name of each record. If no synonym is found, it searches a global database for the closest genus name known to the NCBI taxonomy. By default, the global database consulted is 'Fishbase', but let's consult 'Catalogue of Life' instead:

```
mkbdr curegen --fasta res_raw_faulty_taxon.fasta \
              --database_globalnames 'Catalogue of Life' \
              --output_prefix res_raw
```

This will output:

```
Curation generation...
Loading local NCBI taxonomy...done.
Trying fuzzy search for Albula forsteri
FOUND! Albula forsteri taxid:531982 score:0 (1.0)
Trying fuzzy search for Stegastes xanthurus
  current_name         ncbi_name        genus      family         ncbi_rank  method
0 Albula forsteri      Albula argentea  Albula     Albulidae      species    NCBI synonym score=1.0
1 Stegastes xanthurus  NA               Stegastes  Pomacentridae  genus      Catalogue of Life
```

* `res_raw_curation.csv`: a CSV file with instructions to curate taxonomies (2 records)

As we can see, of the 2 faulty taxonomy records, an NCBI synonym was found for one; for the other, a genus name that can be used in the NCBI taxonomy was identified from the Catalogue of Life.
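
The curation file can also be inspected programmatically. The sketch below is illustrative only: it assumes that `res_raw_curation.csv` is comma-separated and has a header row with the column names shown in the log above (`current_name`, `ncbi_rank`, and so on); the authoritative layout is the one described on the Files-definition page. The function name `group_curations_by_rank` is hypothetical.

```python
import csv
from collections import defaultdict


def group_curations_by_rank(csv_path):
    """Group the names to curate by the rank at which a match was found.

    Assumes a comma-separated file with a header row containing at least
    the columns `current_name` and `ncbi_rank` (names taken from the log).
    """
    groups = defaultdict(list)
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            groups[row["ncbi_rank"]].append(row["current_name"])
    return dict(groups)
```

If these assumptions hold, names grouped under `species` have a direct NCBI replacement, while names grouped under `genus` only have a parent genus to attach to.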

:seedling: We generated taxonomy curation instructions for the 2 faulty taxonomy records. The next step applies this curation to the records.

## 3. Validate and curate

Taxonomy curation can be applied with the **validate** module using the `--curate` option. The inputs are:

* `raw.fasta`
* `res_raw_curation.csv`

To apply taxonomy curation only for NCBI synonyms:

```
mkbdr validate --fasta raw.fasta \
               --curate res_raw_curation.csv \
               --output_prefix res_curated
```

This outputs:

```
Validate records...
Loading local NCBI taxonomy...done.
Curating records with faulty taxonomy...
curation not performed on record YCA_R0449;
On 2 faulty records, 1 records are curated.
4 processed records.
On these records, 3 are valid, 0 are faulty format and 1 are faulty taxon.
```

The faulty taxonomy record for which an NCBI synonym was found is curated, so 3 valid records are produced. The faulty species name has been replaced by its NCBI synonym.

* `res_curated_faulty_format.fasta`: a FASTA file with faulty format records (empty in this example)
* `res_curated_faulty_taxon.fasta`: a FASTA file with faulty taxonomy records (1 faulty record in this step)
* `res_curated_valide.fasta`: a FASTA file with correct records that can be used as a reference database for taxonomic assignment (3 valid records in this step)
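
To see which records were fixed by this step, one can compare the faulty taxonomy files produced before and after curation. This is a rough Python sketch under two assumptions that are not confirmed by the MKBDR documentation: the record identifier is the first whitespace-delimited token of the FASTA header, and identifiers are left unchanged by curation. The helper names are hypothetical.

```python
def fasta_ids(path):
    """Return record identifiers: the first token after '>' on header lines."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                ids.append(tokens[0] if tokens else "")
    return ids


def curated_ids(faulty_before, faulty_after):
    """IDs that were faulty before curation but are no longer faulty after it."""
    still_faulty = set(fasta_ids(faulty_after))
    return [rid for rid in fasta_ids(faulty_before) if rid not in still_faulty]
```

For example, `curated_ids("res_raw_faulty_taxon.fasta", "res_curated_faulty_taxon.fasta")` would list the record whose species name was replaced by its NCBI synonym, and `fasta_ids("res_curated_faulty_taxon.fasta")` the record that still needs curation.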

:seedling: Using the `--curate` option, it is possible to apply curation to faulty taxonomy records. Nevertheless, curation is not complete, as one faulty record remains. This one requires adding nodes to the NCBI taxonomy tree itself.

## 4. Load a local NCBI taxonomy

To produce a custom NCBI taxonomy tree, we first need to download and decompress the NCBI taxdump:

```
mkbdr init_ncbi_taxdump --folder_path customtaxonomy/ --decompress
```

This will output:

```
citations.dmp
delnodes.dmp
division.dmp
gencode.dmp
merged.dmp
names.dmp
nodes.dmp
gc.prt
readme.txt
NCBI taxdump initialisation successful !
Path: customtaxonomy/
```

The NCBI taxonomy files are stored in the local folder `customtaxonomy/`.
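
The taxdump files are plain text, so they can be queried directly. As a minimal sketch (not part of MKBDR), the following Python code looks up a name in `names.dmp`. It assumes the standard NCBI taxdump layout, in which fields are separated by a tab, a pipe and a tab, each line ends with a tab and a pipe, and the columns of `names.dmp` are tax_id, name, unique name and name class. The function names are hypothetical.

```python
def parse_dmp_line(line):
    """Split one taxdump line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-record marker
    return line.split("\t|\t")


def find_taxids(names_path, name):
    """Return (taxid, name_class) pairs for every entry matching `name`."""
    matches = []
    with open(names_path) as handle:
        for line in handle:
            fields = parse_dmp_line(line)
            if len(fields) >= 4 and fields[1] == name:
                matches.append((int(fields[0]), fields[3]))
    return matches
```

Calling `find_taxids("customtaxonomy/names.dmp", "Stegastes xanthurus")` at this point is expected to return an empty list, since that name was reported as unknown to the NCBI taxonomy.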

:seedling: The local NCBI taxonomy will be useful in the next step, where we need to edit the NCBI taxonomy to add a new taxon group.

## 5. Validate, curate and edit local taxonomy

MKBDR can edit the local taxonomy with the option `--ncbi_taxonomy_edition`.

```
mkbdr validate --fasta raw.fasta \
               --curate res_raw_curation.csv \
               --ncbi_taxonomy_edition customtaxonomy/ \
               --output_prefix res_taxo_curated
```

This will output:

```
Validate records...
Loading local NCBI taxonomy...done.
Curating records with faulty taxonomy...
Editing ncbi_taxdump files...done.
1 new nodes have been added in customtaxonomy/nodes.dmp and customtaxonomy/names.dmp.
On 2 faulty records, 2 records are curated.
4 processed records.
On these records, 4 are valid, 0 are faulty format and 0 are faulty taxon.
```

The local taxonomy located in `customtaxonomy/` has been edited, and all the faulty taxonomy records are curated.
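
One way to inspect the edited taxonomy is to walk from a taxid up to the root using `nodes.dmp`. The Python sketch below is illustrative and not part of MKBDR. It assumes the standard NCBI taxdump layout, where the first three fields of `nodes.dmp` are tax_id, parent tax_id and rank, and where the root node is its own parent. The function names are hypothetical.

```python
def load_nodes(nodes_path):
    """Map each taxid to a (parent_taxid, rank) pair read from nodes.dmp."""
    nodes = {}
    with open(nodes_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.endswith("\t|"):
                line = line[:-2]  # drop the end-of-record marker
            fields = line.split("\t|\t")
            if len(fields) >= 3:
                nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def lineage(nodes, taxid):
    """Return [(taxid, rank), ...] from `taxid` up to the root."""
    path = []
    seen = set()  # guards against cycles in a hand-edited file
    while taxid in nodes and taxid not in seen:
        seen.add(taxid)
        parent, rank = nodes[taxid]
        path.append((taxid, rank))
        if parent == taxid:  # the root is its own parent
            break
        taxid = parent
    return path
```

After loading `nodes = load_nodes("customtaxonomy/nodes.dmp")`, calling `lineage(nodes, taxid)` with the taxid of the newly added node (which can be found in `customtaxonomy/names.dmp`) should show it attached below its genus.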

:seedling: Finally, the following files constitute our reference database:

* `res_taxo_curated_valide.fasta`: a FASTA file with correct records that can be used as a reference database for taxonomic assignment (4 valid records in this step)
* `customtaxonomy`: custom NCBI taxonomy

## 6. Test the custom reference database

We will use