| ... | ... | @@ -25,10 +25,9 @@ The module **[validate](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_ | 
|  |  | mkbdr validate --fasta tests/data/raw.fasta --output_prefix res_raw | 
|  |  | ``` | 
|  |  |  | 
|  |  | This will outputs: | 
|  |  | This outputs: | 
|  |  |  | 
|  |  | ``` | 
|  |  | Checking arguments...done. | 
|  |  | Validate records... | 
|  |  | Loading local NCBI taxonomy...done. | 
|  |  | 4 processed records. | 
| ... | ... | @@ -43,19 +42,142 @@ Read more details about output files [here](https://gitlab.mbb.univ-montp2.fr/ed | 
|  |  |  | 
|  |  | :seedling: Of the 4 records, MKBDR flagged 2 as having a faulty taxonomy. A faulty taxonomy means that the species name is unknown in the NCBI taxonomy, so these records cannot be used in a reference database. We need to curate their taxonomy to validate them. | 
|  |  |  | 
|  |  | ## 2. Curation | 
|  |  | ## 2. Curation generation | 
|  |  |  | 
|  |  | The module **[curegen](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Curegen)** produces a [curation CSV file](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Files-definition) | 
|  |  | The module **[curegen](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Curegen)** produces a [curation CSV file](https://gitlab.mbb.univ-montp2.fr/edna/custom_reference_database/-/wikis/Files-definition). This curation file will be helpful later to apply curation on faulty records. | 
|  |  |  | 
|  |  | **curegen** needs the faulty taxonomy FASTA records file as input and will output a corresponding curation file. | 
|  |  |  | 
|  |  | In short, **curegen** looks up NCBI synonyms of the faulty species name for each record. If no synonym is found, it searches global databases for the closest genus name known to the NCBI taxonomy. By default the global database consulted is 'Fishbase', but let's consult 'Catalogue of Life' instead: | 
|  |  |  | 
|  |  | ``` | 
|  |  | mkbdr curegen --fasta res_raw_faulty_taxon.fasta \ | 
|  |  | --database_globalnames 'Catalogue of Life' \ | 
|  |  | --output_prefix res_raw | 
|  |  | ``` | 
|  |  |  | 
|  |  | This will output: | 
|  |  |  | 
|  |  | ``` | 
|  |  | Curation generation... | 
|  |  | Loading local NCBI taxonomy...done. | 
|  |  | Trying fuzzy search for Albula forsteri | 
|  |  | FOUND!    Albula forsteri taxid:531982 score:0 (1.0) | 
|  |  | Trying fuzzy search for Stegastes xanthurus | 
|  |  | current_name        ncbi_name      genus         family ncbi_rank                  method | 
|  |  | 0      Albula forsteri  Albula argentea     Albula      Albulidae   species  NCBI synonym score=1.0 | 
|  |  | 1  Stegastes xanthurus               NA  Stegastes  Pomacentridae     genus       Catalogue of Life | 
|  |  | ``` | 
|  |  |  | 
|  |  | * `res_raw_curation.csv`: a CSV file with instructions to cure taxonomies (2 records) | 
|  |  |  | 
|  |  | As the output shows, of the 2 faulty taxonomy records, an NCBI synonym was found for one; for the other, a genus name that can be used in NCBI was identified from the Catalogue of Life. | 
|  |  |  | 
|  |  |  | 
|  |  | :seedling: We generated taxonomy curation instructions for the 2 faulty taxonomy records. The next step applies this curation to the records. | 
|  |  |  | 
|  |  |  | 
|  |  | ## 3. Validate and curate | 
|  |  |  | 
|  |  | Taxonomy curation can be applied with the **validate** module using the `--curate` option. The inputs are: | 
|  |  |  | 
|  |  | * `raw.fasta` | 
|  |  | * `res_raw_curation.csv` | 
|  |  |  | 
|  |  | To apply taxonomy curation only for NCBI synonyms: | 
|  |  |  | 
|  |  | ``` | 
|  |  | mkbdr validate --fasta raw.fasta \ | 
|  |  | --curate res_raw_curation.csv \ | 
|  |  | --output_prefix res_curated | 
|  |  | ``` | 
|  |  |  | 
|  |  | This outputs: | 
|  |  |  | 
|  |  | ``` | 
|  |  | Validate records... | 
|  |  | Loading local NCBI taxonomy...done. | 
|  |  | Curating records with faulty taxonomy... | 
|  |  | curation not performed on record YCA_R0449; | 
|  |  | On 2 faulty records, 1 records are curated. | 
|  |  | 4 processed records. | 
|  |  | On these records, 3 are valid, 0 are faulty format and 1 are faulty taxon. | 
|  |  | ``` | 
|  |  |  | 
|  |  | The faulty taxonomy record for which an NCBI synonym was found has been curated, so 3 valid records are produced. The faulty species name has been replaced by its NCBI synonym. | 
|  |  |  | 
|  |  | * `res_curated_faulty_format.fasta`: a FASTA file with faulty format records (empty in this example) | 
|  |  | * `res_curated_faulty_taxon.fasta`: a FASTA file with faulty taxonomy records (1 faulty record in this step) | 
|  |  | * `res_curated_valide.fasta`: a FASTA file with correct records that can be used as a reference database for taxonomic assignment (3 valid records in this step) | 
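
The record counts reported above can be cross-checked directly on the output files. The snippet below is a minimal sketch that is not part of MKBDR: the helper name `count_fasta_records` is hypothetical, and it assumes that each FASTA record starts with exactly one header line beginning with `>`.

```shell
# Hypothetical helper (not an MKBDR command): count FASTA records
# by counting header lines, i.e. lines that start with '>'.
count_fasta_records() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
    grep -c '^>' "$1" || true
}

# Print one "file<TAB>count" line per output file that exists.
for f in res_curated_valide.fasta res_curated_faulty_taxon.fasta res_curated_faulty_format.fasta; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_fasta_records "$f")"
    fi
done
```

If the counts match the summary line printed by **validate**, the three files are consistent with the run.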
|  |  |  | 
|  |  |  | 
|  |  | mkbdr curegen --database_globalnames 'Catalogue of Life' --output_prefix res_raw --fasta res_raw_faulty_taxon.fasta | 
|  |  | :seedling: With the `--curate` option, curation can be applied to faulty taxonomy records. However, curation is not complete, as one faulty record remains. Curing it requires adding nodes to the NCBI taxonomy tree itself. | 
|  |  |  | 
|  |  | mkbdr validate --fasta tests/data/raw.fasta --curate res_raw_curation.csv --output_prefix res_curated | 
|  |  | ## 4. Load a local NCBI taxonomy | 
|  |  |  | 
|  |  | To produce a custom NCBI taxonomy tree, we first need to download and decompress the NCBI taxdump: | 
|  |  |  | 
|  |  | ``` | 
|  |  | mkbdr init_ncbi_taxdump --folder_path customtaxonomy/ --decompress | 
|  |  | ``` | 
|  |  |  | 
|  |  | This will output: | 
|  |  |  | 
|  |  | ``` | 
|  |  | citations.dmp | 
|  |  | delnodes.dmp | 
|  |  | division.dmp | 
|  |  | gencode.dmp | 
|  |  | merged.dmp | 
|  |  | names.dmp | 
|  |  | nodes.dmp | 
|  |  | gc.prt | 
|  |  | readme.txt | 
|  |  | NCBI taxdump initialisation successful ! | 
|  |  | Path: customtaxonomy/ | 
|  |  | ``` | 
|  |  |  | 
|  |  | The NCBI taxonomy files are stored in the local folder `customtaxonomy/`. | 
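
Before editing the taxonomy, it can help to confirm that the two files modified in the next step are present. The following is a minimal sketch, not an MKBDR feature: the function name `check_taxdump` is hypothetical, and it only checks that `names.dmp` and `nodes.dmp` exist and are non-empty.

```shell
# Hypothetical helper (not an MKBDR command): check that the taxdump
# files edited in the next step exist and are non-empty in a folder.
check_taxdump() {
    dir=$1
    rc=0
    for name in names.dmp nodes.dmp; do
        if [ -s "$dir/$name" ]; then
            echo "ok       $dir/$name"
        else
            echo "missing  $dir/$name"
            rc=1
        fi
    done
    # Non-zero return status if at least one file is missing or empty.
    return "$rc"
}

check_taxdump customtaxonomy || echo "taxdump looks incomplete"
```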
|  |  |  | 
|  |  |  | 
|  |  | :seedling: The local NCBI taxonomy will be useful in the next step, where we edit the NCBI taxonomy to add a new taxon group. | 
|  |  |  | 
|  |  |  | 
|  |  | ## 5. Validate, curate and edit local taxonomy | 
|  |  |  | 
|  |  | MKBDR can edit local taxonomy with the option `--ncbi_taxonomy_edition`. | 
|  |  |  | 
|  |  | ``` | 
|  |  | mkbdr validate --fasta raw.fasta \ | 
|  |  | --curate res_raw_curation.csv \ | 
|  |  | --ncbi_taxonomy_edition  customtaxonomy/ \ | 
|  |  | --output_prefix res_taxo_curated | 
|  |  | ``` | 
|  |  |  | 
|  |  |  | 
|  |  | This will output: | 
|  |  |  | 
|  |  | ``` | 
|  |  | Validate records... | 
|  |  | Loading local NCBI taxonomy...done. | 
|  |  | Curating records with faulty taxonomy... | 
|  |  | Editing ncbi_taxdump files...done. | 
|  |  | 1 new nodes have been added in customtaxonomy/nodes.dmp and customtaxonomy/names.dmp. | 
|  |  | On 2 faulty records, 2 records are curated. | 
|  |  | 4 processed records. | 
|  |  | On these records, 4 are valid, 0 are faulty format and 0 are faulty taxon. | 
|  |  | ``` | 
|  |  | The local taxonomy located in `customtaxonomy/` has been edited, and all the faulty taxonomy records are curated. | 
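
One way to inspect the edited taxonomy is to look up a name in `names.dmp`. The sketch below is not part of MKBDR and the function name `find_taxon_name` is hypothetical. It relies on the standard NCBI taxdump layout, in which fields are separated by a tab, a pipe and a tab, with the name in the second field. That the new node is named `Stegastes xanthurus` is an assumption based on the curation table shown earlier.

```shell
# Hypothetical helper (not an MKBDR command): print the rows of names.dmp
# whose name field (2nd field) is exactly the given name.
# Usage: find_taxon_name <taxdump_folder> <name>
find_taxon_name() {
    # names.dmp fields are separated by "<tab>|<tab>".
    awk -F '\t\\|\t' -v name="$2" '$2 == name' "$1/names.dmp"
}

# Assumption: the node added by MKBDR carries the curated species name.
if [ -f customtaxonomy/names.dmp ]; then
    find_taxon_name customtaxonomy 'Stegastes xanthurus'
fi
```

An empty result would mean that no row carries that exact name, in which case the new node may have been recorded under a different name.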
|  |  |  | 
|  |  | :seedling: Finally, the following files constitute our reference database: | 
|  |  |  | 
|  |  | * `res_taxo_curated_valide.fasta`: a FASTA file with correct records that can be used as a reference database for taxonomic assignment (4 valid records in this step) | 
|  |  | * `customtaxonomy`: custom NCBI taxonomy | 
|  |  |  | 
|  |  |  | 
|  |  | ## 6. Test the custom reference database | 
|  |  |  | 
|  |  | We will use | 
|  |  |  | 
|  |  | mkbdr validate --fasta tests/data/raw.fasta  --curate res_raw_curation.csv --ncbi_taxonomy_edition  customtaxonomy/ --output_prefix res_taxo_curated | 
|  |  |  | 
|  |  |  |