Scripts to convert FASTA files into a reference database linked to the NCBI taxonomy.
Scripts to convert FASTA files into a reference database with the NCBI taxonomy.
## Introduction
Scripts to create our own reference database from our own sequences only, using the NCBI taxonomy.
**mkbdr** is a Python program designed to create a reference database from a FASTA file using the NCBI taxonomy. It also provides tools to assist with and perform taxonomy curation on the input FASTA file.
...
...
@@ -17,28 +17,29 @@ scripts to create our own reference database with our own sequences only and usi
* inputs:
* FASTA file
0. Get raw FASTA files of new sequences with species names
1. Extract sequence name
2. Check sequence name format
3. Check sequence format (IUPAC ambiguity codes, gaps)
4. Correct NCBI taxonomy species name (this step is semi-automatic)
5. Attribute NCBI taxonomy taxid
6. Extract names with missing taxid
    1. Attribute NCBI taxonomy taxid of genus
    2. Run the `obitaxonomy` command for species without an attributed taxid
7. Write FASTA file of sequences with their taxid and complete genus-species name
1. Check FASTA format
2. Check species name format
3. Check DNA sequence format
4. Check species name against NCBI taxonomy
5. Attribute NCBI taxid
6. Write `valid`, `faulty_taxon` and `faulty_format` FASTA files
7. Curate species names using the `curation` CSV file
8. Write new nodes in the NCBI taxonomy for species without an attributed taxid
9. Write `valid` FASTA files
* outputs:
* formatted FASTA file
* `.ldx` file of new taxonomy nodes for species with missing taxids, linked to an existing genus or family taxid
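
The format checks, taxid attribution, and three-way split into `valid`, `faulty_taxon` and `faulty_format` described in the steps above could look roughly like the sketch below. It is only an illustration, not the actual mkbdr code: the function names `parse_fasta` and `triage`, the `>Genus species` header convention, the choice to reject gap characters, and the in-memory `name_to_taxid` dictionary (standing in for a lookup against the NCBI taxonomy) are all assumptions.

```python
import re

# IUPAC nucleotide codes. Gap characters ("-", ".") are rejected here,
# which is an assumption about what "Check DNA sequence format" means.
IUPAC_DNA = set("ACGTRYSWKMBDHVN")

# Assumed header convention: exactly "Genus species" after the ">".
NAME_RE = re.compile(r"^[A-Z][a-z]+ [a-z]+$")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def triage(text, name_to_taxid):
    """Sort records into valid, faulty_taxon and faulty_format bins.

    name_to_taxid is a hypothetical dict mapping species names to taxids.
    """
    bins = {"valid": [], "faulty_taxon": [], "faulty_format": []}
    for name, seq in parse_fasta(text):
        seq_ok = bool(seq) and set(seq.upper()) <= IUPAC_DNA
        if not NAME_RE.match(name) or not seq_ok:
            # Malformed species name or non-IUPAC characters in the sequence
            bins["faulty_format"].append((name, seq))
        elif name not in name_to_taxid:
            # Well-formed name that is unknown to the taxonomy lookup
            bins["faulty_taxon"].append((name, seq))
        else:
            bins["valid"].append((name, seq, name_to_taxid[name]))
    return bins


if __name__ == "__main__":
    example = ">Salmo salar\nACGTN\nRYAC\n>Salmo fakeus\nACGT\n>salmo_salar\nACGT\n"
    # Toy lookup table for illustration only
    result = triage(example, {"Salmo salar": 8030})
    for label, records in result.items():
        print(label, len(records))
```

Records that land in `faulty_taxon` are the ones a curation step would then resolve, either by correcting the species name or by creating a new taxonomy node.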