# Custom Metabarcoding Reference Database

[![Twitter Follow](https://img.shields.io/twitter/follow/ephe_bev?style=social)](https://twitter.com/ephe_bev)

Scripts to convert FASTA files into a reference database linked to the NCBI taxonomy.

## Introduction

These scripts build a custom reference database that contains only our own sequences and uses the NCBI taxonomy.

## Workflow

* inputs:
    * FASTA file

0. Get raw FASTA files of new sequences with species names
1. Extract sequence names
2. Check sequence name format
3. Check sequence format (IUPAC ambiguity codes, gaps)
4. Correct NCBI taxonomy species names (this step is semi-automatic)
5. Attribute NCBI taxonomy taxids
6. Extract names with a missing taxid
    1. Attribute the NCBI taxonomy taxid of the genus
    2. Run the obitaxonomy command for species without an attributed taxid
7. Write a FASTA file of sequences with their taxid and complete genus-species name

* outputs:
    * formatted FASTA file
    * .ldx new nodes for missing taxids in the taxonomy, linked to existing genus/family taxids

In practice, the steps above run as the following loop:

1. `raw fasta` --> validate --> `valide fasta` `faulty_format fasta` `faulty_taxon fasta`
2. `faulty_taxon fasta` --> curate (currently done by Laetitia) --> `curated_taxon csv`
3. Check and correct the `curated_taxon csv` table by hand
4. `raw fasta` `curated_taxon csv` --> validate --> `valide fasta` and an update of the taxonomy

## Environment

To create the environments with the required software:

```
conda env create -f envs/obitools_envs.yaml
conda env create -f envs/pylib_cbdr.yaml
```

* Obitools

```
conda activate obitools
```

* Required Python libraries to build the custom reference database

```
conda activate pylib_cbdr
```

## Usage

First run, loading the taxdump

```
mkbdr validate --fasta resources/test/raw.fasta \
    --ncbi_taxdump "TAXO/taxdump_2021.tar.gz" \
    --output_prefix "test_raw"
```

Taxdump previously loaded (faster)

```
mkbdr validate --fasta resources/test/raw.fasta \
    --output_prefix "test_raw"
```

Apply a curation

```
mkbdr validate --fasta resources/test/raw.fasta \
    --curate curated_taxon.csv --output_prefix "test_curated"
```

Generate a curation CSV file

```
mkbdr curegen --fasta test_raw_faulty_taxon.fasta \
    --output_prefix "test"
```

Specify the globalnames database to query

```
mkbdr curegen --fasta test_raw_faulty_taxon.fasta \
    --output_prefix "test" \
    --database_globalnames 'Catalogue of Life'
```

_______________________________________________________________________________

Crash test

```
python3 mkbdr validate --fasta teleo_ok.fasta --curate curated_taxon.csv --output_prefix "truc"
```
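
To illustrate workflow steps 2 and 3 (sequence name and sequence format checks), here is a minimal Python sketch. It is not the `mkbdr` implementation. The header convention (`Genus_species`), the function names, and the problem labels are assumptions made for this example, and treating ambiguity codes as a problem is also an assumption about what "check" means here.

```python
import re

# IUPAC nucleotide codes: the four bases plus the ambiguity codes.
IUPAC_BASES = set("ACGT")
IUPAC_AMBIGUOUS = set("RYSWKMBDHVN")
GAP_CHARS = set("-.")

# Hypothetical header convention (an assumption, not taken from mkbdr):
# "Genus_species", optionally followed by free text after whitespace.
NAME_PATTERN = re.compile(r"^[A-Z][a-z]+_[a-z]+(\s.*)?$")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_record(header, sequence):
    """Return the list of problems found in one record (empty if none)."""
    problems = []
    if not NAME_PATTERN.match(header):
        problems.append("name_format")
    residues = set(sequence.upper())
    if residues & GAP_CHARS:
        problems.append("gaps")
    if residues & IUPAC_AMBIGUOUS:
        problems.append("ambiguous_bases")
    if residues - IUPAC_BASES - IUPAC_AMBIGUOUS - GAP_CHARS:
        problems.append("invalid_characters")
    if not sequence:
        problems.append("empty_sequence")
    return problems


def split_records(text):
    """Split records into valid headers and (header, problems) pairs."""
    valid, faulty = [], []
    for header, sequence in parse_fasta(text):
        problems = check_record(header, sequence)
        if problems:
            faulty.append((header, problems))
        else:
            valid.append(header)
    return valid, faulty


if __name__ == "__main__":
    example = ">Salmo_trutta\nACGTACGT\n>bad name\nACG-NNX\n"
    valid, faulty = split_records(example)
    print("valid:", valid)
    print("faulty:", faulty)
```

Returning a list of problem labels per record, rather than a single pass/fail flag, makes it straightforward to route records to separate output files, in the spirit of the `faulty_format fasta` output described in the workflow.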