Loading README.md +13 −1 Original line number Diff line number Diff line Loading @@ -31,6 +31,10 @@ You will also need to have the following programs installed on your computer. - [OBItools](https://pythonhosted.org/OBITools/welcome.html#installing-the-obitools) - [ecoPCR](https://git.metabarcoding.org/obitools/ecopcr/wikis/home) - [python 3.6.6](https://www.python.org/downloads/release/python-366/) - `python 3 library` [biopython](https://biopython.org/) - `python 3 library` [etetoolkit](http://etetoolkit.org/) ## Preparation Loading @@ -41,7 +45,7 @@ You will also need to have the following programs installed on your computer. ## Build a reference database * Overview of the steps 1. Download the sequences¶ 1. Download the sequences 2. Download the taxonomy 3. Use [ecoPCR](https://git.metabarcoding.org/obitools/ecopcr/wikis/home) to simulate an *in silico* PCR Loading Loading @@ -73,3 +77,11 @@ You can use the absolute path of the folder of your reference database as the `/ * [nextflow_obitools](https://gitlab.mbb.univ-montp2.fr/edna/nextflow_obitools) * [snakemake_only_obitools](http://gitlab.mbb.univ-montp2.fr/edna/snakemake_only_obitools) ## Reference database in STAMPA format Once you generated you're reference database, you need the `db_{prefix}.fasta` file (see [results](Results) section). To convert your reference database in STAMPA format type : ``` python3 scripts/formatDB_obifasta_to_stampa.py -f db_{prefix}.fasta -o /path/to/reference_database_stampa.fasta ``` It will generate a file `/path/to/reference_database_stampa.fasta` which you can use as a reference database into the pipeline [bash_swarm](https://gitlab.mbb.univ-montp2.fr/edna/bash_swarm) No newline at end of file scripts/formatDB_obifasta_to_stampa.py 0 → 100644 +72 −0 Original line number Diff line number Diff line #=============================================================================== #INFORMATIONS #=============================================================================== """ CEFE - EPHE - YERSIN 2018 guerin pierre-edouard from reference database obitools fasta files, it creates a stampa-format reference database """ #=============================================================================== #USAGES #=============================================================================== """ input : path to FASTA obitools file with "species=" information field available output : path to write the STAMPA-format FASTA reference databasae file command : python3 ncbi_to_stampa.py -f input -o output """ #=============================================================================== #MODULES #=============================================================================== import Bio from Bio import SeqIO from Bio.Alphabet import IUPAC from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord import argparse from ete3 import NCBITaxa #=============================================================================== #ARGUMENTS #=============================================================================== parser = argparse.ArgumentParser(description='convert fasta obitools to stampa format') parser.add_argument("-o","--output", type=str) parser.add_argument("-f","--fasta",type=str) #=============================================================================== #MAIN #=============================================================================== args = parser.parse_args() outputFile = args.output fastaFile = args.fasta ncbi = NCBITaxa() mes_records=[] for seq_record in SeqIO.parse(fastaFile, "fasta",alphabet=IUPAC.unambiguous_dna): local_id=seq_record.id sr_description=seq_record.description.split(';') #print(local_id) #print(sr_description) for elem in sr_description: #print(elem.split('=')[0]) if len(elem) == 0: continue if elem.split('=')[0] == " species": cmd_get_dico="id_species="+str(elem.split('=')[1]) exec(cmd_get_dico) if id_species != -1: lineage=ncbi.get_lineage(id_species) names = ncbi.get_taxid_translator(lineage) all_tax_species="" for taxid in lineage[2:]: all_tax_species+=names[taxid] all_tax_species+="|" all_tax_species=all_tax_species.replace(" ","+") else: continue local_seq=Seq(str(repr(str(seq_record.seq.lower()))).replace("'",""),IUPAC.unambiguous_dna) local_record=SeqRecord(local_seq,id=seq_record.id,description=all_tax_species[:-1]) mes_records.append(local_record) SeqIO.write(mes_records, outputFile, "fasta") Loading
README.md +13 −1 Original line number Diff line number Diff line Loading @@ -31,6 +31,10 @@ You will also need to have the following programs installed on your computer. - [OBItools](https://pythonhosted.org/OBITools/welcome.html#installing-the-obitools) - [ecoPCR](https://git.metabarcoding.org/obitools/ecopcr/wikis/home) - [python 3.6.6](https://www.python.org/downloads/release/python-366/) - `python 3 library` [biopython](https://biopython.org/) - `python 3 library` [etetoolkit](http://etetoolkit.org/) ## Preparation Loading @@ -41,7 +45,7 @@ You will also need to have the following programs installed on your computer. ## Build a reference database * Overview of the steps 1. Download the sequences¶ 1. Download the sequences 2. Download the taxonomy 3. Use [ecoPCR](https://git.metabarcoding.org/obitools/ecopcr/wikis/home) to simulate an *in silico* PCR Loading Loading @@ -73,3 +77,11 @@ You can use the absolute path of the folder of your reference database as the `/ * [nextflow_obitools](https://gitlab.mbb.univ-montp2.fr/edna/nextflow_obitools) * [snakemake_only_obitools](http://gitlab.mbb.univ-montp2.fr/edna/snakemake_only_obitools) ## Reference database in STAMPA format Once you generated you're reference database, you need the `db_{prefix}.fasta` file (see [results](Results) section). To convert your reference database in STAMPA format type : ``` python3 scripts/formatDB_obifasta_to_stampa.py -f db_{prefix}.fasta -o /path/to/reference_database_stampa.fasta ``` It will generate a file `/path/to/reference_database_stampa.fasta` which you can use as a reference database into the pipeline [bash_swarm](https://gitlab.mbb.univ-montp2.fr/edna/bash_swarm) No newline at end of file
scripts/formatDB_obifasta_to_stampa.py 0 → 100644 +72 −0 Original line number Diff line number Diff line #=============================================================================== #INFORMATIONS #=============================================================================== """ CEFE - EPHE - YERSIN 2018 guerin pierre-edouard from reference database obitools fasta files, it creates a stampa-format reference database """ #=============================================================================== #USAGES #=============================================================================== """ input : path to FASTA obitools file with "species=" information field available output : path to write the STAMPA-format FASTA reference databasae file command : python3 ncbi_to_stampa.py -f input -o output """ #=============================================================================== #MODULES #=============================================================================== import Bio from Bio import SeqIO from Bio.Alphabet import IUPAC from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord import argparse from ete3 import NCBITaxa #=============================================================================== #ARGUMENTS #=============================================================================== parser = argparse.ArgumentParser(description='convert fasta obitools to stampa format') parser.add_argument("-o","--output", type=str) parser.add_argument("-f","--fasta",type=str) #=============================================================================== #MAIN #=============================================================================== args = parser.parse_args() outputFile = args.output fastaFile = args.fasta ncbi = NCBITaxa() mes_records=[] for seq_record in SeqIO.parse(fastaFile, "fasta",alphabet=IUPAC.unambiguous_dna): local_id=seq_record.id sr_description=seq_record.description.split(';') #print(local_id) #print(sr_description) for elem in sr_description: #print(elem.split('=')[0]) if len(elem) == 0: continue if elem.split('=')[0] == " species": cmd_get_dico="id_species="+str(elem.split('=')[1]) exec(cmd_get_dico) if id_species != -1: lineage=ncbi.get_lineage(id_species) names = ncbi.get_taxid_translator(lineage) all_tax_species="" for taxid in lineage[2:]: all_tax_species+=names[taxid] all_tax_species+="|" all_tax_species=all_tax_species.replace(" ","+") else: continue local_seq=Seq(str(repr(str(seq_record.seq.lower()))).replace("'",""),IUPAC.unambiguous_dna) local_record=SeqRecord(local_seq,id=seq_record.id,description=all_tax_species[:-1]) mes_records.append(local_record) SeqIO.write(mes_records, outputFile, "fasta")