Commit 8db81907 authored by peguerin's avatar peguerin
Browse files

stampa

parent eb3d0c37
Loading
Loading
Loading
Loading
+13 −1
Original line number Diff line number Diff line
@@ -31,6 +31,10 @@ You will also need to have the following programs installed on your computer.

- [OBItools](https://pythonhosted.org/OBITools/welcome.html#installing-the-obitools)
- [ecoPCR](https://git.metabarcoding.org/obitools/ecopcr/wikis/home)
- [python 3.6.6](https://www.python.org/downloads/release/python-366/)
- `python 3 library` [biopython](https://biopython.org/)
- `python 3 library` [etetoolkit](http://etetoolkit.org/)


## Preparation

@@ -41,7 +45,7 @@ You will also need to have the following programs installed on your computer.
## Build a reference database

* Overview of the steps
	1. Download the sequences
	1. Download the sequences
	2. Download the taxonomy
	3. Use [ecoPCR](https://git.metabarcoding.org/obitools/ecopcr/wikis/home)
 to simulate an *in silico* PCR
@@ -73,3 +77,11 @@ You can use the absolute path of the folder of your reference database as the `/
* [nextflow_obitools](https://gitlab.mbb.univ-montp2.fr/edna/nextflow_obitools)
* [snakemake_only_obitools](http://gitlab.mbb.univ-montp2.fr/edna/snakemake_only_obitools)

## Reference database in STAMPA format

Once you generated you're reference database, you need the `db_{prefix}.fasta` file (see [results](Results) section).
To convert your reference database in STAMPA format type :
```
python3 scripts/formatDB_obifasta_to_stampa.py -f db_{prefix}.fasta -o /path/to/reference_database_stampa.fasta
```
It will generate a file `/path/to/reference_database_stampa.fasta` which you can use as a reference database into the pipeline [bash_swarm](https://gitlab.mbb.univ-montp2.fr/edna/bash_swarm)
 No newline at end of file
+72 −0
Original line number Diff line number Diff line
#===============================================================================
#INFORMATIONS
#===============================================================================
"""
CEFE - EPHE - YERSIN 2018
guerin pierre-edouard
from reference database obitools fasta files, 
it creates a stampa-format reference database
"""
#===============================================================================
#USAGES
#===============================================================================
"""
input :
path to FASTA obitools file with "species=" information field available
output :
path to write the STAMPA-format FASTA reference databasae file
command : 
python3 ncbi_to_stampa.py -f input -o output
"""
#===============================================================================
#MODULES
#===============================================================================
import Bio
from Bio import SeqIO
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import argparse
from ete3 import NCBITaxa
#===============================================================================
#ARGUMENTS
#===============================================================================

parser = argparse.ArgumentParser(description='convert fasta obitools to stampa format')
parser.add_argument("-o","--output", type=str)
parser.add_argument("-f","--fasta",type=str)
#===============================================================================
#MAIN
#===============================================================================
args = parser.parse_args()
outputFile = args.output
fastaFile = args.fasta
ncbi = NCBITaxa()
mes_records=[]
for seq_record in SeqIO.parse(fastaFile, "fasta",alphabet=IUPAC.unambiguous_dna):
    local_id=seq_record.id
    sr_description=seq_record.description.split(';')
    #print(local_id)
    #print(sr_description)
    for elem in sr_description:
        #print(elem.split('=')[0])
        if len(elem) == 0:
            continue
        if elem.split('=')[0] == " species":
            cmd_get_dico="id_species="+str(elem.split('=')[1])
            exec(cmd_get_dico)
            if id_species != -1:
                lineage=ncbi.get_lineage(id_species)
                names = ncbi.get_taxid_translator(lineage)
                all_tax_species=""
                for taxid in lineage[2:]:
                    all_tax_species+=names[taxid]
                    all_tax_species+="|"
                all_tax_species=all_tax_species.replace(" ","+")
            else:
                continue
    local_seq=Seq(str(repr(str(seq_record.seq.lower()))).replace("'",""),IUPAC.unambiguous_dna)
    local_record=SeqRecord(local_seq,id=seq_record.id,description=all_tax_species[:-1])
    mes_records.append(local_record)
SeqIO.write(mes_records, outputFile, "fasta")