... | ... | @@ -2,7 +2,7 @@ There are several files that may be needed depending on the analysis. These file |
|
|
|
|
|
# Input Files
|
|
|
|
|
|
* Representative sequences
|
|
|
## Representative sequences
|
|
|
|
|
|
The representative sequences must be stored as a FASTA file. See the definition of the FASTA format on Wikipedia [here](https://en.wikipedia.org/wiki/FASTA_format).
|
|
|
|
... | ... | @@ -30,10 +30,63 @@ ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC |
|
|
* `species_name=` is mandatory and must be the prefix of the species name
|
|
|
* `Mullus surmuletus` is the species name in the NCBI taxonomy. It must be exactly the same as the name in the NCBI taxonomy; otherwise MKBDR will report a taxonomy fault. The species name is composed of two words, _Genus_ and _species_, separated by a delimiter. The delimiter can be `_` or ` `. Any other delimiter makes MKBDR report a format fault.
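
The delimiter rule above can be illustrated with a short Python sketch. It is not MKBDR's own parser: the function name `split_species_name` is hypothetical, and it only checks the two-word shape of the name, not whether the name exists in the NCBI taxonomy.

```python
import re


def split_species_name(name: str) -> tuple[str, str]:
    """Split a species name into (genus, species).

    The two words may be separated by "_" or by a single space.
    Anything else is rejected, mirroring the format rule above.
    """
    parts = re.split(r"[_ ]", name)
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'Genus species' or 'Genus_species', got {name!r}")
    return parts[0], parts[1]


# Both accepted spellings normalise to the same space-separated name.
genus, species = split_species_name("Mullus_surmuletus")
normalised = f"{genus} {species}"
```

Normalising to the space-separated form is a convenient choice here because the NCBI taxonomy stores species names with a space.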
|
|
|
|
|
|
#### Sequence line:
|
|
|
#### DNA sequence line:
|
|
|
|
|
|
* Only `A`, `T`, `G`, `C` characters are allowed. IUPAC ambiguity codes will result in a fatal error.
|
|
|
* Gaps `-` are not allowed
|
|
|
* Empty sequences are not allowed
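
The three rules above can be checked before running MKBDR with a few lines of Python. This is a minimal sketch, not MKBDR's own validation code: the function name `find_sequence_problems` is hypothetical, and treating lowercase bases as invalid is an assumption, since the rules only list uppercase characters.

```python
ALLOWED_BASES = frozenset("ATGC")


def find_sequence_problems(sequence: str) -> list[str]:
    """Return a list of rule violations; an empty list means the sequence passes."""
    problems = []
    if not sequence:
        problems.append("empty sequence")
    if "-" in sequence:
        problems.append("gap character '-' found")
    # Anything that is neither an allowed base nor a gap, e.g. IUPAC codes such as N or R.
    other = sorted(set(sequence) - ALLOWED_BASES - {"-"})
    if other:
        problems.append("characters other than A, T, G, C: " + "".join(other))
    return problems
```

Returning a list rather than raising on the first violation lets a caller report every problem in a sequence at once.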
|
|
|
|
|
|
## NCBI taxonomies file
|
|
|
|
|
|
MKBDR requires the NCBI taxonomy to work properly, so you need to provide MKBDR with the folder containing the NCBI taxonomy files.
|
|
|
|
|
|
You can manually download the NCBI taxonomies file:
|
|
|
|
|
|
```
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar zxvf taxdump.tar.gz
```
|
|
|
|
|
|
Alternatively, MKBDR can download and extract the NCBI taxonomy files into the folder you specify:
|
|
|
|
|
|
```
mkbdr init_ncbi_taxdump --folder_path /target/ncbi_tax/
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
#### Structure of *.dmp files
|
|
|
|
|
|
As per NCBI's taxdump_readme.txt, each file stores one record per line, and each record is terminated by "\t|\n" (tab, vertical bar, and newline) characters. Each record consists of one or more fields delimited by "\t|\t" (tab, vertical bar, and tab) characters. A brief description of the field positions and meanings for each file follows.
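
To make the delimiters concrete, here is a minimal Python sketch that splits one `*.dmp` line into its fields. The helper name `parse_dmp_line` is hypothetical and is not part of MKBDR.

```python
def parse_dmp_line(line: str) -> list[str]:
    """Split one *.dmp record into its fields."""
    record = line.rstrip("\n")
    # Drop the trailing "\t|" that, together with the newline, ends the record.
    if record.endswith("\t|"):
        record = record[:-2]
    # Fields are separated by tab, vertical bar, tab.
    return record.split("\t|\t")


fields = parse_dmp_line("9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n")
```

Splitting on the full three-character delimiter, rather than on `|` alone, keeps empty fields in place and avoids leaving stray tabs in the values.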
|
|
|
|
|
|
#### nodes.dmp
|
|
|
|
|
|
This file represents taxonomy nodes. The description for each node includes the following fields:
|
|
|
|
|
|
```
tax_id -- node id in GenBank taxonomy database
parent tax_id -- parent node id in GenBank taxonomy database
rank -- rank of this node (superkingdom, kingdom, ...)
embl code -- locus-name prefix; not unique
division id -- see division.dmp file
inherited div flag (1 or 0) -- 1 if node inherits division from parent
genetic code id -- see gencode.dmp file
inherited GC flag (1 or 0) -- 1 if node inherits genetic code from parent
mitochondrial genetic code id -- see gencode.dmp file
inherited MGC flag (1 or 0) -- 1 if node inherits mitochondrial gencode from parent
GenBank hidden flag (1 or 0) -- 1 if name is suppressed in GenBank entry lineage
hidden subtree root flag (1 or 0) -- 1 if this subtree has no sequence data yet
comments -- free-text comments and citations
```
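
As an illustration of how the first three fields can be used, the sketch below loads `tax_id`, `parent tax_id` and `rank` into a dictionary and walks from a node up to the root. The function names `load_nodes` and `lineage` are hypothetical, and the sample records are toy data with made-up ids, truncated to three fields. The only property taken from the real file is that the root node (tax_id 1) lists itself as its parent.

```python
def load_nodes(lines):
    """Map tax_id -> (parent_tax_id, rank) from nodes.dmp records."""
    nodes = {}
    for line in lines:
        record = line.rstrip("\n")
        if record.endswith("\t|"):
            record = record[:-2]
        fields = record.split("\t|\t")
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def lineage(nodes, tax_id):
    """Return the tax_ids from the given node up to the root, inclusive."""
    path = [tax_id]
    # The root is its own parent, which ends the walk.
    while nodes[tax_id][0] != tax_id:
        tax_id = nodes[tax_id][0]
        path.append(tax_id)
    return path


# Toy records (made-up ids, only the first three fields).
toy_nodes = load_nodes([
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tgenus\t|\n",
    "100\t|\t10\t|\tspecies\t|\n",
])
toy_path = lineage(toy_nodes, 100)
```

With a real `nodes.dmp`, the same functions can be fed an open file handle, since iterating over a text file yields one record per line.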
|
|
|
|
|
|
#### names.dmp
|
|
|
|
|
|
Taxonomy names file has these fields:
|
|
|
|
|
|
```
tax_id -- the id of node associated with this name
name_txt -- name itself
unique name -- the unique variant of this name if name not unique
name class -- (synonym, common name, ...)
```
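
Because the species name in the FASTA header has to match the NCBI taxonomy exactly, it can be useful to look a name up in `names.dmp` beforehand. The following Python sketch shows one way to do that. It is not how MKBDR performs the check: the function name `load_scientific_names` is hypothetical, and keeping only records whose name class is `scientific name` is an assumption about which names are relevant.

```python
def load_scientific_names(lines):
    """Map each scientific name to its tax_id, reading names.dmp records."""
    name_to_taxid = {}
    for line in lines:
        record = line.rstrip("\n")
        if record.endswith("\t|"):
            record = record[:-2]
        tax_id, name_txt, _unique_name, name_class = record.split("\t|\t")
        # Synonyms, common names and other name classes are skipped.
        if name_class == "scientific name":
            name_to_taxid[name_txt] = int(tax_id)
    return name_to_taxid


sample = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
]
names = load_scientific_names(sample)
is_known = "Homo sapiens" in names
```

If a name is not unique in the file, this simple dictionary keeps only the last record seen, whereas the `unique name` field exists to disambiguate such cases.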