peguerin · 9a02789c
--- a/Files-definition.md
+++ b/Files-definition.md
@@ -2,7 +2,7 @@ There are several files that may be needed depending on the analysis. These file

 # Input Files

-* Representative sequences
+## Representative sequences

 The representative sequences must be stored as a FASTA file. See the definition of the FASTA format on Wikipedia [here](https://en.wikipedia.org/wiki/FASTA_format).

@@ -30,10 +30,63 @@ ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
 * `species_name=` is mandatory and must be the prefix of the species name
 * `Mullus surmuletus` is the species name in the NCBI taxonomy. It has to be exactly the same as the name in the NCBI taxonomy; otherwise MKBDR will report a taxonomy fault. The species name is composed of two words, _Genus_ and _species_, separated by a delimiter. The delimiter can be `_` or ` ` (a space); any other delimiter makes MKBDR report a format fault.

-#### Sequence line:
+#### DNA sequence line:

 * Only `A`, `T`, `G`, `C` characters are allowed. IUPAC ambiguity codes will result in a fatal error.
 * Gaps `-` are not allowed
 * Empty sequences are not allowed

+## NCBI taxonomies file

+MKBDR requires the NCBI taxonomy to work properly, so you need to give MKBDR the path to the folder containing the NCBI taxonomy dump files.
+
+You can download and extract the NCBI taxonomy dump manually:
+
+```
+wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
+tar zxvf taxdump.tar.gz
+```
+
+Alternatively, MKBDR can download and extract the NCBI taxonomy dump into the folder you specify:
+
+```
+mkbdr init_ncbi_taxdump --folder_path /target/ncbi_tax/
+```
+
+
+
+
+#### Structure of *.dmp files
+
+As per NCBI's `taxdump_readme.txt`, each file stores one record per line, and each record is terminated by `\t|\n` (tab, vertical bar, and newline). Each record consists of one or more fields delimited by `\t|\t` (tab, vertical bar, and tab). A brief description of the field positions and meanings for each file follows.
+
+#### nodes.dmp
+
+This file represents taxonomy nodes. The description for each node includes the following fields:
+
+```
+tax_id                              -- node id in GenBank taxonomy database
+parent tax_id                       -- parent node id in GenBank taxonomy database
+rank                                -- rank of this node (superkingdom, kingdom, ...) 
+embl code                           -- locus-name prefix; not unique
+division id                         -- see division.dmp file
+inherited div flag  (1 or 0)        -- 1 if node inherits division from parent
+genetic code id                     -- see gencode.dmp file
+inherited GC  flag  (1 or 0)        -- 1 if node inherits genetic code from parent
+mitochondrial genetic code id       -- see gencode.dmp file
+inherited MGC flag  (1 or 0)        -- 1 if node inherits mitochondrial gencode from parent
+GenBank hidden flag (1 or 0)        -- 1 if name is suppressed in GenBank entry lineage
+hidden subtree root flag (1 or 0)   -- 1 if this subtree has no sequence data yet
+comments                            -- free-text comments and citations
+```
+
+#### names.dmp
+
+The taxonomy names file has these fields:
+
+```
+tax_id                              -- the id of node associated with this name
+name_txt                            -- name itself
+unique name                         -- the unique variant of this name if name not unique
+name class                          -- (synonym, common name, ...)
+```
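
The record layout described in the added section (fields separated by `\t|\t`, each record closed by `\t|` before the newline) can be read with a few lines of Python. This is a minimal sketch, not part of MKBDR: the helper names `parse_dmp`, `parent_by_tax_id` and `scientific_names` are hypothetical, the sample records are illustrative, and the `scientific name` name class is an assumption taken from the NCBI dump rather than from the text above.

```python
from typing import Dict, Iterable, Iterator, List

FIELD_SEP = "\t|\t"  # delimiter between fields
RECORD_END = "\t|"   # closes each record, just before the newline


def parse_dmp(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the list of fields for each record of a taxdump *.dmp file."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Drop the record terminator before splitting into fields.
        if line.endswith(RECORD_END):
            line = line[: -len(RECORD_END)]
        yield [field.strip() for field in line.split(FIELD_SEP)]


def parent_by_tax_id(nodes_lines: Iterable[str]) -> Dict[int, int]:
    """Map tax_id -> parent tax_id (fields 1 and 2 of nodes.dmp)."""
    return {int(f[0]): int(f[1]) for f in parse_dmp(nodes_lines)}


def scientific_names(names_lines: Iterable[str]) -> Dict[int, str]:
    """Map tax_id -> name_txt, keeping only the assumed 'scientific name' class."""
    return {
        int(f[0]): f[1]
        for f in parse_dmp(names_lines)
        if f[3] == "scientific name"
    }


# Illustrative records in the documented layout (not real data files).
sample_names = [
    "1\t|\tall\t|\t\t|\tsynonym\t|\n",
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
]
sample_nodes = [
    "1\t|\t1\t|\tno rank\t|\t\t|\t8\t|\t0\t|\t1\t|\t0\t|\t0\t|\t0\t|\t0\t|\t0\t|\t\t|\n",
]
```

With real files, the same helpers would take an open file handle, for example `scientific_names(open("/target/ncbi_tax/names.dmp"))`, assuming the folder used in the `init_ncbi_taxdump` example contains the extracted dump.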