# Custom Metabarcoding Reference Database

[![Twitter Follow](https://img.shields.io/twitter/follow/ephe_bev?style=social)](https://twitter.com/ephe_bev)


Scripts to convert FASTA files into a reference database with NCBI taxonomy.

## Introduction

**mkbdr** is a Python program designed to create a reference database from a FASTA file using the NCBI taxonomy. It also provides tools to assist with and perform taxonomy curation on the input FASTA file.



## Workflow

* inputs:
    * FASTA file


1. Check FASTA format
2. Check species name format
3. Check DNA sequence format
4. Check species name against NCBI taxonomy
5. Attribute NCBI taxid
6. Write `valid`, `faulty_taxon` and `faulty_format` FASTA files
7. Curate species name using `curation` CSV file
8. Write new nodes in NCBI taxonomy for unattributed taxid species
9. Write `valid` FASTA files

* outputs:
    * formatted FASTA file
    * .ldx new nodes added to the taxonomy for species without a taxid, linked to the existing genus/family taxid
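
The format checks in steps 1 to 4 can be illustrated with a minimal Python sketch. This is not mkbdr's actual implementation: the header layout (`>Genus_species`), the accepted IUPAC alphabet, and the function names `read_fasta` and `classify` are assumptions made for illustration only.

```python
import re

# Assumption: a species name is "Genus species" once underscores become spaces.
SPECIES_RE = re.compile(r"^[A-Z][a-z]+ [a-z-]+$")
# Assumption: IUPAC nucleotide ambiguity codes are accepted.
DNA_RE = re.compile(r"^[ACGTRYSWKMBDHVN]+$", re.IGNORECASE)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def classify(header, sequence, known_names):
    """Return 'valid', 'faulty_format' or 'faulty_taxon' for one record."""
    name = header.replace("_", " ").strip()
    if not SPECIES_RE.match(name) or not DNA_RE.match(sequence):
        return "faulty_format"
    if name not in known_names:
        return "faulty_taxon"
    return "valid"


if __name__ == "__main__":
    fasta = [">Salmo_trutta", "ACGT", "ACGN", ">Homo_fakus", "ACGT"]
    for header, seq in read_fasta(fasta):
        print(header, classify(header, seq, {"Salmo trutta"}))
```

Here `known_names` stands in for the set of scientific names found in the NCBI taxonomy; records are routed to one of the three output categories listed in step 6.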

1. `raw fasta` --> validate --> `valid fasta` `faulty_format fasta` `faulty_taxon fasta`

2. `faulty_taxon fasta` --> curegen --> `curation csv`

3. Manually check and correct the `curation csv` table

4. `raw fasta` `curation csv` --> validate --> `valid fasta` and updated taxonomy


## Environment

To create the conda environments with the required software:

```
conda env create -f envs/obitools_envs.yaml
conda env create -f envs/pylib_cbdr.yaml
```

* Obitools

```
conda activate obitools
```

* Required python libraries to build custom reference database

```
conda activate pylib_cbdr
```






## Usage

First time loading the taxdump

```
mkbdr validate --fasta resources/test/raw.fasta \
--ncbi_taxdump "TAXO/taxdump_2021.tar.gz" \
--output_prefix "test_raw"
```

taxdump previously loaded (faster)

```
mkbdr validate --fasta resources/test/raw.fasta \
--output_prefix "test_raw"
```

Apply curation

```
mkbdr validate --fasta resources/test/raw.fasta \
--curate curated_taxon.csv \
--output_prefix "test_curated"
```

Generate a curation csv file

```
mkbdr curegen --fasta test_raw_faulty_taxon.fasta \
--output_prefix "test"
```

Specify the globalnames database to query

```
mkbdr curegen --fasta test_raw_faulty_taxon.fasta \
--output_prefix "test" \
--database_globalnames 'Catalogue of Life'
```



_______________________________________________________________________________


crash test
```
python3 mkbdr validate --fasta teleo_ok.fasta --curate curated_taxon.csv --output_prefix "truc"
cd TAXO/testouille; tar zxvf taxdump_2021.tar.gz ; cd ../../
python3 mkbdr validate --fasta teleo_ok.fasta --curate curated_taxon.csv --output_prefix "truc" --curate curated_taxon.csv --ncbi_taxdump TAXO/testouille --ncbi_taxdump_edition
```

obitools

```
conda activate obitools
ecotag -t TAXO/testouille -R truc_valide.fasta -m 0.95 -r nimp.fasta
```

## Taxonomy

(Thanks to the work of [Guy Leonard](https://github.com/guyleonard/taxdump_edit).)

### Structure of *.dmp files

As per NCBI's taxdump_readme.txt, each file stores one record per line, terminated by "\t|\n" (tab, vertical bar, and newline) characters. Each record consists of one or more fields delimited by "\t|\t" (tab, vertical bar, and tab) characters. A brief description of the field positions and meanings for nodes.dmp and names.dmp follows.
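
As an illustration of the delimiters described above, here is a small Python sketch that splits one record into its fields. The function name `parse_dmp_line` is hypothetical and is not part of mkbdr.

```python
def parse_dmp_line(line):
    """Split one *.dmp record into its list of fields."""
    # Drop the trailing newline, then the "\t|" that closes the record.
    record = line.rstrip("\n")
    if record.endswith("\t|"):
        record = record[:-2]
    # Fields are separated by tab, vertical bar, tab.
    return record.split("\t|\t")


if __name__ == "__main__":
    print(parse_dmp_line("9606\t|\t9605\t|\tspecies\t|\n"))
```

Empty fields are preserved as empty strings, so field positions stay aligned with the descriptions in the following sections.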

### nodes.dmp

This file represents taxonomy nodes. The description for each node includes the following fields:

```
tax_id                              -- node id in GenBank taxonomy database
parent tax_id                       -- parent node id in GenBank taxonomy database
rank                                -- rank of this node (superkingdom, kingdom, ...) 
embl code                           -- locus-name prefix; not unique
division id                         -- see division.dmp file
inherited div flag  (1 or 0)        -- 1 if node inherits division from parent
genetic code id                     -- see gencode.dmp file
inherited GC  flag  (1 or 0)        -- 1 if node inherits genetic code from parent
mitochondrial genetic code id       -- see gencode.dmp file
inherited MGC flag  (1 or 0)        -- 1 if node inherits mitochondrial gencode from parent
GenBank hidden flag (1 or 0)        -- 1 if name is suppressed in GenBank entry lineage
hidden subtree root flag (1 or 0)   -- 1 if this subtree has no sequence data yet
comments                            -- free-text comments and citations
```
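
Only the first three fields are needed to walk the tree. The sketch below, which assumes nothing about how mkbdr itself reads the file, builds a child-to-parent map and climbs from a node to the root. The function names and the tax_ids in the demo are made up for illustration.

```python
def load_parents(lines):
    """Map tax_id -> (parent tax_id, rank) from nodes.dmp lines."""
    parents = {}
    for line in lines:
        record = line.rstrip("\n")
        if record.endswith("\t|"):
            record = record[:-2]
        fields = record.split("\t|\t")
        parents[int(fields[0])] = (int(fields[1]), fields[2])
    return parents


def lineage(tax_id, parents):
    """Return [(tax_id, rank), ...] from the given node up to the root."""
    path = []
    while True:
        parent_id, rank = parents[tax_id]
        path.append((tax_id, rank))
        if parent_id == tax_id:  # the root node is its own parent
            return path
        tax_id = parent_id


if __name__ == "__main__":
    # Toy records with invented ids; real nodes.dmp lines carry 13 fields.
    toy = [
        "1\t|\t1\t|\tno rank\t|\n",
        "10\t|\t1\t|\tgenus\t|\n",
        "20\t|\t10\t|\tspecies\t|\n",
    ]
    print(lineage(20, load_parents(toy)))
```

A new node for a species without a taxid would be attached the same way: its parent tax_id field points at the existing genus or family node.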

### names.dmp

Taxonomy names file has these fields:

```
tax_id                              -- the id of node associated with this name
name_txt                            -- name itself
unique name                         -- the unique variant of this name if name not unique
name class                          -- (synonym, common name, ...)
```
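
To check a species name against the taxonomy, one option is to index names.dmp by name. The sketch below keeps only records whose name class is `scientific name`; the function name is hypothetical, and mkbdr may well resolve names differently (for instance by also accepting synonyms).

```python
def load_scientific_names(lines):
    """Map scientific name -> tax_id from names.dmp lines."""
    names = {}
    for line in lines:
        record = line.rstrip("\n")
        if record.endswith("\t|"):
            record = record[:-2]
        tax_id, name_txt, _unique_name, name_class = record.split("\t|\t")
        if name_class == "scientific name":
            names[name_txt] = int(tax_id)
    return names


if __name__ == "__main__":
    sample = [
        "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
        "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
    ]
    print(load_scientific_names(sample))
```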

### Taxdump Files

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar zxvf taxdump.tar.gz
cd TAXO/testouille; tar zxvf taxdump_2021.tar.gz ; cd ../../