# Custom Metabarcoding Reference Database

[![Twitter Follow](https://img.shields.io/twitter/follow/ephe_bev?style=social)](https://twitter.com/ephe_bev)


Scripts to convert FASTA files into a reference database linked to the NCBI taxonomy.

## Introduction

Scripts to build a custom reference database containing only our own sequences, using the NCBI taxonomy.



## Workflow

* inputs:
    * FASTA file

0. Get raw FASTA files of new sequences with species names
1. Extract sequence name
2. Check sequence name format
3. Check sequence format (IUPAC ambiguity codes, gaps)
4. Correct NCBI taxonomy species name (this is semi-automatic)
5. Assign NCBI taxonomy taxid
6. Extract names with missing taxid
    1. Assign NCBI taxonomy taxid of genus
    2. Run the `obitaxonomy` command for species with no assigned taxid
7. Write fasta file of sequences with their taxid and complete genus-species name

* outputs:
    * formatted FASTA file
    * `.ldx` file of new nodes added to the taxonomy for missing taxids, linked to an existing genus/family taxid
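
As an illustration of steps 1 to 3, here is a minimal Python sketch of reading FASTA records and checking name and sequence format. It is not the code used by `mkbdr`: the "Genus species" header convention, the function names `read_fasta` and `check_record`, and the problem labels are assumptions made for the example.

```python
import re

# IUPAC nucleotide codes: the four bases plus ambiguity symbols.
UNAMBIGUOUS = set("ACGT")
IUPAC_AMBIGUOUS = set("RYSWKMBDHVN")
GAPS = set("-.")

# Assumed header convention (hypothetical): ">Genus species" and nothing else.
NAME_PATTERN = re.compile(r"^[A-Z][a-z]+ [a-z]+$")


def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def check_record(name, sequence):
    """Return the list of problems found in one record (empty if none)."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("name_format")
    symbols = set(sequence.upper())
    if symbols & GAPS:
        problems.append("gap")
    if symbols & IUPAC_AMBIGUOUS:
        problems.append("ambiguity")
    if symbols - UNAMBIGUOUS - IUPAC_AMBIGUOUS - GAPS:
        problems.append("invalid_symbol")
    return problems


# Illustrative input only.
example = """>Mullus surmuletus
ACGTACGT
ACGT
>mullus_barbatus
ACGNAC-T
"""

for name, seq in read_fasta(example):
    print(name, check_record(name, seq))
```

Returning a list of problem labels rather than a boolean makes it easy to report why a sequence was rejected.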

1. `raw fasta` --> validate --> `valide fasta` `faulty_format fasta` `faulty_taxon fasta`

2. `faulty_taxon fasta` --> curate (currently done by Laetitia) --> `curated_taxon csv`

3. Manually check and correct the `curated_taxon csv` table

4. `raw fasta` `curated_taxon csv` --> validate --> `valide fasta` and an updated taxonomy
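
The validate / curate / validate loop above can be sketched in Python as follows. This is only a toy model of the data flow, not the logic of `mkbdr`: the `triage` function, the dictionary standing in for the NCBI taxonomy, the species names and the taxid `1001` are all made up for the example, and it assumes that ambiguity codes and gaps count as a faulty format.

```python
import re

# Assumed "Genus species" naming convention (hypothetical).
NAME_PATTERN = re.compile(r"^[A-Z][a-z]+ [a-z]+$")
# Assumption: anything other than A, C, G, T is treated as a format fault.
ALLOWED = set("ACGT")


def triage(records, known_taxids, curated=None):
    """Split (name, sequence) records into valid, faulty_format and faulty_taxon.

    known_taxids: dict mapping species names to taxids (toy stand-in for the taxdump).
    curated: optional dict mapping a faulty name to its corrected name.
    """
    curated = curated or {}
    bins = {"valid": [], "faulty_format": [], "faulty_taxon": []}
    for name, seq in records:
        name = curated.get(name, name)  # apply the curation table, if any
        if not NAME_PATTERN.match(name) or set(seq.upper()) - ALLOWED:
            bins["faulty_format"].append((name, seq))
        elif name not in known_taxids:
            bins["faulty_taxon"].append((name, seq))
        else:
            bins["valid"].append((name, seq, known_taxids[name]))
    return bins


# Made-up names and taxid, for illustration only.
taxonomy = {"Examplus validus": 1001}
raw = [
    ("Examplus validus", "ACGT"),
    ("Examplus valdus", "ACGTT"),  # misspelled species name
    ("bad_name", "ACGT"),
    ("Examplus validus", "ACNT"),
]

first_pass = triage(raw, taxonomy)
# The curation table corrects the names found in the faulty_taxon bin.
curation = {"Examplus valdus": "Examplus validus"}
second_pass = triage(raw, taxonomy, curated=curation)
print(len(first_pass["valid"]), len(second_pass["valid"]))
```

Re-running the same function on the raw input with the curation table mirrors step 4, where validation is repeated once the table has been checked by hand.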


## Environment

To create the conda environments with the required software:

```
conda env create -f envs/obitools_envs.yaml
conda env create -f envs/pylib_cbdr.yaml
```

* Obitools

```
conda activate obitools
```

* Required python libraries to build custom reference database

```
conda activate pylib_cbdr
```






## Usage

First time loading the taxdump

```
mkbdr validate --fasta resources/test/raw.fasta \
--ncbi_taxdump "TAXO/taxdump_2021.tar.gz" \
--output_prefix "test_raw"
```

Taxdump previously loaded (faster)

```
mkbdr validate --fasta resources/test/raw.fasta \
--output_prefix "test_raw"
```

Apply curation

```
mkbdr validate --fasta resources/test/raw.fasta \
--curate curated_taxon.csv \
--output_prefix "test_curated"
```

Generate a curation CSV file

```
mkbdr curegen --fasta test_raw_faulty_taxon.fasta \
--output_prefix "test"
```

Specify the globalnames database to query

```
mkbdr curegen --fasta test_raw_faulty_taxon.fasta \
--output_prefix "test" \
--database_globalnames 'Catalogue of Life'
```



_______________________________________________________________________________


Crash test

```
python3 mkbdr validate --fasta teleo_ok.fasta --curate curated_taxon.csv --output_prefix "truc"
```