# Genome assemblies collection

A collection of commands to assemble genomes from NGS data.


## Prerequisites

* [platanus 1.2.4](http://platanus.bio.titech.ac.jp/)

* [supernova 2.1.1](https://support.10xgenomics.com/de-novo-assembly/software/downloads/latest)

* [longranger 2.2.2](https://support.10xgenomics.com/genome-exome/software/pipelines/latest/what-is-long-ranger)

* [arcs](https://github.com/bcgsc/arcs): please follow the instructions in [install.sh](arcs/install.sh)

# Data files

## Description

We aim to assemble three _de novo_ draft genomes, each from one individual of a different species.

### Estimation of genome size

Species             |  ID        |  Estimated size (Mbp) | C-value
--------------------|------------|-----------------------|---------
_Diplodus sargus_   |  `sar`     |  567                  | 0.58
_Mullus surmuletus_ |  `mullus`  |  636                  | 0.65
_Serranus cabrilla_ |  `serran`  |  792                  | 0.81**

- Estimated size : genome size estimated from the C-value (in pg) reported in the [Animal Genome Size Database](http://www.genomesize.com). The formula is _Estimated size (bp) = C-value x 0.978 x 10⁹_.

**_As no C-value for Serranus cabrilla was available in the Animal Genome Size Database, we used the C-value of Serranus hepatus._
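
The estimated sizes in the table can be checked from the command line. This is a minimal sketch: the helper name `estimate_genome_size_mbp` is hypothetical, and it assumes that `awk` is available and that the C-value is expressed in picograms.

```shell
# Hypothetical helper: convert a C-value (pg) into an estimated genome size in Mbp,
# using Estimated size = C-value x 0.978 x 10^9 bp, i.e. C-value x 978 Mbp.
estimate_genome_size_mbp() {
    awk -v c="$1" 'BEGIN { printf "%.0f\n", c * 978 }'
}

# C-values taken from the table above
for entry in sar:0.58 mullus:0.65 serran:0.81; do
    id=${entry%%:*}
    cvalue=${entry##*:}
    echo "$id $(estimate_genome_size_mbp "$cvalue") Mbp"
done
```

With the three C-values of the table this prints 567, 636 and 792 Mbp.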

The three draft genomes were sequenced using NGS technologies. 

### Paired-end

* For each genome, two paired-end libraries with insert sizes of 350 bp and 550 bp were constructed.
* They were sequenced as 150 bp paired-end reads on an Illumina HiSeq 4000 sequencer.
* Each sample used for the paired-end libraries contained 1-2 µg of DNA in a total volume of 50 µL.
* Library preparation and sequencing were performed by [FASTERIS](https://www.fasteris.com/dna/).

### Mate-pair

* For each genome, two mate-pair libraries with insert sizes of 3 kbp and 5 kbp were constructed.
* They were sequenced as 150 bp paired-end reads on an Illumina HiSeq 4000 sequencer.
* Each sample used for the mate-pair libraries contained 4 µg of DNA in a total volume of 50 µL.
* Library preparation and sequencing were performed by [FASTERIS](https://www.fasteris.com/dna/).

### 10X Genomics

* For _Serranus cabrilla_, a linked-reads library was prepared using the Chromium technology.
* Linked reads require high molecular weight (HMW) DNA: samples had to contain at least 10 µg of DNA at a concentration of 110 ng/µL. Approximately 1 ng of HMW DNA was processed on a 10X Chromium instrument to prepare barcoded libraries.
* These libraries were sequenced on an Illumina HiSeq 2500 machine.
* Library preparation and sequencing were performed by [MGX](https://www.mgx.cnrs.fr).


## Set the initial directory structure
- pe_dir
	- lib350bp_R1.fastq
	- lib350bp_R2.fastq
	- lib550bp_R1.fastq
	- lib550bp_R2.fastq
- me_dir
	- lib3kbp_R1.fastq
	- lib3kbp_R2.fastq
	- lib5kbp_R1.fastq
	- lib5kbp_R2.fastq
- x_dir
	- Lib_10_S1_L002_I1_001.fastq
	- Lib_10_S1_L002_R1_001.fastq
	- Lib_10_S1_L002_R2_001.fastq

Here `pe_dir` is the folder of paired-end sequencing results, `me_dir` the folder of mate-pair results and `x_dir` the folder of linked-reads results.
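
The layout above can be created and checked before launching an assembly. The following is only a sketch: the helper names `init_layout` and `check_layout` are hypothetical, and the file names are the ones listed above (the `x_dir` files only exist for `serran`).

```shell
# Hypothetical helper: create the three input folders under a base directory
init_layout() {
    base=${1:-.}
    mkdir -p "$base/pe_dir" "$base/me_dir" "$base/x_dir"
}

# Hypothetical helper: report every expected FASTQ file that is missing.
# Returns 0 if all files are present, 1 otherwise.
check_layout() {
    base=${1:-.}
    missing=0
    for f in \
        pe_dir/lib350bp_R1.fastq pe_dir/lib350bp_R2.fastq \
        pe_dir/lib550bp_R1.fastq pe_dir/lib550bp_R2.fastq \
        me_dir/lib3kbp_R1.fastq me_dir/lib3kbp_R2.fastq \
        me_dir/lib5kbp_R1.fastq me_dir/lib5kbp_R2.fastq \
        x_dir/Lib_10_S1_L002_I1_001.fastq \
        x_dir/Lib_10_S1_L002_R1_001.fastq \
        x_dir/Lib_10_S1_L002_R2_001.fastq
    do
        if [ ! -f "$base/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A typical use would be `init_layout .` before copying the sequencing results, then `check_layout .` before running the assemblers.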


# Genome assembly methods

## Platanus
Platanus is a novel de novo sequence assembler that can reconstruct genomic sequences of highly heterozygous diploids from massively parallel shotgun sequencing data.

### 1. Contig assembly

```
platanus assemble -tmp temp/ -m 256 -t 64 -o serran_assemble -f pe_dir/*.fastq 2> assemble.log
```

### 2. Scaffolding

```
platanus scaffold -t 64 -tmp temp/ -c serran_assemble_contig.fa -b serran_assemble_contigBubble.fa -IP1 pe_dir/lib350bp_R*.fastq -IP2 pe_dir/lib550bp_R*.fastq -OP3 me_dir/lib3kbp_R*.fastq -OP4 me_dir/lib5kbp_R*.fastq 2> scaffold.log
```

### 3. Gap closing
```
platanus gap_close -t 64 -tmp temp/ -o serran_hpc_gapclose -c out_scaffold.fa -IP1 pe_dir/lib350bp_R*.fastq -IP2 pe_dir/lib550bp_R*.fastq -OP3 me_dir/lib3kbp_R*.fastq -OP4 me_dir/lib5kbp_R*.fastq 2> gapclose.log
```


## Supernova

Supernova should be run using 38-56x coverage of the genome.

- Somewhat higher coverage is sometimes advantageous.
- Supernova will exit if it finds that coverage is far from the recommended range.
- Note that at most 2.14 billion reads are allowed.
- The Supernova developers note that they have not extensively tested genomes larger than human; any genome above approximately 4 GB should be considered experimental and is not supported.

### 1. De novo assembly

Generate a whole-genome _de novo_ assembly for `serran`:

```
supernova run --id=serran --fastqs=x_dir/ --localmem=470 --maxreads=298666666
```
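
The value passed to `--maxreads` above is what one obtains when capping the input at 56x coverage (the upper end of the recommended range) for a genome of 800 Mbp and reads of 150 bp. Whether the value was actually derived this way is an assumption on our side, as the read length and the genome size used are not stated here. The sketch below only shows the arithmetic N = C x G / L; the helper name `max_reads_for_coverage` is hypothetical.

```shell
# Hypothetical helper: number of reads N giving coverage C on a genome of G bp
# with reads of L bp, from C = L x N / G (rounded down).
# usage: max_reads_for_coverage COVERAGE GENOME_SIZE_BP READ_LENGTH_BP
max_reads_for_coverage() {
    awk -v c="$1" -v g="$2" -v l="$3" 'BEGIN { printf "%.0f\n", int(c * g / l) }'
}

# Assumed inputs (not stated in this README): 56x, 800 Mbp genome, 150 bp reads
max_reads_for_coverage 56 800000000 150
```

With these assumed inputs the function prints 298666666, the value used in the command above.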

### 2. Generating phased genome sequences

Once the `serran` assembly has completed, we generate FASTA files representing the assembly.

```
supernova mkoutput --style=pseudohap2 --asmdir=serran/outs/assembly --outprefix=serranus
```

The `pseudohap2` style, identified in FASTA records as `style=4`, generates a single record per scaffold, except that for each scaffold two ‘parallel’ pseudohaplotypes are created and placed in separate FASTA files. Records in these files are parallel to each other. Megabubble arms are chosen arbitrarily, so many records will mix maternal and paternal alleles.


## ARCS

ARCS scaffolds genome sequence assemblies using 10X Genomics Chromium data. In other words, we use linked-reads information to improve a genome assembly built from paired-end/mate-pair libraries.

See [arcs_pipeline.sh](arcs/pipeline.sh) for details.


## Measuring genome assemblies

```
bash measuring/genome_assembly_fasta_summarize.sh genome_assembly.fasta
```
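
The script itself is not reproduced here. As a rough illustration of the kind of statistic reported in the Results section, an N50 can be computed from a FASTA file with standard tools. This is a sketch under our own assumptions, not the repository's script: the helper names `fasta_lengths` and `n50` are hypothetical, and the N50 is taken relative to the sum of the sequence lengths in the file.

```shell
# Hypothetical helper: print the length of each sequence of a FASTA file, one per line
fasta_lengths() {
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Hypothetical helper: read lengths on stdin and print the N50, i.e. the length at which
# the cumulative sum of lengths, sorted from longest to shortest, reaches half of the total
n50() {
    sort -rn | awk '{ len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= total / 2) { print len[i]; exit }
            }
        }'
}
```

The two helpers are meant to be chained, for example `fasta_lengths genome_assembly.fasta | n50`.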

# Results

## Summary statistics of genome assemblies
_A CSV file of this table is available_ [here](results/resultats_assemblage_genome-1.csv)

Species             |  Genome assembler  |  Computing platform  |  Library                                                                 |  # of contigs  |  Contig N50  |  Total size(Mbp)  |  # of scaffolds  |  Scaffold N50  |  Coverage
--------------------|--------------------|----------------------|--------------------------------------------------------------------------|----------------|--------------|--------------------|------------------|----------------|----------
_Diplodus sargus_   |  Platanus          |  MBB                 |  Paired-end 350bp & 550bp insert size Mate-pair 3Kbp & 5Kbp insert size  |  3030501       |  991         |  856               |  4227            |  2660334       |  53X
_Diplodus sargus_   |  Platanus          |  MESO@LR             |  Paired-end 350bp & 550bp insert size Mate-pair 3Kbp & 5Kbp insert size  |  2408078       |  1101        |  785               |  2344            |  3371708       |  57X
_Mullus surmuletus_ |  Platanus          |  MBB                 |  Paired-end 350bp & 550bp insert size Mate-pair 3Kbp & 5Kbp insert size  |  3616310       |  344         |  653               |  5417            |  192808        |  69X
_Mullus surmuletus_ |  Platanus          |  MESO@LR             |  Paired-end 350bp & 550bp insert size Mate-pair 3Kbp & 5Kbp insert size  |  3146055       |  384         |  613               |  2940            |  488370        |  74X
_Mullus surmuletus_ |  Abyss2            |  MBB                 |  Paired-end 350bp & 550bp insert size Mate-pair 3Kbp & 5Kbp insert size  |  36011115      |  96          |  686               |  4938            |  17739         |  66X
_Serranus cabrilla_ |  Platanus          |  MBB                 |  Paired-end 350bp & 550bp insert size Mate-pair 3Kbp & 5Kbp insert size  |  2169385       |  1135        |  627               |  2190            |  613541        |  63X
_Serranus cabrilla_ |  `Supernova`       |  Both                |  `Chromium Linked-reads`                                                 |  NA            |  NA          |  `223`             |  `4951`          |  `67074`       |  96X;`75X`
_Serranus cabrilla_ |  ARCS              |  MBB                 |  Both                                                                    |  NA            |  NA          |  627               |  2122            |  624679        |  63X


- Species : the species of the organism we sequenced
- Genome assembler : the software/workflow we used to perform genome assembly
- Computing platform : the high-performance computing platform we used to perform genome assembly
	* [MBB](https://mbb.univ-montp2.fr/MBB/index.php) has 64 cores and 512 GB RAM
	* [MESO@LR](https://meso-lr.umontpellier.fr/) has 80 cores and 1 TB RAM
- Library : see [data description](#data-files)
- Number of contigs : number of contigs, i.e. sets of overlapping DNA segments
- Contig N50 : the contig length such that contigs of this length or longer together represent half of the total genome size
- Number of scaffolds : number of scaffolds, i.e. sets of linked contigs
- Scaffold N50 : the scaffold length such that scaffolds of this length or longer together represent half of the total genome size
- Coverage : defined as C = L x N / G, with C the coverage, L the length of a read, N the number of reads and G the total genome size. For the `Supernova` assembler we give two values: the coverage of the raw NGS data and the coverage after capping the number of reads at the maximum allowed by the program to perform the assembly.
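
The coverage definition above translates directly into a small calculation. This is a minimal sketch, assuming uncompressed four-line FASTQ records and reads of uniform length; the helper names `count_reads` and `coverage` are hypothetical.

```shell
# Hypothetical helper: number of reads in an uncompressed FASTQ file (4 lines per read)
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Hypothetical helper: coverage C = L x N / G
# usage: coverage READ_LENGTH_BP NUMBER_OF_READS GENOME_SIZE_BP
coverage() {
    awk -v l="$1" -v n="$2" -v g="$3" 'BEGIN { printf "%.1f\n", l * n / g }'
}
```

For example, `coverage 150 "$(count_reads pe_dir/lib350bp_R1.fastq)" 792000000` gives the coverage contributed by one file of 150 bp reads for the estimated size of `serran`; for a paired library, N is the sum of the read counts of both files.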