Commit c6ac16d1 authored by Bastien Macé

learnerrors step

parent 3b726828
#STEP 1 : Be prepared
## load the package :
library("dada2")
## create a path to your ".fastq" files :
path <- "./edna_intra_pipeline_comparison/samples"
## select the ".fastq" files you want to analyze :
fns <- sort(list.files(path, pattern = ".fastq", full.names = T))
## the function lists only the files whose name matches the chosen pattern,
## and returns them with their full path (because of "full.names = T")
## then keep only the part of the file names you want :
sample.names <- sapply(strsplit(basename(fns), ".fastq"), '[', 1)
## the function "basename" removes all the path up to the file name
## the function "strsplit" removes the pattern written
########################################################################
#STEP 2 : Filtering & Trimming
## prepare the names and paths of the filtered files to be created :
filts <- file.path(path, "filtered", paste0(sample.names, ".filt.fastq.gz"))
## builds the path to a new folder named "filtered", located inside the path
## already used ; the new files are named after "sample.names", with
## the suffix ".filt.fastq.gz" added
## from the ".fastq files" of "fns", create the new ".fastq" files of
## "filts" after filtering and trimming :
out <- filterAndTrim(fns, filts,
                     truncLen = 235,
                     maxN = 0,
                     maxEE = 1,
                     compress = T,
                     verbose = T)
## "truncLen" value is chosen considering the marker length and define
## were the reads will be trimmed (after 235 bp here), and reads which
## are shortened than this value are filtered
## "maxN" is the number of N tolerated in the sequences after
## filtering (0 here)
## "maxEE" defines the maximal number of expected errors tolerated in a
## read (1 here), based on the quality score (EE = sum(10^(-Q/10)))
## "compress = T" means that the files will be gzipped
## "verbose = T" means that information concerning the number of sequences after
## sequences after filtering will be given
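## illustrative sketch (not part of the original pipeline) : how the expected
## error criterion behaves, computed on a made-up vector of Phred quality scores
Q_example <- c(38, 38, 37, 35, 32, 30, 25, 20)
sum(10^(-Q_example / 10))
## a read passes the "maxEE" filter only if this sum, computed over all its
## bases, does not exceed maxEE (1 here)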
########################################################################
#STEP 3 : Dereplication
## "derepFastq" function eliminates all the replications of each sequence in the files
derep <- derepFastq(filts)
## the function annotates each unique sequence with its abundance
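## optional illustrative check : when several files are dereplicated, "derep"
## is a list with one element per file, each storing the unique sequences
## and their abundances
head(derep[[1]]$uniques, 3)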
########################################################################
#STEP 4 : Processing step
## these functions generate ASVs from the data, using an error model
## estimated from the data itself :
## "k" below is the index of the sample being processed ; its definition (for
## example a loop over the samples) is not part of this file
err <- learnErrors(derep[k], randomize=T)
## builds the error model
dadas <- dada(derep[k], err)
## removes the erroneous sequences identified by the model, keeping
## only the ASVs
seqtab <- makeSequenceTable(dadas)
## constructs a sequence table from the retained sequences
uniqueSeqs <- getUniques(seqtab)
uniquesToFasta(uniqueSeqs, paste0("PipelineB_", sample.names[k], ".fasta"))
## creates a new ".fasta" file constaining the ASVs
The OBITOOLS command used in pipelines A is _obiclean_. This command eliminates the erroneous sequences generated during amplification and sequencing.
Sequences that give rise to variants without being variants themselves are tagged "head". The variants are tagged "internal". The remaining sequences are tagged "singleton".
By conserving only the sequences tagged "head", most of the erroneous sequences are eliminated.
The following line is launched in a shell, after the R pre-processing steps :
```
obiclean -r 0.05 -H Aquarium_2.fasta > Aquarium_2.clean.fasta
# here, the command returns only the sequences tagged "head" by the algorithm, with a chosen ratio (-r) of 0.05
```
<a name="step22"></a>
### IV - 2 - DADA2 processing step (Pipelines B)
The DADA2 function used in pipelines B is _learnErrors_. This function distinguishes the incorrect sequences, generated by amplification and sequencing errors, from the correct sequences, by estimating the sequencing error rates.
To build the error model, the function alternates estimation of the error rate and inference of sample composition until they converge on a jointly consistent solution.
The algorithm calculates the abundance p-value for each sequence. This p-value is defined by a Poisson distribution, with a parameter corresponding to the rate at which amplicons of a sequence i are generated from a sequence j.
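As a rough sketch (paraphrasing the DADA2 publication rather than quoting its exact notation), this abundance p-value can be written as :
```
p_A(i \mid j) = \frac{1}{1 - \rho_{\mathrm{pois}}(n_j \lambda_{ji},\, 0)} \sum_{a \ge a_i} \rho_{\mathrm{pois}}(n_j \lambda_{ji},\, a)
```
where rho_pois(mu, a) is the Poisson probability of observing a reads given a mean mu, n_j is the abundance of the partition core j, lambda_ji is the rate at which reads of sequence i are generated from sequence j, and a_i is the observed abundance of sequence i.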
Before that, a partition is built with the most abundant sequence as its core. All the other sequences are compared to this core. The sequence with the smallest p-value is then examined : if this p-value is lower than a parameter of the algorithm (OMEGA_A), this sequence becomes the core of a new partition, and the other sequences join the partition most likely to have produced them. This operation is repeated until no p-value falls below OMEGA_A.
Then, all the sequences of a partition are replaced by their core, so each partition corresponds to a unique sequence : the ASV (Amplicon Sequence Variant).
The following lines are launched in R following the R pre-processing steps :
```
err <- learnErrors(derep[k], randomize=T)
# builds the error model
dadas <- dada(derep[k], err)
# removes the erroneous sequences identified by the model, keeping only the ASVs
seqtab <- makeSequenceTable(dadas)
# constructs a sequence table from the retained sequences
uniqueSeqs <- getUniques(seqtab)
uniquesToFasta(uniqueSeqs, paste0("PipelineB_", sample.names[k], ".fasta"))
# creates a new ".fasta" file constaining the ASVs
```
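If needed, the OMEGA_A threshold mentioned above can also be tuned directly in the _dada_ call. The value below is only an arbitrary illustration, not the one used in this study :
```
# stricter partitioning than the package default (OMEGA_A defaults to 1e-40 in DADA2)
dadas_strict <- dada(derep[k], err, OMEGA_A = 1e-60)
```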
<a name="step23"></a>
### IV - 3 - SWARM processing step (Pipelines C)