- folder which contains the reference database files. You can build a reference database by following the instructions [here](projet_builtdatabase).
- folder which contains the paired-end raw reads `.fastq.gz` files and the sample description `.dat` files. Raw reads files from the same pair must be named `*_R1.fastq.gz` and `*_R2.fastq.gz`, where the wildcard `*` is the name of the sequencing run. The names of the sample description `.dat` files must sort in the same alphanumeric order as the names of the paired-end raw reads `.fastq.gz` files. The sample description file is a text file in which each line describes one sample, with columns separated by space or tab characters. Its format is described [here](https://pythonhosted.org/OBITools/scripts/ngsfilter.html).
* run the pipeline :
* define 2 folders in the current directory :
- folder `bdr` which contains the reference database files. You can build a reference database by following the instructions [here](projet_builtdatabase).
- folder `raw` which contains the paired-end raw reads `.fastq.gz` files and the sample description `.dat` files. Raw reads files from the same pair must be named `*_R1.fastq.gz` and `*_R2.fastq.gz`, where the wildcard `*` is the name of the sequencing run. The names of the sample description `.dat` files must sort in the same alphanumeric order as the names of the paired-end raw reads `.fastq.gz` files. The sample description file is a text file in which each line describes one sample, with columns separated by space or tab characters. Its format is described [here](https://pythonhosted.org/OBITools/scripts/ngsfilter.html).
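
The naming rules for the `raw` folder can be checked before launching the pipeline. The following is a minimal sketch, not part of the pipeline itself: the function name `check_raw_folder` is hypothetical, and it assumes that "alphanumeric order" means a plain string sort on the run names and on the `.dat` file names without their extension.

```python
from pathlib import Path

R1_SUFFIX = "_R1.fastq.gz"
R2_SUFFIX = "_R2.fastq.gz"


def check_raw_folder(raw_dir):
    """Pair each sequencing run with its sample description file.

    Returns a list of (run_name, r1_path, r2_path, dat_path) tuples.
    Raises ValueError when the folder does not follow the naming rules.
    """
    raw = Path(raw_dir)

    # Run names are whatever precedes "_R1.fastq.gz" (the wildcard part).
    run_names = sorted(p.name[: -len(R1_SUFFIX)] for p in raw.glob("*" + R1_SUFFIX))
    # Assumption: .dat files are ordered by their name without extension.
    dat_files = sorted(raw.glob("*.dat"), key=lambda p: p.stem)

    if len(run_names) != len(dat_files):
        raise ValueError(
            f"{len(run_names)} sequencing runs but {len(dat_files)} .dat files"
        )

    pairs = []
    for run, dat in zip(run_names, dat_files):
        r1 = raw / (run + R1_SUFFIX)
        r2 = raw / (run + R2_SUFFIX)
        if not r2.exists():
            raise ValueError(f"missing mate file {r2.name}")
        pairs.append((run, r1, r2, dat))
    return pairs
```

Calling `check_raw_folder("raw")` from the current directory lists which `.dat` file would be matched with which pair of reads files under these assumptions.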
* Overview of the steps
0. Configuration
1. Merge Illumina paired-end sequences by pair
2. Assign each merged sequence to the corresponding sample
3. Dereplicate sequences
4. Filter unique sequences according to their qualities and abundances
5. Remove singletons and PCR errors
6. Assign each sequence to a species
7. Write a species/sample matrix
## 4.2 Configuration
Parameters for each program are stored in the file [config.yaml](config.yaml).
Before running the pipeline, you have to set your parameters. Please edit [config.yaml](config.yaml).
```yaml
illuminapairedend:
- s_min : 40
good_length_samples:
- count : 10
- seq_length : 20
clean_pcrerr_samples:
- r : 0.05
assign_taxon:
- bdr : bdr/embl_std
- fasta : bdr/db_std.fasta
```
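
Because every parameter is written as a list item (`- key : value`), a YAML loader returns each section as a list of one-key mappings, not as a flat mapping. The sketch below illustrates this with a hand-written Python literal that mirrors the file above; `CONFIG` and the helper `flatten_section` are hypothetical names used for illustration, and this is not necessarily how the pipeline reads its configuration.

```python
# Hand-written equivalent of config.yaml once parsed by a YAML loader:
# each section is a list of single-key dictionaries.
CONFIG = {
    "illuminapairedend": [{"s_min": 40}],
    "good_length_samples": [{"count": 10}, {"seq_length": 20}],
    "clean_pcrerr_samples": [{"r": 0.05}],
    "assign_taxon": [{"bdr": "bdr/embl_std"}, {"fasta": "bdr/db_std.fasta"}],
}


def flatten_section(entries):
    """Merge a list of one-key dictionaries into a single dictionary."""
    merged = {}
    for entry in entries:
        merged.update(entry)
    return merged


# One flat dictionary of parameters per program section.
params = {name: flatten_section(entries) for name, entries in CONFIG.items()}

print(params["good_length_samples"]["seq_length"])
print(params["assign_taxon"]["bdr"])
```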
* `s_min : 40` : minimum score for keeping an alignment. If the alignment score is below this threshold, the two sequences are simply concatenated and the `mode` attribute is set to the value `joined`.
- software : `illuminapairedend`
- step : merge Illumina paired-end sequences by pair
- we set this value to 40
* `count : 10` : minimum number of copies for keeping a sequence.
- software : `obigrep`
- step : filter unique sequences according to their qualities and abundances
- we set this value to 10
* `seq_length : 20` : minimum length for keeping a sequence.
- software : `obigrep`
- step : filter unique sequences according to their qualities and abundances
- we set this value to 20
* `r : 0.05` : threshold ratio between the counts (rare/abundant counts) of two sequence records below which the less abundant one is considered a variant of the more abundant one.
- software : `obiclean`
- step : remove singletons and PCR errors
- we set this value to 0.05
* `bdr : bdr/embl_std` : relative path to the folder `bdr` which contains the reference database files, followed by the prefix of the reference database files, for instance `embl_something`.
- software : `ecotag`
- step : assign each sequence to a species
* `fasta : bdr/db_std.fasta` : relative path to the fasta file of the reference database.
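
To make the effect of the numeric thresholds above concrete, here is a minimal sketch of the three decisions they control. It is an illustration and not the OBITools implementation: the function names are hypothetical, the label `"alignment"` for a kept alignment is an assumption, the bounds for `count` and `seq_length` are assumed to be inclusive, and the ratio test is assumed to be strict. Any other criterion `obiclean` applies, such as how similar the two sequences are, is not modelled.

```python
S_MIN = 40       # illuminapairedend: minimum alignment score
MIN_COUNT = 10   # obigrep: minimum number of copies
MIN_LENGTH = 20  # obigrep: minimum sequence length
RATIO = 0.05     # obiclean: rare/abundant count ratio


def merge_mode(alignment_score, s_min=S_MIN):
    """Return "joined" when the score is below s_min (reads concatenated).

    The label "alignment" for the other case is an assumption.
    """
    return "joined" if alignment_score < s_min else "alignment"


def keep_sequence(sequence, count, min_count=MIN_COUNT, min_length=MIN_LENGTH):
    """Keep a unique sequence that is abundant enough and long enough.

    Both bounds are assumed to be inclusive.
    """
    return count >= min_count and len(sequence) >= min_length


def is_variant(rare_count, abundant_count, r=RATIO):
    """Tell whether the count ratio flags the rarer sequence as a variant.

    A strict comparison is assumed.
    """
    return rare_count / abundant_count < r


print(merge_mode(35))                # score below s_min
print(keep_sequence("ACGT" * 6, 12)) # 24 bases, 12 copies
print(is_variant(3, 100))            # ratio 0.03
```

With the values set in `config.yaml`, a merged pair scoring 35 would be marked `joined`, a 24-base sequence seen 12 times would be kept, and a sequence seen 3 times next to one seen 100 times would be treated as a variant under these assumptions.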