Changes

peguerin · 3d328d08
--- a/Reference.md
+++ b/Reference.md
+# Workflow management
+
+We use [Snakemake](https://snakemake.readthedocs.io/en/stable/), a workflow management system, to build a scalable and reproducible metabarcoding analysis.
+
+Snakemake uses wildcards and rules. A rule describes a shell command in terms of its input and output files. Wildcards generalize the values that these inputs and outputs can take. Because of limitations of wildcards at the demultiplexing and concatenation steps, the workflow is split into 5 Snakemake workflows.
+
+* **01_settings** produces the table that defines the wildcards
+* **02_assembly** merges `run` paired-end .fastq files
+* **03_demultiplex** generates `projet`/`marker`/`run`/`sample` .fasta files from `run` merged .fastq files
+* **04_filter_samples** filters `projet`/`marker`/`run`/`sample` .fasta files
+* the `projet`/`marker`/`run`/`sample` .fasta files are then concatenated into `projet`/`marker`/`run` .fasta files
+* **05_assignment** produces `projet`/`marker`/`run` species occurrence table files
+
+Each workflow is stored with the following structure:
+```
+├── workflow
+│   ├── rules
+│   │   ├── module1.smk
+│   │   └── module2.smk
+│   ├── envs
+│   │   └── tools.yaml
+│   └── Snakefile
+├── config
+│   └── config.yaml
+└── results
+    └── workflow
+        ├── module1
+        └── module2
+```
+The workflow code goes into the `workflow` subfolder, while the configuration is stored in the `config` subfolder. Inside `workflow`, the central `Snakefile` marks the entry point of the workflow. Results are written into the `results` subfolder, which follows the same structure as the `workflow` subfolder.
+
+# Wildcards
+
+Input and output file names vary from one dataset to another, so we parameterize them with wildcards. We use 4 wildcards:
+
+* `projet`
+* `marker`
+* `run`
+* `sample`
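
As a rough illustration of how the four wildcards combine, the sketch below expands a path pattern over every combination of wildcard values in plain Python. The path pattern, the `expand_pattern` helper and the example values are hypothetical and not taken from the workflow; in an actual Snakefile, Snakemake's own `expand()` helper plays this role.

```python
from itertools import product

# Hypothetical path pattern built from the four wildcards of this workflow.
PATTERN = "results/{projet}/{marker}/{run}/{sample}.fasta"


def expand_pattern(pattern, **values):
    """Return one path per combination of the given wildcard values."""
    names = list(values)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(values[name] for name in names))
    ]


# Example values are made up for illustration only.
paths = expand_pattern(
    PATTERN,
    projet=["projetA"],
    marker=["markerA"],
    run=["run1"],
    sample=["sample1", "sample2"],
)
for path in paths:
    print(path)
```

Each combination of values yields one concrete file name, which is how a single rule can cover many `projet`/`marker`/`run`/`sample` files.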
+
+# Rules
+
+A Snakemake workflow is defined by specifying rules in a Snakefile. Rules decompose the workflow into small steps by specifying how to create sets of output files from sets of input files.
+
+## Overview
+
+Legend: Snakefiles in green, sets of input/output files in yellow, configuration files in light cyan, rules in grey.
+
+```mermaid
+
+graph TD;   
+  classDef Fichier fill:#ffff00;
+  classDef Configurer fill:#E0FFFF;
+  classDef Snakefile fill:#98FB98,stroke:#0c4c0c;
+
+
+ 
+  id0s{{`marker` sample description .dat}}-->s0;
+  id0c{{config.yaml}}-->s0;
+  id0a{{rapidrun.tsv}}-->s0;
+
+  s0[readwrite_rapidrun_demultiplexing]-->s1[assembly];
+  s1-->s2[demultiplexing];
+  s2-->s3[filtering];
+  s3-->s4[assignment];
+
+  s0--> id0b{{demultiplex.csv}};
+
+
+  id1[`run` paired-end .fastq files]-->r1(illuminapairedend);
+
+  subgraph g1[assembly]
+  r1(illuminapairedend)-->r2(remove_unaligned);
+  click r1 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/02_assembly/rules/illuminapairedend.smk";
+  click r2 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/02_assembly/rules/remove_unaligned.smk";
+  end
+  
+  r2-->id2[`run` annotated merged .fastq files];  
+  id2-->r3
+
+  subgraph g2[demultiplexing]
+  r3(assign_sequences)-->r4(split_sequences);  
+  click r3 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/03_demultiplex/rules/assign_sequences.smk";
+  click r4 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/03_demultiplex/rules/split_sequences.smk";  
+  end
+
+  r4-->id3[`projet`/`marker`/`run`/`sample` annotated .fasta files];
+  id3-->r5
+  
+  subgraph g3[filtering]
+  r5(dereplicate_samples)-->r6(goodlength_samples);
+  r6-->r7(clean_pcrerr_samples);
+  r7-->r8(rm_internal_samples);  
+  click r5 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/04_filter_samples/rules/dereplicate_samples.smk";
+  click r6 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/04_filter_samples/rules/goodlength_samples.smk";
+  click r7 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/04_filter_samples/rules/clean_pcrerr_samples.smk";
+  click r8 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/04_filter_samples/rules/rm_internal_samples.smk";  
+  end
+
+  r8-->id4[`projet`/`marker`/`run`/`sample` filtered .fasta files];
+  id4-->|concatenate `sample` into `run` files|id5[`projet`/`marker`/`run` .fasta files];
+
+  id5-->r9;
+
+
+  subgraph g4[assignment]
+  
+  r9(dereplicate_runs)-->r10(assign_taxon);
+  r10-->r11(rm_attributes);
+  r11-->r12(sort_runs);
+  r12-->r13(table_runs);  
+  click r9 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/05_assignment/rules/dereplicate_runs.smk";
+  click r10 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/05_assignment/rules/assign_taxon.smk";
+  click r11 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/05_assignment/rules/rm_attributes.smk";
+  click r12 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/05_assignment/rules/sort_runs.smk";
+  click r13 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/05_assignment/rules/table_runs.smk";  
+  end
+
+  r13-->id6[`projet`/`marker`/`run` species occurrence .tsv files];
+
+
+  
+  
+  id0a-->s1; 
+
+
+
+  id0b-->s2;
+  id0b-->s3;
+  id0b-->s4;
+
+
+  click id0c "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/tree/master/config";
+  click id0b "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/tree/master/results/01_settings"
+  click id0a "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/resources/test/all_samples.tsv";
+  click id0s "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/tree/master/resources/test/sample_description";
+
+  
+
+  click s0 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/01_settings/readwrite_rapidrun_demultiplexing.py";
+  click s1 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/02_assembly/Snakefile";
+  click s2 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/03_demultiplex/Snakefile";
+  click s3 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/04_filter_samples/Snakefile";
+  click s4 "https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/05_assignment/Snakefile";
+  
+  class id0a,id0b,id0c,id0s Configurer;
+  class s0,s1,s2,s3,s4,g1,g2,g3,g4 Snakefile;
+  class id1,id2,id3,id4,id5,id6 Fichier; 
+
+
+```
+
+## Description
+
+### 0. Configuration
+
+The [config file](config/) defines a dictionary of configuration parameters and their values. These values will be used by the workflows.
+
+| parameter | description | software | rules | default value | expected type |
+|------------|--------------|-----------|-------|-----------|--------------------|
+| singularity                    | absolute path of the Singularity container file [![https://www.singularity-hub.org/static/simg/hosted-singularity--hub-%23e32929.svg](https://www.singularity-hub.org/static/img/hosted-singularity--hub-%23e32929.svg)](https://singularity-hub.org/collections/2878) | [singularity](https://singularity.lbl.gov/) | every rule needs this container to work | /workdir/conteneur/obitools.simg | absolute path file |
+| fichiers: rapidrun              | absolute path of the rapidrun .tsv file | [readwrite_rapidrun_demultiplexing](https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/01_settings/readwrite_rapidrun_demultiplexing.py) | settings | resources/test/all_samples.tsv | absolute path file |
+| fichiers: folder_fastq          | absolute path of a folder which contains paired-end raw reads .fastq.gz |  [illuminapairedend](https://pythonhosted.org/OBITools/scripts/illuminapairedend.html?highlight=illumina#module-illuminapairedend) | illuminapairedend | /workdir/ngs/ | absolute path folder |
+| dat: `marker`                   | absolute path of `marker` sample description .dat file | [ngsfilter](https://pythonhosted.org/OBITools/scripts/ngsfilter.html) | assign_sequences | resources/test/sample_description/`marker`.dat | dictionary `marker`: absolute path of file |
+| blacklist: projet               | list of `projet` to exclude from the analysis | [readwrite_rapidrun_demultiplexing](https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/01_settings/readwrite_rapidrun_demultiplexing.py) | settings | dummy_projet | `projet` wildcards value |
+| blacklist: run                  | list of `run` to exclude from the analysis | [readwrite_rapidrun_demultiplexing](01_settings/readwrite_rapidrun_demultiplexing.py) | settings | dummy_projet | `run` wildcards value |
+| illuminapairedend: s_min        | minimum score for keeping an alignment. If the alignment score is below this threshold, the two sequences are simply concatenated and the mode attribute is set to the value joined | [illuminapairedend](https://pythonhosted.org/OBITools/scripts/illuminapairedend.html?highlight=illumina#module-illuminapairedend) | illuminapairedend | 40 | integer |
+| good_length_samples: seq_count  | minimum number of copies for keeping a sequence | [obigrep](https://pythonhosted.org/OBITools/scripts/obigrep.html?highlight=obigrep#module-obigrep) | good_length_samples | 1 | integer |
+| good_length_samples: seq_length | minimum length for keeping a sequence | [obigrep](https://pythonhosted.org/OBITools/scripts/obigrep.html?highlight=obigrep#module-obigrep) | good_length_samples | 23 | integer |
+| clean_pcrerr_samples: r         | threshold ratio between the counts (rare/abundant) of two sequence records, below which the less abundant one is considered a variant of the more abundant one | [obiclean](https://pythonhosted.org/OBITools/scripts/obiclean.html?highlight=obiclean#module-obiclean) | clean_pcrerr_samples | 0.05 | float |
+| assign_taxon: bdr: `marker`      | absolute path to the folder of `marker` reference database and prefix | [ecotag](https://pythonhosted.org/OBITools/scripts/ecotag.html?highlight=ecotag#module-ecotag) | assign_taxon | /workdir/reference_database/`marker`/embl_std | absolute path of a folder + prefix |
+| assign_taxon: fasta: `marker`    |  absolute path to the .fasta file of the `marker` reference database | [ecotag](https://pythonhosted.org/OBITools/scripts/ecotag.html?highlight=ecotag#module-ecotag) | assign_taxon | /workdir/reference_database/`marker`/db_embl_std.fasta | absolute path file |
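
The expected types in the table above can be checked before a run. The following is a minimal sketch, assuming the configuration has already been loaded into a nested Python dictionary; the nesting, the `EXPECTED_TYPES` mapping and the `check_config` helper are illustrative assumptions, not part of the workflow.

```python
# Expected Python type for a few numeric parameters of the table above.
EXPECTED_TYPES = {
    ("illuminapairedend", "s_min"): int,
    ("good_length_samples", "seq_count"): int,
    ("good_length_samples", "seq_length"): int,
    ("clean_pcrerr_samples", "r"): float,
}

# Default values copied from the table; the nested layout is an assumption.
DEFAULT_CONFIG = {
    "illuminapairedend": {"s_min": 40},
    "good_length_samples": {"seq_count": 1, "seq_length": 23},
    "clean_pcrerr_samples": {"r": 0.05},
}


def check_config(config):
    """Return a list of human-readable problems found in `config`."""
    problems = []
    for (section, key), expected in EXPECTED_TYPES.items():
        value = config.get(section, {}).get(key)
        if value is None:
            problems.append(f"{section}: {key} is missing")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{section}: {key} should be {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


print(check_config(DEFAULT_CONFIG))
```

An empty list means every checked parameter is present and has the expected type.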
+
+
+### 1. Settings
+
+[01_settings/readwrite_rapidrun_demultiplexing.py](https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/01_settings/readwrite_rapidrun_demultiplexing.py): writes the demultiplex.csv file that the Snakefiles read to define their wildcards.
+* inputs:
+  * [sample description .dat files](https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/tree/master/resources/test/sample_description): a table with 6 columns (plaque, plaque1, barcode, primer5, primer3, infos), with one row per `plaque` element. Each sample description file belongs to one `marker` wildcard value.
+  * [config.yaml](https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/tree/master/config): see the configuration step
+  * [rapidrun.tsv](https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/blob/master/resources/test/all_samples.tsv): a table with 5 columns (plaque, run, sample, projet, marker), with one row per `projet`/`marker`/`run`/`plaque`==`sample` element.
+* output:
+  * [results/01_settings/demultiplexing.csv](https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/tree/master/results/01_settings): a dataframe with 14 columns (demultiplex, projet, marker, run, plaque, sample, barcode5, barcode3, primer5, primer3, min_f, min_r, lenBarcode5, lenBarcode3), with one row per `projet`/`marker`/`run`/`plaque`==`sample` element.
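
The settings step can be pictured as a join between the rapidrun table and the per-marker sample descriptions, after dropping blacklisted projects and runs. The sketch below is not the code of `readwrite_rapidrun_demultiplexing.py`: the `build_rows` helper, the example data, the header row in the rapidrun table and the `tag5:tag3` barcode format are all assumptions, and the `demultiplex`, `min_f` and `min_r` columns are left out because their derivation is not described here.

```python
import csv
import io


def build_rows(rapidrun_tsv, descriptions, blacklist_projet=(), blacklist_run=()):
    """Join rapidrun records with their plate description, skipping blacklisted ones."""
    rows = []
    for rec in csv.DictReader(io.StringIO(rapidrun_tsv), delimiter="\t"):
        if rec["projet"] in blacklist_projet or rec["run"] in blacklist_run:
            continue
        desc = descriptions[rec["marker"]][rec["plaque"]]
        # Assumption: the barcode column holds "tag5:tag3".
        barcode5, _, barcode3 = desc["barcode"].partition(":")
        rows.append({
            "projet": rec["projet"], "marker": rec["marker"], "run": rec["run"],
            "plaque": rec["plaque"], "sample": rec["sample"],
            "barcode5": barcode5, "barcode3": barcode3,
            "primer5": desc["primer5"], "primer3": desc["primer3"],
            "lenBarcode5": len(barcode5), "lenBarcode3": len(barcode3),
        })
    return rows


# Made-up example data: two samples, one of them in a blacklisted project.
RAPIDRUN = (
    "plaque\trun\tsample\tprojet\tmarker\n"
    "P1A1\trun1\tS1\tprojetA\tmarkerA\n"
    "P1A2\trun1\tS2\tdummy_projet\tmarkerA\n"
)
DESCRIPTIONS = {
    "markerA": {
        "P1A1": {"barcode": "acgt:ttga", "primer5": "ACACCG", "primer3": "CTTCCG"},
        "P1A2": {"barcode": "ggca:catg", "primer5": "ACACCG", "primer3": "CTTCCG"},
    }
}

rows = build_rows(RAPIDRUN, DESCRIPTIONS, blacklist_projet={"dummy_projet"})
for row in rows:
    print(row)
```

Only the sample from `projetA` survives, since `dummy_projet` is blacklisted.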
+
+```mermaid
+
+classDiagram
+
+Sample --> Marker : is defined by
+Sample --> Run : is defined by
+Sample --> Projet : is defined by
+Projet --> Run : contains
+Marker --> MarkerSampleDescription : refers to
+
+
+Sample : id_sample
+Sample : id_marker
+Sample : id_plaque
+Sample : id_run
+Sample : id_projet
+
+Marker : id_marker
+Marker : .dat file path
+
+MarkerSampleDescription : id_plaque
+MarkerSampleDescription : barcode3'
+MarkerSampleDescription : barcode5'
+MarkerSampleDescription : primer3'
+MarkerSampleDescription : primer5'
+
+Projet : id_projet
+Projet : name
+
+Run : id_run
+Run : name
+Run : R1 .fastq.gz file path
+Run : R2 .fastq.gz file path
+```
+
+
+### 2. Assembly
+
+### 3. Demultiplexing
+
+### 4. Filtering
+
+### 5. Taxonomic assignment and format
+
+
+
+
+## write demultiplex table
+
+### read 'rapidrun' .tsv file
+### remove blacklisted runs & projects
+### write table projet/run/sample
+"demultiplex","projet","marker","run","plaque","sample","barcode5","barcode3","primer5","primer3","min_f","min_r","lenBarcode5","lenBarcode3"
+
+## assemble
+### Paired-end alignment, then keep reads whose alignment score reaches the `s_min` threshold (40 by default)
+### Remove unaligned sequence records
+## demultiplex
+### Assign each sequence record to the corresponding sample/marker combination
+### Split the input sequence file into a set of subfiles according to the values of the attribute `sample`
+## filter samples
+### Dereplicate reads into unique sequences
+### Keep only sequences longer than 20 bp, with no ambiguous IUPAC code and a total coverage greater than 10 reads
+### Clean the sequences for PCR/sequencing errors (sequence variants)
+### Remove sequences which are classified as 'internal' by obiclean
+## concatenate samples into run
+## assignment
+### Dereplicate and merge samples together
+### Assign each sequence to a taxon
+### Some unneeded attributes can be removed at this stage
+### The sequences can be sorted by decreasing order of count
+### Generate the final results table
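
The length, ambiguity and coverage criterion listed under "filter samples" can be expressed as a simple predicate. This is only a plain-Python sketch of the criterion: the workflow itself relies on obigrep, and the `keep_sequence` helper and its default thresholds (taken from the outline above, which differ from the defaults in the configuration table) are illustrative assumptions.

```python
import re

# Only the four unambiguous nucleotide codes are accepted.
UNAMBIGUOUS = re.compile(r"[acgt]+", re.IGNORECASE)


def keep_sequence(sequence, count, min_length=20, min_count=10):
    """Return True if a dereplicated sequence passes the three filters."""
    return (
        len(sequence) > min_length
        and count > min_count
        and UNAMBIGUOUS.fullmatch(sequence) is not None
    )


# Made-up records: (sequence, number of reads).
records = [("acgt" * 6, 11), ("acgt" * 5, 50), ("acgn" * 6, 50)]
kept = [seq for seq, count in records if keep_sequence(seq, count)]
print(len(kept))
```

Here only the first record is kept: the second is exactly 20 bp long and the third contains the ambiguous code `n`.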
+
+
+
+
+
+# Environment
+
+Software and dependencies can be run directly on the local system, inside a container, or through a package management system.
+
+## Containers
+
+[![https://www.singularity-hub.org/static/simg/hosted-singularity--hub-%23e32929.svg](https://www.singularity-hub.org/static/img/hosted-singularity--hub-%23e32929.svg)](https://singularity-hub.org/collections/2878)
+
+We provide ready-to-run containers built with [Singularity](https://www.sylabs.io/). All the software required to run the workflow is installed in this container. Users can either download the prebuilt container or build it themselves from the [Singularity.obitools](Singularity.obitools) recipe. In both cases, the result is an `obitools.simg` file. Set the `singularity:` field of [config.yaml](config/) to the absolute path of this file.
+
+## Conda
+
+Software can be installed through a conda environment. Each rule loads its own environment. The environment definition is stored at `workflow/envs/obitools_envs.yaml`.
+