# snakemake_rapidrun_swarm
OTU clustering with SWARM on RAPIDRUN data, encapsulated in a SNAKEMAKE workflow: the clustering follows [TARA Fred's metabarcoding pipeline](https://github.com/frederic-mahe/swarm/wiki/Fred%27s-metabarcoding-pipeline), and the whole pipeline is managed with [SNAKEMAKE](https://snakemake.readthedocs.io/en/stable/).
# Installation
## Prerequisites
* a linux system
* [python3](https://www.python.org/)
* [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) or [pip3](https://pip.pypa.io/en/stable/)
* snakemake (installed below)
* singularity
## Installation via Conda
The default conda solver is a bit slow and sometimes has issues with selecting the latest package releases. We therefore recommend installing Mamba as a drop-in replacement:
```
conda install -c conda-forge mamba
```
Then install Snakemake, pandas, biopython and their dependencies from the conda-forge and bioconda channels:
```
mamba create -n snakemake_rapidrun -c conda-forge -c bioconda snakemake biopython pandas
```
This installs all required software into an isolated environment that has to be activated with
```
conda activate snakemake_rapidrun
```
## Installation via pip3
Alternatively, install the python3 dependencies with pip3:
```
pip3 install pandas
pip3 install biopython
```
and the python3 dependencies required to run `snakemake`:
```
pip3 install datrie
pip3 install ConfigArgParse
pip3 install appdirs
pip3 install gitdb2
```
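To check that the environment works, you can optionally verify that the main tools are available; these commands are just a sanity check, not part of the pipeline:
```
conda activate snakemake_rapidrun
snakemake --version              # prints the installed snakemake version
python3 -c "import pandas, Bio"  # biopython is imported as the Bio package
```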
# Configuration / input files
You have to set up 2 files:
* [01_infos/all_samples.tsv](01_infos/all_samples.tsv)
* [01_infos/config_test.yaml](01_infos/config_test.yaml)
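The real schema is the one shipped in [01_infos/config_test.yaml](01_infos/config_test.yaml); the sketch below only illustrates the keys referenced in the steps of this README, and every value in it is a placeholder:
```
# hypothetical sketch; see 01_infos/config_test.yaml for the real schema
fichiers:
  rapidrun: "01_infos/all_samples.tsv"   # RAPIDRUN sample description
  dat:
    teleo: "01_infos/teleo.dat"          # one .dat file per marker key
blacklist:
  projet: []                             # projects to skip
  run: []                                # runs to skip
clustering:
  swarm:
    cores: 32
assignment:
  ecotag:
    minIdentity: 0.98
```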
# Run the workflow
Activate the conda environment, make sure the input data is in place (see [Download data](#download-data) below), and run the whole workflow (estimated time: 20 minutes):
```
conda activate snakemake_rapidrun
CORES=32
CONFIGFILE="01_infos/config_test.yaml"
bash main.sh $CORES $CONFIGFILE
```
# Run from scratch
The following sections walk through the workflow step by step.
## clone repositories
* open a shell
* clone the project and move into its main folder, which is your working directory
```
# via HTTPS
git clone https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_swarm
# or via SSH
# git clone git@gitlab.mbb.univ-montp2.fr:edna/snakemake_rapidrun_swarm.git
cd snakemake_rapidrun_swarm
```
## Download data
The complete data set can be downloaded and stored into the [resources/tutorial](https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_obitools/-/tree/master/resources/tutorial) folder with the following command:
```
wget -c https://gitlab.mbb.univ-montp2.fr/edna/tutorial_metabarcoding_data/-/raw/master/tutorial_rapidrun_data.tar.gz -O - | tar -xz -C ./resources/tutorial/
```
## write demultiplexing table
Inputs from `01_infos/config_test.yaml`:
* `fichiers:` `rapidrun`
* `fichiers:` `dat`
* `blacklist:`
* `blacklist:` `projet:` contains a list of projects you don't want to process
* `blacklist:` `run:` contains a list of runs you don't want to process (see the hedged example after this step)
* :warning: the marker names used in the `fichiers:` `rapidrun` file must match the marker keys under `fichiers:` `dat` in `01_infos/config_test.yaml`
This step generates the file `01_infos/all_demultiplex.csv`, in which each line holds the arguments of one command to run in parallel:
```
snakemake --configfile 01_infos/config_test.yaml -s readwrite_rapidrun_demultiplexing.py
```
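For instance, a blacklist section in `01_infos/config_test.yaml` could look like the snippet below; the project and run names are invented for illustration:
```
# hypothetical values; replace them with your own project/run identifiers
blacklist:
  projet:
    - "obsolete_project"
  run:
    - "failed_run_01"
```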
## merge fastq
Inputs from `01_infos/config_test.yaml`:
* `fichiers:` `rapidrun`
* `blacklist:`
* deduces the `{run}` paired-end fastq files to merge
* `blacklist:` `projet:` contains a list of projects you don't want to process
* `blacklist:` `run:` contains a list of runs you don't want to process
```
cd 02_assembly
snakemake --configfile $CONFIGFILE -s Snakefile -j $CORES --use-singularity --singularity-args "--bind /media/superdisk:/media/superdisk" --latency-wait 20
cd ..
```
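`$CONFIGFILE` and `$CORES` are the shell variables defined in the "Run the workflow" section above. To preview what a step will do before running it, you can use Snakemake's standard dry-run flag, for example:
```
# dry run: list the jobs that would be executed, without running them
snakemake --configfile $CONFIGFILE -s Snakefile -j $CORES -n
```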
## demultiplexing
Inputs from `01_infos/config_test.yaml`:
* `fichiers:` `rapidrun`
* generates a demultiplexed .fasta file for each `{projet}`/`{marker}`/`{sample}` into `03_demultiplexing`
```
cd 03_demultiplexing
snakemake --configfile $CONFIGFILE -s Snakefile -j $CORES --use-singularity --singularity-args "--bind /media/superdisk:/media/superdisk" --latency-wait 20
cd ..
```
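As a purely illustrative example (project, marker and sample names are invented), the resulting layout looks like:
```
03_demultiplexing/
└── projet1/
    └── teleo/
        ├── sample01.fasta
        └── sample02.fasta
```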
## cat qualities
Inputs from `01_infos/config_test.yaml`:
* `fichiers:` `rapidrun`
* concatenates and formats the .qual files by `{projet}`/`{marker}`
```
cd 04_cat_quality
snakemake --configfile $CONFIGFILE -s Snakefile -j $CORES --use-singularity --singularity-args "--bind /media/superdisk:/media/superdisk" --latency-wait 20
cd ..
```
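Conceptually, this step boils down to something like the sketch below; the real formatting logic lives in `04_cat_quality/Snakefile` and the paths here are hypothetical:
```
# sketch only; the actual rule is in 04_cat_quality/Snakefile
cat 03_demultiplexing/projet1/teleo/*.qual > 04_cat_quality/projet1_teleo.qual
```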
## clustering
Inputs from `01_infos/config_test.yaml`:
* `fichiers:` `rapidrun`
* `clustering:` `swarm:` `cores`
* clusters the sequences into Molecular Operational Taxonomic Units (MOTUs)
```
cd 05_clustering
snakemake --configfile $CONFIGFILE -s Snakefile -j $CORES --use-singularity --singularity-args "--bind /media/superdisk:/media/superdisk" --latency-wait 20
cd ..
```
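Under the hood this step runs SWARM; the actual rule lives in `05_clustering/Snakefile`, so the call below is only a hedged sketch with hypothetical file names (with `-z`, swarm expects `;size=` abundance annotations in the fasta headers):
```
# sketch only; the real swarm invocation is defined in 05_clustering/Snakefile
swarm -d 1 -f -z -t $CORES \
      -w projet1_teleo_representatives.fasta \
      -s projet1_teleo_statistics.txt \
      -o projet1_teleo_clusters.txt \
      projet1_teleo_dereplicated.fasta
```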
## assignment
Inputs from `01_infos/config_test.yaml`:
* `fichiers:` `rapidrun`
* `assignment:` `ecotag:` `minIdentity`
* assigns each MOTU to a species
```
cd 06_assignment
snakemake --configfile $CONFIGFILE -s Snakefile -j $CORES --use-singularity --singularity-args "--bind /media/superdisk:/media/superdisk" --latency-wait 20
cd ..
```
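Taxonomic assignment relies on `ecotag` from the OBITools suite; the exact call is defined in `06_assignment/Snakefile`. As a hedged sketch, the `minIdentity` setting presumably maps to something like ecotag's `-m` (minimum identity) option; the database and file names below are invented:
```
# sketch only; see 06_assignment/Snakefile for the real rule
ecotag -d reference_taxonomy_db \
       -R reference_sequences.fasta \
       -m 0.98 \
       motu_representatives.fasta > motus_assigned.fasta
```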
# To go further
Please check the [wiki](https://gitlab.mbb.univ-montp2.fr/edna/snakemake_rapidrun_swarm/-/wikis/home).