Commit 350e6e3b authored by mmassaviol's avatar mmassaviol
Browse files

Update old workflow RNAseq

parent 80df60f1
# Documentation Wapps
- [Documentation Wapps](#documentation-wapps)
- [1 Construire un conteneur](#1-construire-un-conteneur)
- [2 Utiliser un conteneur](#2-utiliser-un-conteneur)
- [2.1 Avec interface graphique](#21-avec-interface-graphique)
- [2.2 En ligne de commande](#22-en-ligne-de-commande)
- [2.2.1 En local](#221-en-local)
- [2.2.2 Sur un cluster](#222-sur-un-cluster)
- [3 Modifier un workflow](#3-modifier-un-workflow)
Le dépôt [Wapps](https://gitlab.mbb.univ-montp2.fr/mmassaviol/wapps) (Workflow applications) contient les fichiers nécessaires pour créer des conteneurs d'application générés à l'aide du dépot [WAW](https://gitlab.mbb.univ-montp2.fr/mmassaviol/waw) (Workflow Application Wrapper).
Les outils sont wrappés sous la forme de règles **Snakemake** (système de gestion de workflow) et de fichiers de configuration **YAML**. Les workflow sont ensuite composés d'une suite d'outils listés dans un fichier de configuration **YAML**.
......@@ -8,12 +17,13 @@ A l'aide d'un script on génère le fichier snakefile, le fichier de paramètres
Liste des workflows existants:
+ Metabarcoding (using dada2)
+ Mito_Assembler_Megahit (Megahit + MitoZ)
+ MitoZ
+ RADseq_ref (stacks)
+ RADseq_denovo (stacks)
+ Virus_Assembler_Megahit (Megahit + Blast + BWA)
- Metabarcoding (using dada2)
- Mito_Assembler_Megahit (Megahit + MitoZ)
- MitoZ
- RADseq_ref (stacks)
- RADseq_denovo (stacks)
- Virus_Assembler_Megahit (Megahit + Blast + BWA)
- RNAseq (pseudoalign (kallisto, Salmon) + expression différentielle (Edger, DESeq2))
## 1 Construire un conteneur
......@@ -79,16 +89,16 @@ Pour remplir la variable samples il suffit de faire la liste des samples en pren
Exemple:
+ sample_1_R1.fastq.gz
+ sample_1_R2.fastq.gz
+ sample_2_R1.fastq.gz
+ sample_2_R2.fastq.gz
- sample_1_R1.fastq.gz
- sample_1_R2.fastq.gz
- sample_2_R1.fastq.gz
- sample_2_R2.fastq.gz
Dans ce cas on spécifie:
+ SeOrPe: PE
+ sample_suffix: .fastq.gz
+ samples: ["sample_1","sample_2"]
- SeOrPe: PE
- sample_suffix: .fastq.gz
- samples: ["sample_1","sample_2"]
Une fois ces paramètres spécifiés on peut enfin lancer le workflow de la manière suivante:
......
......@@ -4,6 +4,10 @@ rule deseq2:
output:
de_table = config["results_dir"]+'/'+config["deseq2_output_dir"]+'/de_table.csv',
r_data = config["results_dir"]+'/'+config["deseq2_output_dir"]+'/data.RData',
PCA = config["results_dir"]+'/'+config["deseq2_output_dir"]+'/PCA_mqc.png',
DE_Table = config["results_dir"]+'/'+config["deseq2_output_dir"]+'/DE_Table_mqc.tsv',
MA_plot = config["results_dir"]+'/'+config["deseq2_output_dir"]+'/MA_plot_mqc.png',
Heatmap = config["results_dir"]+'/'+config["deseq2_output_dir"]+'/Heatmap_mqc.png',
log: config["results_dir"]+'/logs/deseq2/deseq2_log.txt'
script:
"scripts/deseq2.script.R"
\ No newline at end of file
......@@ -7,14 +7,15 @@ library(BiocParallel)
library(tximport)
library(rhdf5)
library(GenomicFeatures)
library(ggplot2)
register(MulticoreParam(snakemake@config[["deseq2_threads"]]))
samples = snakemake@config[["samples"]]#
samples = unlist(names(snakemake@config[["groups"]]))
#samples = c("DRR016125","DRR016126","DRR016127","DRR016128","DRR016129","DRR016130","DRR016131","DRR016132")
#files = file.path("/home/mbb/Documents/results2/kallisto/quant",samples,"abundance.h5")
files = sort(snakemake@input[["counts"]])
names(files) = samples
groups = snakemake@config[["groups"]]
groups = unlist(unname(snakemake@config[["groups"]]))
if (snakemake@config[["deseq2_tx2gene"]] == TRUE){
# conversion transcrits vers gènes
......@@ -49,6 +50,45 @@ res <- results(dds)
save.image(snakemake@output[["r_data"]])
write.csv(res, snakemake@output[["de_table"]])
###### PREPARE REPORT ######
cpm_counts = cpm(txi$counts)
pca = prcomp(t(cpm_counts))
png(filename = paste(snakemake@output[["PCA"]],sep = "/"), height=800, width=800)
ggplot(data.frame(pca$x),aes(x =pca$x[,1] , y = pca$x[,2],col = groups)) + geom_point()
dev.off()
options(digit=4)
resSig <- subset(res, padj < 0.1)
resSigdf = data.frame(resSig@listData,row.names = resSig@rownames)
#datatable(resSigdf,options = list(scrollX = '300px'))
cat("# id: DE_Table
# section_name: 'DE Table'
# description: 'Tableau d'expression différentielle avec p-value ajustée < 0.1'
# format: 'tsv'
# plot_type: 'table'\ngene\t", file=snakemake@output[["DE_Table"]])
write.table(resSigdf,snakemake@output[["DE_Table"]],append=TRUE,sep="\t")
png(filename = paste(snakemake@output[["MA_plot"]],sep = "/"), height=800, width=800)
plotMA(res, ylim=c(-2,2))
dev.off()
with(res, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot", xlim=c(-2.5,2)))
with(subset(res, padj<.05 ), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
with(subset(res, abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="orange"))
with(subset(res, padj<.05 & abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="green"))
cpmTop = cpm_counts[rownames(resSigdf),]
png(filename = paste(snakemake@output[["Heatmap"]],sep = "/"), height=800, width=800)
heatmap(cpmTop,main = "Heatmap",margins = c(10,1))
dev.off()
#resLFC <- lfcShrink(dds, coef="groups", type="apeglm")
#resOrdered <- res[order(res$pvalue),]
......
......@@ -7,6 +7,7 @@
description: "Differential gene expression analysis based on the negative binomial distribution. Using normalized count data.",
version: "3.8",
documentation: "https://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf",
multiqc: "custom",
commands:
[
{
......@@ -16,8 +17,12 @@
inputs: [{ name: counts }],
outputs:
[
{ name: de_table, file: de_table.csv },
{ name: RData, file: data.RData },
{ name: de_table, file: de_table.csv, description: "Tableau d'expression différentielle" },
{ name: RData, file: data.RData, description: "Sauvegarde de la session R" },
{ name: PCA, file: PCA_mqc.png, description: "PCA plot" },
{ name: DE_Table, file: DE_Table_mqc.tsv, description: "Tableau d'expression différentielle avec p-value ajustée < 0.1" },
{ name: MA_plot, file: MA_plot_mqc.png, description: "MA-plot log2 fold changes (on the y-axis) versus the mean of normalized counts (on the x-axis)" },
{ name: Heatmap, file: Heatmap_mqc.png, description: "Heatmap" },
],
options:
[
......@@ -39,7 +44,7 @@
},
{
name: "deseq2_annotations",
type: "text",
type: "input_file",
value: "",
label: "Annotation file (gtf or gff) : ",
},
......
......@@ -4,6 +4,9 @@ rule edger:
output:
de_table = config["results_dir"]+'/'+config["edger_output_dir"]+'/de_table.csv',
r_data = config["results_dir"]+'/'+config["edger_output_dir"]+'/data.RData',
PCA = config["results_dir"]+'/'+config["edger_output_dir"]+'/PCA_mqc.png',
Top_genes = config["results_dir"]+'/'+config["edger_output_dir"]+'/Top_genes_mqc.tsv',
Heatmap = config["results_dir"]+'/'+config["edger_output_dir"]+'/Heatmap_mqc.png',
log: config["results_dir"]+'/logs/edger/edger_log.txt'
script:
"scripts/edger.script.R"
\ No newline at end of file
......@@ -5,11 +5,12 @@ library(edgeR)
library(tximport)
library(rhdf5)
library(GenomicFeatures)
library(ggplot2)
samples = snakemake@config[["samples"]]
samples = unlist(names(snakemake@config[["groups"]]))
files = snakemake@input[["counts"]]
names(files) = samples
groups = snakemake@config[["groups"]]
groups = unlist(unname(snakemake@config[["groups"]]))
if (snakemake@config[["edger_tx2gene"]] == TRUE){
TxDb <- makeTxDbFromGFF(file = snakemake@config[["edger_annotations"]])
......@@ -66,3 +67,38 @@ rm(txi,tx2gene)
save.image(snakemake@output[["r_data"]])
write.csv(res, snakemake@output[["de_table"]])
###### PREPARE REPORT ######
cpm_counts = cpm(cts)
pca = prcomp(t(cpm_counts))
png(filename = paste(snakemake@output[["PCA"]],sep = "/"), height=800, width=800)
ggplot(data.frame(pca$x),aes(x =pca$x[,1] , y = pca$x[,2],col = groups)) + geom_point()
dev.off()
#plotMDS(dgelist, main = "Distances between gene expression profiles")
#plotSmear(lrt, de.tags=tt$table$gene.id)
topGenes = topTags(de.com,n = 40)
cat("# id: Top_genes
# section_name: 'Top genes'
# description: 'Tableau des top genes'
# format: 'tsv'
# plot_type: 'table'\ngene\t", file=snakemake@output[["Top_genes"]])
write.table(as.data.frame(topGenes),snakemake@output[["Top_genes"]],append=TRUE,sep="\t")
with(res, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot", xlim=c(-2.5,2)))
with(subset(res, padj<.05 ), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
with(subset(res, abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="orange"))
with(subset(res, padj<.05 & abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="green"))
cpmTop = cpm_counts[rownames(topGenes$table),]
png(filename = paste(snakemake@output[["Heatmap"]],sep = "/"), height=800, width=800)
heatmap(cpmTop)
dev.off()
#plot_ly(x = colnames(cpmTop),y = rownames(cpmTop),z=cpmTop,type = "heatmap")
\ No newline at end of file
......@@ -7,6 +7,7 @@
description: "Empirical Analysis of Digital Gene Expression Data.",
version: "3.14.0",
documentation: "http://bioconductor.org/packages/release/bioc/manuals/edgeR/man/edgeR.pdf",
multiqc: "custom",
commands:
[
{
......@@ -16,8 +17,11 @@
inputs: [{ name: counts }],
outputs:
[
{ name: de_table, file: de_table.csv },
{ name: RData, file: data.RData },
{ name: de_table, file: de_table.csv, description: "Tableau d'expression différentielle" },
{ name: RData, file: data.RData, description: "Sauvegarde de la session R" },
{ name: PCA, file: PCA_mqc.png, description: "PCA plot" },
{ name: Top_genes, file: Top_genes_mqc.tsv, description: "Tableau des top gènes" },
{ name: Heatmap, file: Heatmap_mqc.png, description: "Heatmap" },
],
options:
[
......@@ -39,7 +43,7 @@
},
{
name: "edger_annotations",
type: "text",
type: "input_file",
value: "",
label: "Annotation file (gtf or gff) : ",
},
......
......@@ -7,6 +7,7 @@
description: "kallisto is a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads.",
version: "0.45.0",
documentation: "https://pachterlab.github.io/kallisto/manual",
multiqc: "kallisto",
commands:
[
{
......@@ -14,7 +15,7 @@
command: kallisto index,
output_dir: kallisto/index,
inputs: [{ name: kallisto_index_input, file: ""}],
outputs: [{ name: index, file: index.kidx }],
outputs: [{ name: index, file: index.kidx, description: "Index kallisto" }],
options:
[
{
......@@ -32,9 +33,9 @@
inputs: [{ name: kallisto_quant_index, file: kallisto/index/index.kidx }, { name: reads }],
outputs:
[
{ name: h5, file: "{sample}/abundance.h5" },
{ name: counts, file: "{sample}/abundance.tsv" },
{ name: run_info, file: "{sample}/run_info.json" },
{ name: h5, file: "{sample}/abundance.h5", description: "Abondances au format h5" },
{ name: counts, file: "{sample}/abundance.tsv", description: "Table d'abondance" },
{ name: run_info, file: "{sample}/run_info.json", description: "Informations sur l'exécution" },
],
options:
[
......@@ -77,9 +78,9 @@
inputs: [{ name: kallisto_quant_index, file: kallisto/index/index.kidx }, { name: reads }],
outputs:
[
{ name: h5, file: "{sample}/abundance.h5" },
{ name: counts, file: "{sample}/abundance.tsv" },
{ name: run_info, file: "{sample}/run_info.json" },
{ name: h5, file: "{sample}/abundance.h5", description: "Abondances au format h5" },
{ name: counts, file: "{sample}/abundance.tsv", description: "Table d'abondance" },
{ name: run_info, file: "{sample}/run_info.json", description: "Informations sur l'exécution" },
],
options:
[
......
......@@ -7,6 +7,7 @@
description: "Salmon is a tool for quantifying the expression of transcripts using RNA-seq data. ",
version: "0.11.3",
documentation: "https://salmon.readthedocs.io/en/latest/",
multiqc: "salmon",
commands:
[
{
......@@ -14,7 +15,7 @@
command: salmon index,
output_dir: salmon/index,
inputs: [{ name: salmon_index_input }],
outputs: [{ name: index, directory: indexDir }],
outputs: [{ name: index, directory: indexDir, description: "Index Salmon" }],
options:
[
{
......@@ -43,9 +44,9 @@
inputs: [{ name: index, directory: salmon/index/indexDir }, { name: reads }],
outputs:
[
{ name: lib_counts, file: "salmon/quant/{sample}/lib_format_counts.json" },
{ name: counts, file: "salmon/quant/{sample}/quant.sf" },
{ name: run_info, file: "salmon/quant/{sample}/cmd_info.json" },
{ name: lib_counts, file: "salmon/quant/{sample}/lib_format_counts.json", description: "Comptages au format JSON" },
{ name: counts, file: "salmon/quant/{sample}/quant.sf", description: "Fichier de compte Salmon" },
{ name: run_info, file: "salmon/quant/{sample}/cmd_info.json", description: "Informations sur l'exécution" },
],
options:
[
......@@ -68,9 +69,9 @@
inputs: [{ name: index, file: salmon/index/indexDir }, { name: reads }],
outputs:
[
{ name: lib_counts, file: "{sample}/lib_format_counts.json" },
{ name: counts, file: "{sample}/quant.sf" },
{ name: run_info, file: "{sample}/cmd_info.json" },
{ name: lib_counts, file: "salmon/quant/{sample}/lib_format_counts.json", description: "Comptages au format JSON" },
{ name: counts, file: "salmon/quant/{sample}/quant.sf", description: "Fichier de compte Salmon" },
{ name: run_info, file: "salmon/quant/{sample}/cmd_info.json", description: "Informations sur l'exécution" },
],
options:
[
......
......@@ -7,6 +7,7 @@
description: A flexible read trimming tool for Illumina NGS data,
version: "0.38",
documentation: "http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf",
multiqc: "trimmomatic",
commands:
[
{
......
import os
import re
import csv
import snakemake.utils
dir_path = "/workflow"
......@@ -20,10 +21,10 @@ config = config["params"]
def raw_reads():
inputs = dict()
if (config["SeOrPe"] == "PE"):
inputs["read1"] = config['sample_dir']+'/{sample}_R1'+config["sample_suffix"],
inputs["read"] = config['sample_dir']+'/{sample}_R1'+config["sample_suffix"],
inputs["read2"] = config['sample_dir']+'/{sample}_R2'+config["sample_suffix"],
elif (config["SeOrPe"] == "SE"):
inputs["reads"] = config['sample_dir']+'/{sample}'+config["sample_suffix"],
inputs["read"] = config['sample_dir']+'/{sample}'+config["sample_suffix"],
else:
sys.exit("SeOrPe should be SE or PE")
return inputs
......@@ -35,12 +36,19 @@ def reads():
return raw_reads()
elif (config["trimming"] == "trimmomatic"):
if (config["SeOrPe"] == "SE"):
inputs["reads"] = config["results_dir"]+"/"+config["trimmomatic_SE_output_dir"]+"/{sample}_trimmed.fq.gz"
inputs["read"] = rules.trimmomatic_SE.output.read
elif (config["SeOrPe"] == "PE"):
inputs["read1"] = config["results_dir"]+"/"+config["trimmomatic_PE_output_dir"]+"/{sample}_forward_paired.fq.gz"
inputs["read2"] = config["results_dir"]+"/"+config["trimmomatic_PE_output_dir"]+"/{sample}_reverse_unpaired.fq.gz"
inputs["read"] = rules.trimmomatic_PE.output.readFP
inputs["read2"] = rules.trimmomatic_PE.output.readRP
return inputs
def get_groups():
with open(config["group_file"], mode="r") as infile:
reader = csv.reader(infile,delimiter='\t')
return {row[0]:row[1] for row in reader}
config["groups"] = get_groups()
# Tools inputs functions
def fastqc_inputs():
return raw_reads()
......@@ -58,15 +66,15 @@ def edger_inputs():
inputs = dict()
if (config["quantification"] == 'kallisto'):
if (config["SeOrPe"] == "PE"):
inputs["counts"] = expand(config["results_dir"]+"/"+config["kallisto_quant_PE_output_dir"]+"/{sample}/abundance.tsv",sample=SAMPLES)
inputs["counts"] = expand(rules.kallisto_quant_PE.output.counts,sample=SAMPLES)
else:
inputs["counts"] = expand(config["results_dir"]+"/"+config["kallisto_quant_SE_output_dir"]+"/{sample}/abundance.tsv",sample=SAMPLES)
inputs["counts"] = expand(rules.kallisto_quant_SE.output.counts,sample=SAMPLES)
elif (config["quantification"] == 'salmon'):
if (config["SeOrPe"] == "PE"):
inputs["counts"] = expand(config["results_dir"]+"/"+config["kallisto_quant_PE_output_dir"]+"/{sample}/abundance.tsv",sample=SAMPLES)
inputs["counts"] = expand(rules.salmon_quant_PE.output.counts,sample=SAMPLES)
else:
inputs["counts"] = expand(config["results_dir"]+"/"+config["kallisto_quant_SE_output_dir"]+"/{sample}/abundance.tsv",sample=SAMPLES)
inputs["counts"] = expand(rules.salmon_quant_SE.output.counts,sample=SAMPLES)
return inputs
......@@ -115,6 +123,8 @@ def step_outputs(step):
elif (step == "all"):
outputs = list(rules.multiqc.output)
return outputs
# get outputs for each choosen tools
def workflow_outputs(step):
......
......@@ -6,7 +6,7 @@
steps:
[
{ title: Trimming, name: trimming, tools: [trimmomatic, "null"], default: "null" },
{ title: Quality check, name: quality_check, tools: [fastqc], default: fastqc },
{ title: Quality check, name: quality_check, tools: [fastqc, "null"], default: fastqc },
{ title: Quantification, name: quantification, tools: [kallisto, salmon], default: kallisto },
{ title: Differential expression, name: DE, tools: [edger, deseq2], default: edger },
],
......@@ -32,6 +32,12 @@
value: ".fastq.gz",
label: "Samples suffix: ",
},
{
name: "group_file",
type: "input_file",
value: "",
label: "Group file (TSV with col1(sample name), col2 (group)): ",
},
{
name: "SeOrPe",
type: "radio",
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment