@@ -8,19 +8,30 @@ A more detailed description of the files is below the list (use ctrl + f)
Files:
* 00_create_py_env.sh
* prepping.sh*
* 01_quality_check.sh*
* 01_2_quality_check.sh*
* 02_trimm_and_clean.sh*
* 03_mapping.sh*
* 04_snpcalling.sh*
* 04a_FreeBayes.sh*
* 04b_GATK.sh
* 04c_VarDict.sh
* 05b_convert_protospacer_dico2fasta.py*
* 06b_blast_protospaces.sh*
* 07_2_run_vcf_parser_all_files.py
* 07_2_test.py
* 07_run_vcf_parser_all_files.py*
* vcf_to_csv.py
* update_ref_genome.py
* update_ref_genome_Wlog.py
* procedure.sh
* conda_procedure.sh
* README.md
* requirements_py-env.txt
* coevolution_env.yml
* vardict_env.yml
* vcf_parser3.py
...
...
@@ -104,13 +115,24 @@ samtools index -b ${outdir}${root_name}.sort.bam
Creates a Python virtual environment using `virtualenv` and the system's default python3 version, and stores the environment in `~/envs/coev`. Packages are installed through pip.
### prepping.sh*
Prepares the data into subgroups.
Creates conda environments based on the .yml files.
Creates links to the data to form the sub-groups.
Outputs a text file, "prepping.txt", to let the user know when it is done.
prepping.sh is only executed when running through the conda_procedure.sh file and not the procedure.sh file.
### 01_quality_check.sh*
Uses FastQC to create quality control reports, then uses MultiQC to assemble the reports into a single file. To make things easier, the input files are separated into 3 groups: R, W and Other. These groups come from different treatments.
This script takes one argument: the path to the working directory, which is the project directory: `/home/user/work/coevolution/phages/`. **Don't forget the trailing slash**
### 01_2_quality_check.sh*
Same as 01_quality_check.sh, but run on the trimmed data. This is to evaluate the trim and clean step below.
### 02_trimm_and_clean.sh*
...
...
@@ -129,7 +151,7 @@ The mapper is bowtie2, after mapping the sam is sorted and converted to a bam an
This script takes one argument: the path to the working directory, which is the project directory: `/home/user/work/coevolution/phages/`. **Don't forget the trailing slash**
### 04_snpcalling.sh*
### 04a_FreeBayes.sh*
Uses FreeBayes for the SNP calling.
...
...
@@ -138,6 +160,13 @@ This script takes three arguments:
2 Path to the reference
3 Path to the output directory
### 04b_GATK.sh
Uses GATK for the SNP calling. Not part of the current procedure!
### 04c_VarDict.sh
Uses VarDict for the SNP calling. Not part of the current procedure!
### 05b_convert_protospacer_dico2fasta.py*
...
...
@@ -181,80 +210,42 @@ To run from ipython:
%run 07_2_run_vcf_parser_all_files.py
```
----
### vcf_to_csv.py
Takes the vcf files from SNP calling as input and outputs a csv file with information for further analysis. One csv file is generated for each population across timepoints, giving 8 W and 8 R csv files.
The columns in the csv hold the following information:
## The other scripts
### procedure.sh
This file contains the commands used to launch the scripts above, as well as the supplementary commands to separate the different data, extract it, and perform all other actions.
It contains a pre-treatment of data to create sub-groups using symbolic links.
-- POS: The genomic position of the variant
-- TIME: T1, T2, T3 or T4
-- ALT: Alternative allele
-- REF: Reference allele
-- AO: Alternate allele observation count
-- DP: Read depth
-- TYPE: "snp", "mnp", "ins", "del" or "complex"
-- FREQ: The allele frequency, filter >= 0.025
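
To illustrate how these columns relate, here is a minimal sketch. It assumes that FREQ is computed as AO / DP, which this README does not state, and that the 0.025 threshold is applied to that ratio. The function name `variant_rows` and the sample records are hypothetical and are not taken from vcf_to_csv.py.

```python
import csv
import io

MIN_FREQ = 0.025  # threshold given in the FREQ column description

def variant_rows(records, min_freq=MIN_FREQ):
    """Yield csv rows for records whose AO / DP ratio passes the filter.

    Assumption: FREQ = AO / DP; the real script may compute it differently.
    """
    for rec in records:
        if rec["DP"] == 0:
            continue  # no coverage, so the frequency is undefined
        freq = rec["AO"] / rec["DP"]
        if freq >= min_freq:
            yield {**rec, "FREQ": round(freq, 4)}

# Hypothetical records shaped like the columns listed above.
records = [
    {"POS": 101, "TIME": "T1", "ALT": "A", "REF": "G", "AO": 50, "DP": 100, "TYPE": "snp"},
    {"POS": 202, "TIME": "T2", "ALT": "T", "REF": "C", "AO": 1, "DP": 100, "TYPE": "snp"},
]

out = io.StringIO()
fields = ["POS", "TIME", "ALT", "REF", "AO", "DP", "TYPE", "FREQ"]
writer = csv.DictWriter(out, fieldnames=fields)
writer.writeheader()
writer.writerows(variant_rows(records))
print(out.getvalue())
```

In this sketch the second record (AO 1 out of DP 100) falls below the threshold and is dropped.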
----
### update_ref_genome.py
Creates a consensus fasta file from the reference fasta
and a VCF file whose variants are integrated into it.
Only variants with a frequency of at least 0.45 are kept.
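
The idea can be sketched as follows. This is a simplified illustration, not the code of update_ref_genome.py: it handles single-base substitutions only, and the function name `apply_snps` and the tuple layout `(pos, ref, alt, freq)` are assumptions made for the example.

```python
MIN_FREQ = 0.45  # minimum variant frequency stated above

def apply_snps(reference, variants, min_freq=MIN_FREQ):
    """Return a copy of `reference` with the qualifying SNPs substituted.

    `variants` is an iterable of (pos, ref, alt, freq) tuples with 1-based
    positions, as in a VCF. Indels and multi-base variants are skipped here.
    """
    seq = list(reference)
    for pos, ref, alt, freq in variants:
        if freq < min_freq or len(ref) != 1 or len(alt) != 1:
            continue
        if seq[pos - 1] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        seq[pos - 1] = alt
    return "".join(seq)

# Hypothetical example: the first variant passes the filter, the second does not.
consensus = apply_snps("ACGTACGT", [(2, "C", "T", 0.9), (5, "A", "G", 0.1)])
print(consensus)
```

Checking the REF base before substituting catches a VCF that was called against a different reference.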
From the genome of Streptococcus virus 2972 we'll be looking for new PAMs (Protospacer Adjacent Motifs).
Takes a vcf made from SNP calling, containing variants called in TO-WT. This vcf, together with the original reference fasta sequence, is used to make an updated reference genome for further SNP calling.
The original scripts were made to teach Antoine Nicot how to program.
### update_ref_genome_Wlog.py
The searched patterns are **AGAA** and **GGAA**, as well as their reverse complements **TTCT** and **TTCC**.
Same as above, but also outputs a log.txt in the same location as the fasta sequence. This log.txt contains information about the variants included in the updated reference.
PAMs are short sequences of 4 nucleotides located on the 3' side of a protospacer sequence. The length taken into account is 32 + 4 (protospacer + PAM).
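
A minimal sketch of such a search is shown below. It assumes the PAM is immediately adjacent to the 3' end of the 32-nt protospacer, and that a TTCT or TTCC match on the forward strand marks a protospacer on the reverse strand. The function names are hypothetical and do not come from the original scripts.

```python
import re

PROTOSPACER_LEN = 32
PAM_LEN = 4
FORWARD_PAMS = ("AGAA", "GGAA")
REVERSE_PAMS = ("TTCT", "TTCC")  # reverse complements of the forward PAMs

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_protospacers(genome):
    """Return sorted (start, strand, protospacer + PAM) tuples.

    `start` is the 0-based forward-strand position of the 36-nt window.
    """
    hits = []
    # Forward strand: the protospacer sits immediately 5' of the PAM.
    for pam in FORWARD_PAMS:
        for m in re.finditer(f"(?={pam})", genome):  # lookahead keeps overlaps
            start = m.start() - PROTOSPACER_LEN
            if start >= 0:
                hits.append((start, "+", genome[start:m.start() + PAM_LEN]))
    # Reverse strand: the reverse-complemented PAM comes first on the
    # forward strand, followed by the protospacer.
    for pam in REVERSE_PAMS:
        for m in re.finditer(f"(?={pam})", genome):
            end = m.start() + PAM_LEN + PROTOSPACER_LEN
            if end <= len(genome):
                window = genome[m.start():end]
                hits.append((m.start(), "-", reverse_complement(window)))
    return sorted(hits)

# Hypothetical toy sequence: one forward-strand PAM after 32 filler bases.
toy = "C" * 32 + "AGAA" + "C" * 4
print(find_protospacers(toy))
```

Reverse-strand hits are returned already reverse-complemented, so every reported sequence reads 5' to 3' as protospacer followed by PAM.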