Commit cdda5692 authored by eortega

Updated readme for get_PAM.py and update_ref_genome.py

parent c4ec68b0
......@@ -194,6 +194,68 @@ The commands used to launch the scripts up here as well as the supplementary com
It contains a data preprocessing step that creates sub-groups using symbolic links.
### update_ref_genome.py
Creates a consensus FASTA file from the reference FASTA
and a VCF file whose variants are integrated into it.
Variants are filtered with a minimum frequency of 0.45.
It is equivalent to `bcftools consensus`:
`bcftools consensus --sample unknown -f NC_007019.1.fasta TO-WT_S83.vcf.gz -o test.fasta`
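The idea behind the script can be sketched in a few lines of plain Python. This is only an illustration, not the actual code of `update_ref_genome.py`: the function name `apply_variants`, the tuple layout `(pos, ref, alt, freq)` and the choice of `>=` for the threshold are assumptions, and overlapping variants are not handled.

```python
def apply_variants(reference, variants, min_freq=0.45):
    """Return a consensus sequence with the retained variants integrated.

    variants: iterable of (pos, ref, alt, freq) tuples (hypothetical layout),
    where pos is 1-based as in a VCF file.
    """
    # Keep only the variants that reach the minimum frequency (assumed inclusive)
    kept = [v for v in variants if v[3] >= min_freq]
    consensus = reference
    # Apply from the right-most variant so that indels do not shift
    # the coordinates of the variants still to be applied
    for pos, ref, alt, _freq in sorted(kept, key=lambda v: v[0], reverse=True):
        start = pos - 1
        if consensus[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        consensus = consensus[:start] + alt + consensus[start + len(ref):]
    return consensus


variants = [(2, "C", "T", 0.9), (5, "A", "G", 0.3), (7, "G", "GA", 0.5)]
print(apply_variants("ACGTACGT", variants))
```

The variant at position 5 is dropped because its frequency (0.3) is below 0.45; the other two are integrated.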
### get_PAM.py -- Getting new PAMs
From the genome of Streptococcus virus 2972 we'll be looking for new PAMs (Protospacer Adjacent Motifs).
The original scripts were made to teach Antoine Nicot how to program.
The searched patterns are: **AGAA** and **GGAA**; as well as their reverse complement sequences **TTCT** and **TTCC**.
The PAMs are short sequences of 4 nucleotides located on the 3' side of a protospacer sequence. The sequence length taken into account is 32 + 4 = 36 nucleotides (protospacer + PAM).
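A minimal sketch of such a search is given here, under stated assumptions; it is not the code of `get_PAM.py`. The function name `find_pams` and the constants are hypothetical, coordinates are assumed to be 1-based, hits too close to a genome end to have a full 32-nucleotide protospacer are skipped, and a hit on the `-` strand is reported as it reads on the reference (reverse complement motif first, then the 32 following nucleotides).

```python
import re

PROTOSPACER_LENGTH = 32
FORWARD_MOTIFS = ("AGAA", "GGAA")   # PAM read on the + strand
REVERSE_MOTIFS = ("TTCT", "TTCC")   # same PAMs seen from the - strand


def find_pams(genome, protospacer_length=PROTOSPACER_LENGTH):
    """Yield (coordinate, motif, strand, sequence) for every complete hit."""
    genome = genome.upper()
    # A lookahead is used so that overlapping motifs are all reported
    forward = re.compile("(?=(" + "|".join(FORWARD_MOTIFS) + "))")
    reverse = re.compile("(?=(" + "|".join(REVERSE_MOTIFS) + "))")
    for match in forward.finditer(genome):
        i = match.start()
        if i >= protospacer_length:  # room for the protospacer upstream
            yield i + 1, match.group(1), "+", genome[i - protospacer_length:i + 4]
    for match in reverse.finditer(genome):
        i = match.start()
        end = i + 4 + protospacer_length
        if end <= len(genome):  # room for the protospacer downstream
            yield i + 1, match.group(1), "-", genome[i:end]


toy_genome = "C" * 32 + "AGAA" + "CCCC" + "TTCC" + "G" * 32
for coordinate, motif, strand, sequence in find_pams(toy_genome):
    print(coordinate, motif, strand, len(sequence))
```

Every reported sequence is 36 nucleotides long, which matches the 32 + 4 layout described above.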
#### Example of sequences & Nomenclature
Here are the sequence templates.
The PAM sequences are located on the 3' side of the protospacers.
Each sequence needs a unique name or ID for the FASTA header.
```
>PAM_Coord_ref_genome Motif=AGAA;Strand=+;Protospacer_length=N
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxAGAA
>PAM_Coord_ref_genome Motif=GGAA;Strand=+;Protospacer_length=N
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxGGAA
>PAM_Coord_ref_genome Motif=AGAA;Strand=-;Protospacer_length=N
TTCTxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>PAM_Coord_ref_genome Motif=GGAA;Strand=-;Protospacer_length=N
TTCCxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
The nomenclature of the fasta header goes as follows:
```
>PAM_152_Sv2972 Motif=AGAA;Strand=+;Protospacer_length=32
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxAGAA
>PAM_10863_Sv2972 Motif=TTCT;Strand=-;Protospacer_length=32
TTCTxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
Mandatory fields of the ID in the FASTA header (no spaces allowed):
* **PAM** = Type of sequence: **P**rotospacer **A**djacent **M**otif.
* **152** = The coordinate of the first nucleotide of the motif.
* **Sv2972** = Reference genome used to find these sequences.
Comment field of the FASTA header. There are no strict rules; I used `Definition=Value` pairs without spaces, separated by semicolons:
* **Motif=AGAA** = The motif used to find that sequence.
* **Strand=+** = The strand on which the sequence lies. If the motif is on the - strand, the sequence shown is the reverse complement.
* **Protospacer_length** = The length of the sequence considered to be a protospacer, excluding the length of the PAM.
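As an illustration of this nomenclature, the header can be built and parsed as follows. This is a sketch with hypothetical helper names (`format_header`, `parse_header`), not functions taken from `get_PAM.py`; it assumes the ID always has the three `_`-separated parts listed above.

```python
def format_header(coordinate, genome_id, motif, strand, protospacer_length=32):
    """Build a FASTA header: '>ID comment', the ID containing no spaces."""
    seq_id = f"PAM_{coordinate}_{genome_id}"
    comment = f"Motif={motif};Strand={strand};Protospacer_length={protospacer_length}"
    return f">{seq_id} {comment}"


def parse_header(header):
    """Split a header back into its ID parts and its Definition=Value pairs."""
    seq_id, _, comment = header.lstrip(">").partition(" ")
    seq_type, coordinate, genome_id = seq_id.split("_", 2)
    # The comment values are kept as strings
    fields = dict(item.split("=", 1) for item in comment.split(";") if item)
    return {"type": seq_type, "coordinate": int(coordinate),
            "genome": genome_id, **fields}


header = format_header(152, "Sv2972", "AGAA", "+")
print(header)
print(parse_header(header))
```

Keeping the ID free of spaces matters because most FASTA tools treat everything up to the first space as the sequence identifier and the rest as a free-form comment.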
### README.md
This file :-P
......@@ -235,3 +297,5 @@ toy examples and commands to be called from the other scripts
Dispensable.
Created automatically by Python when importing a `*.py` file.