Commit 2785f343 authored by eortega

Added update about blast of spacers on protospacers dictionary

parent 52e63215
......@@ -48,7 +48,7 @@ The script `1_sc_tout.sh` selects only the sequence of the fastq.gz file and doe
11. Creates an intermediary output file: **inter3**
The script `2_replace_protospacer6.py`
The script `2_replace_protospacer6.py` **updated to 2_replace_protospacerv2_inter3, see below for details**
The script `3_sc_countReplMism.sh`
......@@ -59,6 +59,23 @@ The script `5_sc_count_stain_MonoMulti.sh`
The script `6_sc_BIMlines.sh`
#### Details on replacing sequences of spacers by the name of the protospacers
The update consists of running BLAST with each spacer as the query and replacing the spacer with the protospacer name of the best BLAST hit (bbh). This is slow.
I made a quicker version with the script `2_replace_protospacersv2_uniq_input.py`, but a re-count at the end is necessary since multiple sequences can match the same protospacer.
I used `sort | uniq -c` to count the number of occurrences of each line in inter3 and wrote the result to inter3_sorted.
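
The replacement by best BLAST hit can be sketched as follows. This is only a minimal illustration, not the code of the scripts in this repository. It assumes that BLAST was run with tabular output (`-outfmt 6`), that the query ID of each spacer is the spacer sequence itself, and that hits for a query are listed best first. The function names `best_hits` and `replace_spacers` are hypothetical.

```python
def best_hits(blast_tabular_lines):
    """Map each query ID to the subject ID of its first listed hit.

    Assumes tab-separated lines with the query ID in column 1 and the
    subject (protospacer) ID in column 2, best hit first for each query.
    """
    best = {}
    for line in blast_tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        # Keep only the first hit seen for each query.
        best.setdefault(query, subject)
    return best


def replace_spacers(read, best, sep="-"):
    """Replace each spacer in a read by its protospacer name.

    Spacers without a hit are left unchanged.
    """
    return sep.join(best.get(token, token) for token in read.split(sep))


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    hits = ["ACGT\tP1\t100.0", "ACGT\tP2\t95.0", "GGCC\tP3\t100.0"]
    print(replace_spacers("-ACGT-TTGA-", best_hits(hits)))
```

Running BLAST only on unique spacers and looking the names up in a dictionary like this avoids blasting the same sequence repeatedly.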
a) inter3_replaced: I blasted every spacer and replaced it with the name of the protospacer of the first match (best BLAST hit). I didn't replace the sequence if it had no match. It was fairly quick (1h30 for all samples on an old computer). However, many sequences matched the same protospacer, so a re-count that takes the previous counts into account would be necessary.
b) inter4_replaced: the output as made in the original workflow. I blasted all the spacers as in the previous one; however, because the data were not condensed, it took 180h of real time (with 10 parallel jobs and a ramdisk to reduce file read/write time). This could allow the original workflow to work as it did before.
c) inter4_replaced_sort_count: I used `sort | uniq -c` to get a count table for each, which you can use more easily for statistics. Beware of the leading spaces on each line; you can use `lstrip()` in Python to remove them. Each line is a long read in which you can see the combination of protospacers that have been integrated by the bacteria.
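
A count table produced by `sort | uniq -c` could be read, and re-counted after replacement, roughly like this. It is a sketch under assumptions: each line is expected to look like leading spaces, a count, one space, then the read. The names `parse_uniq_c` and `recount` are hypothetical and not part of the workflow scripts.

```python
from collections import Counter


def parse_uniq_c(lines):
    """Yield (count, value) pairs from `uniq -c` style lines."""
    for line in lines:
        # uniq -c right-aligns the count, so remove the leading spaces.
        stripped = line.lstrip().rstrip("\n")
        if not stripped:
            continue
        count, _, value = stripped.partition(" ")
        yield int(count), value


def recount(lines, replace=lambda read: read):
    """Sum the previous counts of all reads that become identical
    once `replace` has been applied to them."""
    totals = Counter()
    for count, value in parse_uniq_c(lines):
        totals[replace(value)] += count
    return totals


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    table = ["      3 -A-B-\n", "     12 -A-C-\n", "      1 -D-\n"]
    names = {"B": "P1", "C": "P1"}

    def to_names(read):
        return "-".join(names.get(tok, tok) for tok in read.split("-"))

    for read, total in recount(table, to_names).items():
        print(total, read)
```

Summing the existing counts, rather than counting lines again, is what keeps the totals correct when several spacers map to the same protospacer.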
### Outputs
#### inter3
......@@ -67,3 +84,5 @@ The first output created is `inter3`. It contains strings starting and ending wi
The repeat sequences have been replaced with *-*. The sequences are the spacers inserted by the phage.