PE75 sequencing data analysis workflow.

1. Raw data

Sample_BamHI_column_1
Sample_BamHI_column_2
Sample_BamHI_column_3
Sample_BamHI_column_4
Sample_BamHI_column_5
Sample_BamHI_column_6
Sample_BamHI_column_7
Sample_BamHI_column_8
Sample_BamHI_column_9
Sample_BamHI_column_10
Sample_BamHI_column_11
Sample_BamHI_column_12
Sample_BamHI_row_1
Sample_BamHI_row_2
Sample_BamHI_row_3
Sample_BamHI_row_4
Sample_BamHI_row_5
Sample_BamHI_row_6
Sample_BamHI_row_7
Sample_BamHI_row_8
Sample_NaeI_column_1
Sample_NaeI_column_2
Sample_NaeI_column_3
Sample_NaeI_column_4
Sample_NaeI_column_5
Sample_NaeI_column_6
Sample_NaeI_column_7
Sample_NaeI_column_8
Sample_NaeI_column_9
Sample_NaeI_column_10
Sample_NaeI_column_11
Sample_NaeI_column_12
Sample_NaeI_row_1
Sample_NaeI_row_2
Sample_NaeI_row_3
Sample_NaeI_row_4
Sample_NaeI_row_5
Sample_NaeI_row_6
Sample_NaeI_row_7
Sample_NaeI_row_8

2. Trimming

Trim adapters/primers/barcodes first, and then trim the 12 bp Ds edge in its 4 possible orientations (sense, antisense, 5' and 3'). Below is an example for one sample.

./trim_galore --paired --stringency 12 -a AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT Sample_BamHI_column_1/BamHI_column_1_GTTTCG_L002_R1_001.fastq Sample_BamHI_column_1/BamHI_column_1_GTTTCG_L002_R2_001.fastq -o Sample_BamHI_column_1/
./trim_galore --paired --stringency 12 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGATCTCGTATGCCGTCTTCTGCTTG Sample_BamHI_column_1/BamHI_column_1_GTTTCG_L002_R1_001_val_1.fq Sample_BamHI_column_1/BamHI_column_1_GTTTCG_L002_R2_001_val_2.fq -o Sample_BamHI_column_1/
./trim_galore --paired --stringency 12 -a TAGGGATGAAAA Sample_BamHI_column_1/BamHI_column_1_GTTTCG_L002_R1_001_val_1_val_1.fq Sample_BamHI_column_1/BamHI_column_1_GTTTCG_L002_R2_001_val_2_val_2.fq -o Sample_BamHI_column_1/
./trim_galore --paired --stringency 12 -a TTTTCATCCCTA Sample_BamHI_column_1/BamHI_column_1_GTTTCG_L002_R1_001_val_1_val_1_val_1.fq Sample_BamHI_column_1/BamHI_column_1_GTTTCG_L002_R2_001_val_2_val_2_val_2.fq -o Sample_BamHI_column_1/
./trim_galore --paired --stringency 12 -a CTTTCATCCCTA Sample_BamHI_column_1/BamHI_column_1_GTTTCG_L002_R1_001_val_1_val_1_val_1_val_1.fq Sample_BamHI_column_1/BamHI_column_1_GTTTCG_L002_R2_001_val_2_val_2_val_2_val_2.fq -o Sample_BamHI_column_1/
./trim_galore --paired --stringency 12 -a TAGGGATGAAAG Sample_BamHI_column_1/BamHI_column_1_GTTTCG_L002_R1_001_val_1_val_1_val_1_val_1_val_1.fq Sample_BamHI_column_1/BamHI_column_1_GTTTCG_L002_R2_001_val_2_val_2_val_2_val_2_val_2.fq -o Sample_BamHI_column_1/

3. Identify junction reads

These reads originally contain both a Ds region and a maize genomic region.
(i) Run junction_read_identification.sh, which does the following:
(i.1) it copies run.rb into each of those 20 sample folders (Sample_* column or row)
(i.2) run.rb then compares two files: the output of the second trimming step in the talon_Ds_trimming job (e.g., BamHI_column_1_GTTTCG_L002_R1_001_val_1_val_1.fq) and the final output file of trimming (e.g., BamHI_column_1_GTTTCG_L002_R1_001_val_1_val_1_val_1_val_1_val_1_val_1.fq); a sketch of this comparison is shown after this list
(ii) all the junction reads are then stored as two *.selected.fq files in each of those Sample directories
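
run.rb itself is not reproduced in this document. The following is a minimal Ruby sketch of the comparison it presumably performs, assuming standard 4-line FASTQ records and assuming that any read whose sequence is shorter in the final file than in the second-trimming output carried a Ds edge and is therefore a junction read. The argument order is illustrative only.

#!/usr/bin/env ruby
# Hypothetical sketch of the run.rb comparison (the real script is not shown).
# A read whose sequence got shorter between the second-trimming output and the
# final trimming output is assumed to have carried a Ds edge (junction read).
second_fq, final_fq, selected_fq = ARGV

# Load the final (fully trimmed) reads, keyed by read ID.
final = {}
File.open(final_fq) do |fh|
  until fh.eof?
    rec = 4.times.map { fh.gets }
    break if rec[0].nil?
    final[rec[0].split.first] = rec
  end
end

# Write out the final version of every read that lost sequence to the Ds trims.
File.open(selected_fq, "w") do |out|
  File.open(second_fq) do |fh|
    until fh.eof?
      header, seq, _plus, _qual = 4.times.map { fh.gets }
      break if header.nil?
      rec = final[header.split.first]
      out.print rec.join if rec && rec[1].strip.length < seq.strip.length
    end
  end
end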


4. Bowtie Alignment

Bowtie version 1 is used so that up to 2 mismatches can be allowed (-n 2).
First, all the junction reads are combined into a single file for Bowtie alignment.

$ cat  Sample_*/*.selected.fq > clean_reads.fastq
$ bowtie -p 8 -n 2 -a --best Zea_mays.AGPv3.21.dna.chromosome.all -q clean_reads.fastq > clean_reads.bowtie.a.best.out


5. Process the Bowtie output to count the number of alignments for each read, in order to identify the "best" uniquely aligned reads

Input: clean_reads.bowtie.a.best.out, which is the output of the Bowtie command above
Output: a per-read alignment count for each aligned read in the Bowtie output (the first column is the count; a sketch of this counting is shown below)

$ perl perl_read.pl > clean_reads.bowtie.a.best.unique.txt
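
perl_read.pl is not included here. Below is a rough Ruby sketch of the per-read alignment counting it presumably performs, assuming the default Bowtie output format in which the read ID is the first tab-separated column; the real script's output columns may differ.

#!/usr/bin/env ruby
# Hypothetical sketch of the counting in perl_read.pl (the real Perl script is
# not shown). Assumes default Bowtie output with the read ID in column 1.
bowtie_out = ARGV[0] || "clean_reads.bowtie.a.best.out"

# First pass: count how many reported alignments each read has.
counts = Hash.new(0)
File.foreach(bowtie_out) { |line| counts[line.split("\t").first] += 1 }

# Second pass: prepend that count to each alignment line.
File.foreach(bowtie_out) do |line|
  fields = line.chomp.split("\t")
  puts [counts[fields.first], *fields].join("\t")
end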

Then the grep command below was used to identify the best uniquely aligned reads (those with a single alignment).
A Linux one-liner (not shown here) was also used to delete the first column (the count of 1) from reads.list.unique.txt.
The sort command below then sorts the file by chromosome and start position.

$ grep -P "^1\t"  clean_reads.bowtie.a.best.unique.txt > reads.list.unique.txt
$ sort -V -k2,3 -t $'\t' reads.list.unique.txt > reads.list.unique.sorted.txt

6. Identify the source of each junction read (which row sample or column sample it comes from)

(i) Run /storage/research/qd0005/KASHI_20140417_PE75/KR_analysis/junction_read_source.sh, which does the following:
(i.1) it copies junction.rb into each of those 20 sample folders (Sample_* column or row)
(i.2) junction.rb then compares the same two files as in step 3: the output of the second trimming step in the talon_Ds_trimming job (e.g., BamHI_column_1_GTTTCG_L002_R1_001_val_1_val_1.fq) and the final output file of trimming (e.g., BamHI_column_1_GTTTCG_L002_R1_001_val_1_val_1_val_1_val_1_val_1_val_1.fq); a sketch is shown after this list
(ii) the junction read source information is then stored as two *.junction.txt files in each of those Sample directories
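
junction.rb is not reproduced here. The sketch below shows, in Ruby, the kind of bookkeeping it presumably does to produce the four-column *.junction.txt output; the argument order and the rule for recognizing a junction read (sequence shortened by the Ds-edge trims) are assumptions.

#!/usr/bin/env ruby
# Hypothetical sketch of junction.rb (the real script is not shown).
# Records read ID, fully trimmed sequence, the sequence that still carried the
# Ds region, and the source sample, as described for *.junction.txt.
second_fq, final_fq, source_name, out_txt = ARGV

# Reads as they were before the Ds-edge trims (second trimming output).
with_ds = {}
File.open(second_fq) do |fh|
  until fh.eof?
    header, seq, _plus, _qual = 4.times.map { fh.gets }
    break if header.nil?
    with_ds[header.split.first] = seq.strip
  end
end

File.open(out_txt, "w") do |out|
  File.open(final_fq) do |fh|
    until fh.eof?
      header, seq, _plus, _qual = 4.times.map { fh.gets }
      break if header.nil?
      id = header.split.first
      next unless with_ds[id] && seq.strip.length < with_ds[id].length
      # col1: read ID, col2: trimmed seq, col3: seq with Ds region, col4: source
      out.puts [id, seq.strip, with_ds[id], source_name].join("\t")
    end
  end
end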

$ cat Sample_*/*.junction.txt > junction_reads.rowcol.txt

junction_reads.rowcol.txt is located at /storage/research/qd0005/KASHI_20140417_PE75/KR_analysis and has four columns: col1 - read ID, col2 - trimmed seq, col3 - seq with Ds region, col4 - source

7. Combine the junction reads' Bowtie alignment positions with the junction read source information

$ perl parse_mod_bowtie.pl > reads.list.unique.sorted.junction.txt
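
parse_mod_bowtie.pl is not shown in this document. A minimal Ruby sketch of the join it presumably performs is given below, assuming the read ID is the first column of both reads.list.unique.sorted.txt and junction_reads.rowcol.txt; the real script may use a different column layout.

#!/usr/bin/env ruby
# Hypothetical sketch of the join in parse_mod_bowtie.pl (real script not shown).
alignments = ARGV[0] || "reads.list.unique.sorted.txt"
sources    = ARGV[1] || "junction_reads.rowcol.txt"

# Map read ID -> source sample (column 4 of junction_reads.rowcol.txt).
source_of = {}
File.foreach(sources) do |line|
  id, _trimmed, _with_ds, src = line.chomp.split("\t")
  source_of[id] = src
end

# Append the source to each uniquely aligned junction read.
File.foreach(alignments) do |line|
  fields = line.chomp.split("\t")
  src = source_of[fields.first]
  next unless src
  puts fields.push(src).join("\t")
end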

8. Identify the insertion site from each junction read's alignment info

Input: reads.list.unique.sorted.junction.txt
The script reads every line of the input; if the read strand is +, the read's end position is the insertion site; if the read strand is -, the read's start position is the insertion site (a sketch of this logic is shown after the commands below).
The output has four columns for each identified insertion site: chr, start, end, insertion_site (start and end refer to the coordinates of the supporting read).
Then a Linux command is used to extract the first and fourth columns (chr and insertion_site) and load the result into the MySQL table "insertion".

$ ruby parse_junction.rb > reads.list.unique.sorted.junction.insertion.txt
$ awk -F"\t" '{print $1"\t"$4}' reads.list.unique.sorted.junction.insertion.txt | sort -u | sed 's/chr//g' > reads.list.unique.sorted.junction.insertion.mysql.txt
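
parse_junction.rb is not reproduced here; only the strand rule above is stated in this document. The sketch below applies that rule in Ruby, with an assumed input column order (read ID, chr, start, end, strand) that may not match the real file.

#!/usr/bin/env ruby
# Hypothetical sketch of the strand logic in parse_junction.rb (real script not
# shown). The input column order below is an assumption.
File.foreach(ARGV[0] || "reads.list.unique.sorted.junction.txt") do |line|
  _id, chr, start, stop, strand = line.chomp.split("\t")
  # + strand: the read's end is the insertion site; - strand: the read's start.
  site = strand == "+" ? stop : start
  puts [chr, start, stop, site].join("\t")
end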


9. Run MySQL queries to find out where the insertions are located with respect to genes

mysql> desc insertion;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| chr   | varchar(5) | YES  |     | NULL    |       |
| pos   | int(12)    | YES  |     | NULL    |       |
+-------+------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

#how many unique insertion sites were identified
mysql> select count(*) from insertion;
+----------+
| count(*) |
+----------+
|     3900 |
+----------+
1 row in set (0.00 sec)

#how many known genes contain at least one of the identified insertion sites
mysql> select count(distinct gene_gff.id) from insertion, gene_gff where insertion.chr = gene_gff.chr and insertion.pos >= gene_gff.start and insertion.pos <= gene_gff.end;
+-----------------------------+
| count(distinct gene_gff.id) |
+-----------------------------+
|                         215 |
+-----------------------------+
1 row in set (22.27 sec)

#or genes with an insertion within a certain distance
mysql> select count(distinct gene_gff.id) from insertion, gene_gff where insertion.chr = gene_gff.chr and insertion.pos >= (gene_gff.start - 500) and insertion.pos <= (gene_gff.end + 500);
+-----------------------------+
| count(distinct gene_gff.id) |
+-----------------------------+
|                         259 |
+-----------------------------+
1 row in set (10.26 sec)

mysql> select count(distinct gene_gff.id) from insertion, gene_gff where insertion.chr = gene_gff.chr and insertion.pos >= (gene_gff.start - 1000) and insertion.pos <= (gene_gff.end + 1000);
+-----------------------------+
| count(distinct gene_gff.id) |
+-----------------------------+
|                         295 |
+-----------------------------+
1 row in set (10.23 sec)

mysql> select count(distinct gene_gff.id) from insertion, gene_gff where insertion.chr = gene_gff.chr and insertion.pos >= (gene_gff.start - 2000) and insertion.pos <= (gene_gff.end + 2000);
+-----------------------------+
| count(distinct gene_gff.id) |
+-----------------------------+
|                         358 |
+-----------------------------+
1 row in set (10.24 sec)

# The following query groups the number of insertion sites per gene
mysql> select gene_gff.id, count(*) from insertion, gene_gff where insertion.chr = gene_gff.chr and insertion.pos >= gene_gff.start and insertion.pos <= gene_gff.end group by gene_gff.id;
215 rows in set (9.87 sec)

mysql> select gene_gff.id, count(*) as C from insertion, gene_gff where insertion.chr = gene_gff.chr and insertion.pos >= gene_gff.start and insertion.pos <= gene_gff.end group by gene_gff.id having C=1;
147 rows in set (9.93 sec)

10. Find out which of those insertion sites are supported by intersecting row and column samples

The script below identifies insertion sites supported by exactly 1 intersecting row and 1 intersecting column, because those are the only cases where we know for certain the correct row/column address (a sketch of this logic is shown after the command below).

$ perl SingleDsInsertionAddress.pl > insertion_address.txt
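
SingleDsInsertionAddress.pl is not included here. The Ruby sketch below illustrates the intersection rule described above, under the assumption of a tab-separated input (hypothetical file name) with columns chr, insertion_site, source, where the source names contain "row" or "column" as in the sample names of section 1.

#!/usr/bin/env ruby
# Hypothetical sketch of the row/column intersection in
# SingleDsInsertionAddress.pl (the real Perl script is not shown).
sites = Hash.new { |h, k| h[k] = { rows: [], cols: [] } }

File.foreach(ARGV[0] || "insertion_sites_with_sources.txt") do |line|  # assumed input
  chr, pos, source = line.chomp.split("\t")
  next unless source
  if source.include?("row")
    sites[[chr, pos]][:rows] << source
  elsif source.include?("column")
    sites[[chr, pos]][:cols] << source
  end
end

# Keep only sites supported by exactly one row and exactly one column sample:
# the only cases with an unambiguous plate address.
sites.each do |(chr, pos), support|
  rows = support[:rows].uniq
  cols = support[:cols].uniq
  next unless rows.size == 1 && cols.size == 1
  puts [chr, pos, rows.first, cols.first].join("\t")
end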

mysql> load data local infile "~/insertion_address.txt" into table insertion_address;

mysql> select count(*) from insertion_address;
+----------+
| count(*) |
+----------+
|      775 |
+----------+

mysql> select count(distinct gene_gff.id) from insertion_address, gene_gff where insertion_address.chr = gene_gff.chr and insertion_address.pos >= gene_gff.start and insertion_address.pos <= gene_gff.end order by insertion_address.chr, insertion_address.pos;
+-----------------------------+
| count(distinct gene_gff.id) |
+-----------------------------+
|                          71 |
+-----------------------------+
1 row in set (9.36 sec)

mysql> select gene_gff.id, count(*) from insertion_address, gene_gff where insertion_address.chr = gene_gff.chr and insertion_address.pos >= gene_gff.start and insertion_address.pos <= gene_gff.end group by gene_gff.id;
71 rows in set (2.09 sec)

mysql> select gene_gff.id, count(*) as C from insertion_address, gene_gff where insertion_address.chr = gene_gff.chr and insertion_address.pos >= gene_gff.start and insertion_address.pos <= gene_gff.end group by gene_gff.id having C=1;
29 rows in set (2.06 sec)

11. JBrowse display

Link is here: JBrowse


Additional information

mysql> select gene_gff.id, insertion_address.* from insertion_address, gene_gff where insertion_address.chr = gene_gff.chr and insertion_address.pos >= gene_gff.start and insertion_address.pos <= gene_gff.end into outfile '/tmp/insertion_address_gene.txt';
Query OK, 534 rows affected (2.65 sec)

mysql> select insertion_address.*, gene_gff.* from insertion_address, gene_gff where insertion_address.chr = gene_gff.chr and insertion_address.pos >= gene_gff.start and insertion_address.pos <= gene_gff.end into outfile '/tmp/insertion_address_gene_gff.txt';

mysql> select insertion_address.*, gene_gff.* from insertion_address left join gene_gff on insertion_address.chr = gene_gff.chr where insertion_address.pos >= gene_gff.start and insertion_address.pos <= gene_gff.end into outfile '/tmp/insertion_address_left_join_gene_gff.txt';
Query OK, 534 rows affected (2.84 sec)

$ perl SingleDsInsertionAddressCount.pl > SingleDsInsertionAddressCount.txt

Link to the text files.
insertion_address_gene.txt
insertion_address_gene_gff.txt
SingleDsInsertionAddressCount.txt