Bulk tissue
# Dataframe of unique junctions with respect to genomic location
unique_junc_tiss <- sj_df_tissue %>%
tidyr::unite(col = "junction", chr, intron_start, intron_end, sep = ":", remove = FALSE) %>%
dplyr::group_by(junction) %>%
dplyr::summarise(unique_reads_junction_across_all_samples = sum(unique_reads_junction)) %>%
inner_join(sj_df_tissue %>%
dplyr::select(-unique_reads_junction, -multi_map_reads_junction, -max_splice_alignment_overhang, -Sample) %>%
tidyr::unite(col = "junction", chr, intron_start, intron_end, sep = ":", remove = FALSE) %>%
distinct(junction, .keep_all = TRUE)) %>%
dplyr::select(junction, chr, intron_start, intron_end, everything())
- Total number of unique (non-duplicated) junctions: 1147644
- Percentage of unique(non-duplicated) junctions not in annotation: 74.1567943
- The proportion of unique junctions currently not in annotation is quite high. How does the distribution of reads mapping to annotated/unannotated compare?
sj_df_tissue %>%
dplyr::mutate(in_annotation = ifelse(in_annotation == 1, TRUE, FALSE)) %>%
dplyr::group_by(in_annotation) %>%
dplyr::summarise(min_unique_reads_junction = min(unique_reads_junction),
max_unique_reads_junction = max(unique_reads_junction))
ggplot(data = sj_df_tissue %>%
dplyr::mutate(in_annotation = ifelse(in_annotation == 1, TRUE, FALSE)),
aes(x = in_annotation, y = unique_reads_junction, fill = in_annotation)) +
geom_boxplot(outlier.shape = NA) +
ylim(0,100) +
labs(y = "Number of junctions reads") +
theme_bw() +
theme(axis.text.y = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.title = element_text(size = 8, face = "bold"),
strip.text = element_text(size = 8, face = "bold"))
- Distributions are very different, as one might expect, with a shift toward higher numbers of junction reads in the annotated distribution.
- Also, if we look at the number of junction reads mapping to unannotated and annotated junctions in each sample compared to the total number of reads (junction and normal) in a sample following QC, we observe a positive correlation (as would be expected).
# Relationship between junctions and read depth
summary_sj <- sj_df_tissue %>%
dplyr::mutate(in_annotation = ifelse(in_annotation == 1, TRUE, FALSE)) %>%
dplyr::group_by(Sample, in_annotation) %>%
dplyr::summarise(n_junction_reads = sum(unique_reads_junction)) %>%
dplyr::mutate(sample_shortened = str_replace(Sample, "_.*", "")) %>%
inner_join(QC_tissue %>% dplyr::select(SampleID, after_filtering_total_reads), by = c("sample_shortened" = "SampleID")) %>%
dplyr::select(-sample_shortened)
ggplot(data = summary_sj) +
geom_point(aes(x = after_filtering_total_reads, y = n_junction_reads, colour = in_annotation)) +
facet_wrap(~in_annotation, scales = "free_y") +
labs(x = "Total number of reads after QC", y = "Total number of junction reads") +
theme_bw() +
theme(axis.text.y = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.title = element_text(size = 8, face = "bold"),
strip.text = element_text(size = 8, face = "bold"))
- Note that a high proportion of the total number of junction reads map to particular junctions (see below table). E.g. ~0.7% of the total junction reads map to an area in chromosome 18 wherein in MBP (myelin basic protein) is located.
# Percentage of unique reads mapping to unique junctions in relation to total number of unique reads across all junctions
unique_junc_tiss %>%
dplyr::mutate(total_unique_reads_junction = sum(unique_reads_junction_across_all_samples),
percentage_of_total_unique_reads_junction = (unique_reads_junction_across_all_samples/total_unique_reads_junction) * 100) %>%
dplyr::select(-total_unique_reads_junction) %>%
arrange(desc(percentage_of_total_unique_reads_junction))
- Looking at more of these junctions and the genes within the regions we find:
- 18:76980472-76984774 --> Various forms of MBP (myelin basic protein)
- 11:61965113-61965368 --> Various forms of FTH1 (ferritin heavy chain)
- 8:27598251-27598459 --> Various forms of CLU (clusterin)
- 19:48966407-48966582 --> FTL (ferritin light chain)
- 12:6537997-6538100 --> GAPDH
- 6:73518265-73518353 --> EEF1A1 (eukaryotic translation elongation factor 1 alpha 1)
- 2:216500032-216501340 --> RRPL37A (ribosomal protein L37a)
- 17:44908150-44910614 --> GFAP
- These junctions reflect genes that we might expect to be highly expressed (e.g. GAPDH). Interesting to see MBP and GFAP. Given that majority of samples here are from disease states, we might expect to see GFAP amongst highly expressed proteins.