Aims: to annotate STAR-outputted junctions by .gtf exon definitions; to remove background noise; to determine the proportion of annotated/partially annotated/unannotated junctions in each group; and to compare this with leafcutter defined introns

Hypothesis: if there is a general problem with the splicing machinery, we might expect it to be reflect in the proportion of partially annotated/unannotated i.e. increased in disease groups compared to control
To do this, will require creating an rse_jx.rda file, similar to recount2 format. This requires three objects summarised as a RangedSummarizedExperiment:
1. rowRanges: GRanges object containing all unique junction ids found across samples, with annotation
2. colData: metadata for all samples
3. assay: count data for each junction in each sample. Rows = junctions; columns = sample ids.

1 File paths/files for workflow

source(here::here("R", "file_paths.R"))

sample_info <- 
  read_excel(path = 
               file.path(
                 path_to_raw_data,
                 "sample_details/20201229_MasterFile_SampleInfo.xlsx"
               ), 
             sheet = "SampleInfo", 
             skip = 1)  %>% 
  dplyr::filter(Sample_Type == "Tissue section" & sent_to_bulk_seq == "yes") %>% 
  dplyr::rename(sample_id = CaseNo, 
                RIN = RINe_bulkRNA_Tapestation) %>% 
  dplyr::mutate(Disease_Group = fct_relevel(Disease_Group,
                                            c("Control", "PD", "PDD","DLB")),
                Sex = as.factor(Sex)) %>% 
  dplyr::select(sample_id, Disease_Group, Sex, AoO, AoD, DD, PMI, aSN, TAU, 'thal AB', 'aCG aSN score', Genetics, RIN) %>% 
  dplyr::inner_join(
    read_delim(
      file.path(
        path_to_results,
        "deconvolution/scaden/2020Feb/scaden_summary_results.txt"
      ), 
      delim = "\t")  %>% 
      dplyr::mutate(Disease_Group = fct_relevel(Disease_Group,
                                                c("Control", "PD", "PDD", "DLB")),
                    Celltype = str_c(Celltype, "_proportions")) %>% 
      dplyr::filter(cells == 1000 & samples == 1000) %>% 
      dplyr::select(-source, -cells, -samples, -Disease_Group, -Celltype_class) %>% 
      tidyr::spread(key = Celltype, value = cell_type_proportion)
  )

2 Creating rse_jx object

This involved:
- Loading all SJ.out.tab files, excluding merged_junctions.SJ.out.tab, which contains only junction coordinates and not counts, etc.
- Creating the rse_jx object with the function create_rse_jx(). As a part of this, rowRanges were annotated using .gtf annotations, with the annotate_junc_ref() from David's dasper package. This function has since been deprecated and replaced by junction_annot. Both functions perform the same task, but junction_annot has been made to run faster. annotate_junc_ref() takes as input the junction co-ordinates and annotates the start and end position based on precise overlap with a known exon boundary. Then using reference annotation, infers the strand of junction and categorises junctions in "annotated", "novel_acceptor", "novel_donor", "novel_combo", novel_exon_skip", "ambig_gene" & "none".

source(here::here("R", "load_sj_df.R"))
source(here::here("R", "create_rse_jx.R"))

sj_df <- 
  load_sj_df(
    file.path(
      path_to_bulk_seq,
      "STAR"
    )
  ) %>% 
  dplyr::filter(Sample != "merged_junctions.SJ.out.tab") %>% 
  dplyr::mutate(Sample = str_replace(Sample, "_.*", ""))

rse_jx <- create_rse_jx(sj_df = sj_df,
                        sample_info = sample_info,
                        sample_id_col = "sample_id") 

saveRDS(rse_jx, 
        file = 
          file.path(
            path_to_results, 
            "STAR/STAR_rse_jx.rds")
        )

3 Removing background noise

This involves removing junctions that may not be genuine products of spliceosome activity.
Filtering removed any junctions that:
- Map to ENCODE blacklist regions
- Form an intron of length < 25 nucleotides. This number is based on theoretical minimum intron length, which in 80% of cases includes (for references refer to Sonia Ruiz' document located locally in the directory: .\SoniaRuiz\20200107_Defining_junctions_length.docx):
  - 6 nt to form the canonical 5' splice signal sequence
  - 2 nt to form the canonical 3' splice signal sequence
  - 5 nt to form the canonical branch point sequence
  - 12 nt to form both the polypyrimidine tract region (commonly located between the BPS and 3' ss) and the AG exclusion zone (which is the region between the BPS and 3'ss marked by the absence of AG dinucleotides)

rse_jx <- 
  readRDS(
    file = 
      file.path(
        path_to_results, 
        "STAR/STAR_rse_jx.rds"
      )
  )

# Load encode blacklist (https://github.com/Boyle-Lab/Blacklist/tree/master/lists)
ENCODE_blacklist_hg38 <- rtracklayer::import("/data/references/ENCODE_blacklist_v2/hg38-blacklist.v2.bed")

ENCODE_blacklist_hg38 <- 
  ENCODE_blacklist_hg38 %>% 
  as.data.frame() %>% 
  dplyr::mutate(chr = str_replace(seqnames, "chr", "")) %>% 
  dplyr::select(-seqnames) %>% 
  makeGRangesFromDataFrame(.,
                           keep.extra.columns = TRUE,
                           seqnames.field = "chr",
                           start.field = "start",
                           end.field = "end",
                           ignore.strand = FALSE)

# Find junctions that overlap with ENCODE blacklist regions
overlapped_junctions <- 
  GenomicRanges::findOverlaps(query = ENCODE_blacklist_hg38,
                                                          subject = rowRanges(rse_jx),
                                                          ignore.strand = F)

ids <- subjectHits(overlapped_junctions)

# Remove junctions overlapping blacklist regions
rse_jx <- rse_jx[-ids,]

# Find junctions with width < 25
width_25 <- which(c(rowRanges(rse_jx) %>% width()) < 25)

rse_jx <- rse_jx[-width_25,]

# Total number of junctions removed
length(ids) + length(width_25) # 19640 + 6620 = 26260

saveRDS(
  rse_jx, 
  file = 
    file.path(
      path_to_results, 
      "STAR/STAR_rse_jx_filtered.rds"
    )
)

4 Proportion annotated/partially annotated/unannotated per group

dds <- 
  readRDS(
    file.path(
      path_to_results,
      "Salmon_quant/bulk_tissue/PD_tissue_polyA_DESeqDataSet_design_SexRINAoD.Rds"
    )
  )
rse_jx <- 
  readRDS(
    file.path(
      path_to_results, 
      "STAR/STAR_rse_jx_filtered.rds"
    )
  )

library_size <- 
  dds %>% 
  assay() %>% # Access count data
  as_tibble() %>% 
  dplyr::mutate(gene_id = rownames(assay(dds))) %>% 
  gather(key = sample_id, value = counts, -gene_id) %>% 
  group_by(sample_id) %>% 
  dplyr::summarise(mapped_read_depth = sum(counts))

filtered_counts <- 
  assay(rse_jx) %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column(var = "junction_id") %>%
  tidyr::gather(key = "sample_id", value = "count", -junction_id) %>% 
  dplyr::filter(count > 0)

annot_ref_df <- rowRanges(rse_jx)[c(filtered_counts %>% .[["junction_id"]]), "junc_cat"] %>% 
  mcols() %>% 
  as.data.frame() %>%
  tibble::rownames_to_column(var = "junction_id") %>% 
  dplyr::inner_join(filtered_counts) %>% 
  dplyr::filter(junc_cat != "ambig_gene") %>% 
  dplyr::group_by(sample_id, junc_cat) %>% 
  dplyr::summarise(n_junc = n(),
                   count = sum(count)) %>% 
  dplyr::inner_join(library_size) %>% 
  dplyr::mutate(total_junc = sum(n_junc),
                total_count = sum(count),
                prop_junc = n_junc/total_junc,
                prop_count = count/total_count,
                prop_mapped_read = count/mapped_read_depth)


# Calculate difference between acceptor and donor for all metrics
# Add disease groups and re-level disease groups + junc_cat
annot_ref_df <- 
  annot_ref_df %>% 
  tidyr::gather(key = "metric", value = "value", -junc_cat, -sample_id) %>% 
  tidyr::spread(key = junc_cat, value = value) %>% 
  dplyr::mutate(diff_acceptor_donor = novel_acceptor - novel_donor) %>% 
  tidyr::gather(key = "junc_cat", value = "value", -sample_id, -metric) %>% 
  tidyr::spread(key = metric, value = value) %>% 
  dplyr::inner_join(colData(rse_jx) %>% 
                      as.data.frame() %>% 
                      rownames_to_column(var = "sample_id") %>% 
                      dplyr::select(sample_id, Disease_Group)) %>% 
  dplyr::mutate(Disease_Group = fct_relevel(Disease_Group,
                                            c("Control", "PD", "PDD", "DLB")),
                junc_cat = fct_relevel(junc_cat,
                                        c("annotated", "novel_acceptor", "novel_donor", "diff_acceptor_donor", "novel_exon_skip", "novel_combo", "none")))

plot_list <- vector(mode = "list", length = 2)

# Plot junction proportions by disease group
plot_list[[1]] <- 
  annot_ref_df %>% 
  ggplot(aes(x = junc_cat, y = prop_junc, fill = Disease_Group)) +
  geom_boxplot(colour = "black") + 
  labs(x = "", y = "Proportion of junctions") +
  ylim(c(0,round(max(annot_ref_df$prop_count), digits = 2))) +
  scale_fill_manual(values = pal_jco("default", alpha = 0.9)(10)[c(3,9,1,2)]) +
  theme_rhr

# Plot count proportions by disease group  
plot_list[[2]] <- 
  annot_ref_df %>% 
  ggplot(aes(x = junc_cat, y = prop_count, fill = Disease_Group)) +
  geom_boxplot(colour = "black") + 
  facet_wrap(vars(junc_cat), scale = "free") +
  labs(x = "", y = "Proportion of counts") +
  scale_fill_manual(values = pal_jco("default", alpha = 0.9)(10)[c(3,9,1,2)]) +
  theme_rhr +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )
  
ggarrange(plotlist = plot_list,
          labels = c("a", "b"),
          common.legend = T, 
          nrow = 2)

Figure 4.1: Proportion of (a) junctions and (b) counts assigned to each annotation type.

Looking at proportions of junctions attributed to each annotation type, the following is observed:
- Across PD and PDD, the proportion of junctions of all annotation types appear to be quite similar to control, with perhaps a slight increase in the proportion of annotated junctions in PDD compared to controls, and likewise a slight decrease in the proportion of partially annotated and unannotated junctions in PDD compared to control.
- For DLB, clear increase in the number of annotated junctions compared to controls, with a corresponding decrease in propotion compared to controls in most other annotation types (i.e. novel acceptor/donor and none).
Looking at the proportion of counts attributed to each annotation type, the opposite pattern is observed. That is, the proportion of counts attributed to annotated junctions decreases from PD --> PDD --> DLB. A similar trend is seen for the partially annotated annotation types, such as exon skip, novel acceptor and novel donor, while for the proportion of counts attributed to the unannotated "none" increases from PD --> PDD --> DLB.
What is worth noting that while the partially annotated annotation types (novel acceptor/donor, novel exon skip, novel combo) make up ~10-15% of junctions, the proportion of counts assigned to these annotation types is much smaller.
For a clearer picture, it may be worth grouping into annotated/partially annotated/unannotated.

# Assign junc_cat to annotation classes i.e annotated/partially annotated/unannotated
annot_ref_df <- 
  annot_ref_df %>% 
  dplyr::mutate(annot_class = ifelse(junc_cat == "annotated", "annotated",
                                    ifelse(junc_cat %in% c("none"), "unannotated", "partially_annotated")) %>% 
                  as.factor()) 

plot_list <- vector(mode = "list", length = 2)

# Plot junction proportions by disease group
plot_list[[1]] <- 
  annot_ref_df %>% 
  dplyr::group_by(Disease_Group, sample_id, annot_class) %>% 
  dplyr::summarise(prop_junc = sum(prop_junc)) %>% 
  ggplot(aes(x = annot_class, y = prop_junc, fill = Disease_Group)) +
  geom_boxplot(colour = "black") + 
  ylim(c(0,round(max(annot_ref_df$prop_count), digits = 2))) +
  labs(x = "", y = "Proportion of junctions") +
  scale_fill_manual(values = pal_jco("default", alpha = 0.9)(10)[c(3,9,1,2)]) +
  theme_rhr

# Plot count proportions by disease group  
plot_list[[2]] <-
  annot_ref_df %>% 
  dplyr::group_by(Disease_Group, sample_id, annot_class) %>% 
  dplyr::summarise(prop_count = sum(prop_count)) %>% 
  dplyr::ungroup() %>%   
  ggplot(aes(x = annot_class, y = prop_count, fill = Disease_Group)) +
  geom_boxplot(colour = "black") +
  facet_zoom(ylim = c(0,0.015), zoom.size = 1) +  
  labs(x = "", y = "Proportion of counts") +
  scale_fill_manual(values = pal_jco("default", alpha = 0.9)(10)[c(3,9,1,2)]) +
  theme_rhr
  
ggarrange(plotlist = plot_list,
          labels = c("a", "b"),
          common.legend = T,
          nrow = 2)

Proportion of (a) junctions and (b) counts assigned to each annotation class. Annotation types from previous plot merged, such that annotated contains only 'annotated', partially annotated contains 'novel_acceptor', 'novel_donor', 'novel_exon_skip', 'novel_combo' and unannotated contains 'none'

Figure 4.2: Proportion of (a) junctions and (b) counts assigned to each annotation class. Annotation types from previous plot merged, such that annotated contains only 'annotated', partially annotated contains 'novel_acceptor', 'novel_donor', 'novel_exon_skip', 'novel_combo' and unannotated contains 'none'

Similar pattern to before, that is:
- Proportion of junctions: PD and PDD have similar proportions of annotated, partially annotated, and unannotated junctions compared to controls. DLB, on the other hand, has increased proportion of annotated junctions and decrease proportions of partially annotated and unannotated junctions.
- Proportion of counts: again, changes are most obvious for the DLB group, with a decreased proportion of counts assigned to annotated and partially annotated junctions compared to controls and an increased proportion of counts assigned to unannotated junctions. The proportion of counts in PD and PDD groups assigned to each annotation type is similar to what is seen in controls, although their median proportion of counts attributed to annotated/partially annotated junctions is decreased compared to controls while the median proportion of counts attributed to unannotated junctions is increaed compared to controls.

5 Covariate correction

Used PC axes derived from expression data

# Load PC axes and join to annot_ref_df 
annot_ref_df <- 
  annot_ref_df %>% 
  dplyr::inner_join(
    readRDS(
      file.path(
        path_to_results,
        "Salmon_quant/bulk_tissue/PD_tissue_polyA_DESeqDataSet_filtered_design_PCaxes_DEG.Rds"
      )
    ) %>% 
      colData() %>% 
      as_tibble() %>% 
      dplyr::select(sample_id, PC1, PC2, PC3, PC4))

# Model: prop ~ Disease_Group
annot_ref_df %>%   
  tidyr::gather(key = prop_type, value = prop, starts_with("prop")) %>% 
  dplyr::group_by(prop_type, junc_cat) %>% 
  do(lm(prop ~ Disease_Group, data = .) %>% 
       broom::tidy()) %>% 
  # dplyr::filter(p.value < 0.05) %>%
  DT::datatable(rownames = FALSE,
                options = list(scrollX = TRUE),
                class = 'white-space: nowrap')

# Model: prop ~ Disease_Group + PC1 + PC2 + PC3 + PC4
annot_ref_df %>%   
  tidyr::gather(key = prop_type, value = prop, starts_with("prop")) %>% 
  dplyr::group_by(prop_type, junc_cat) %>% 
  do(lm(prop ~ Disease_Group + PC1 + PC2 + PC3 + PC4, data = .) %>% 
       broom::tidy()) %>% 
  # dplyr::filter(p.value < 0.05) %>%
  DT::datatable(rownames = FALSE,
                options = list(scrollX = TRUE),
                class = 'white-space: nowrap')

If we test disease group alone, then the DLB group coefficient comes out as significant predictor of the proportion of junctions/counts in almost all annotation types, except exon skip where it is not a significant predictor of proportion of counts. However, upon inclusion of PC1-4, only PC axes come out significant predictors of the proportion of junctions/counts in each annotation type. Given that these PCs are proxies for covariates like AoD, RIN and various cell-type proportions, suggesting these are our predictors.
We can think of the disease as a single grouping variable (as compared to in lm() where each level of the variable is individually tested). We can then use the drop1() function, which essentially tests how dropping each variable from the full model changes the fit. If a variable contributes significantly to the fit of a model, we would expect a full model and one where the term is dropped to be significantly different -- this is represented by the p.value column in the table below.

for(i in 1:length(unique(annot_ref_df$junc_cat))){
  
  df <- drop1(lm(prop_junc ~ Disease_Group + PC1 + PC2 + PC3 + PC4, 
           data = annot_ref_df %>% 
             dplyr::filter(junc_cat == unique(annot_ref_df$junc_cat)[i])), 
        scope = ~., test = "F") %>% 
    tidy() %>% 
    dplyr::mutate(junc_cat = unique(annot_ref_df$junc_cat)[i]) %>% 
    dplyr::select(junc_cat, term, p.value, everything())
  
  if(i == 1){
    anova_df <- df
  } else{
    anova_df <- anova_df %>% 
      bind_rows(df)
  }
  
}

anova_df  %>% 
  DT::datatable(rownames = FALSE,
                options = list(scrollX = TRUE),
                class = 'white-space: nowrap')

for(i in 1:length(unique(annot_ref_df$junc_cat))){
  
  df <- drop1(lm(prop_count ~ Disease_Group + PC1 + PC2 + PC3 + PC4, 
           data = annot_ref_df %>% 
             dplyr::filter(junc_cat == unique(annot_ref_df$junc_cat)[i])), 
        scope = ~., test = "F") %>% 
    tidy() %>% 
    dplyr::mutate(junc_cat = unique(annot_ref_df$junc_cat)[i]) %>% 
    dplyr::select(junc_cat, term, p.value, everything())
  
  if(i == 1){
    anova_df <- df
  } else{
    anova_df <- anova_df %>% 
      bind_rows(df)
  }
  
}

anova_df  %>% 
  DT::datatable(rownames = FALSE,
                options = list(scrollX = TRUE),
                class = 'white-space: nowrap')

Here we see that removing the Disease_Group from the model has no significant effect on the fit of the model i.e. it is not a significant predictor of proportion of junctions/counts.

6 Comparison to Leafcutter-defined introns

Leafcutter intron counts annotated previous in: cluster_validation_btwn_datasets.html
Proportion of each type of junction will be exactly the same across each sample, as the same introns are contained within each sample. Thus, only model proportion counts assigned to each junction category.
Modelled both:
- Proportion of total junction counts: count in junction category divided by total number of junction counts across sample
- Proportion of mapped reads: count in junction category divided by the library size

clu <- 
  readRDS(
    file.path(
      path_to_results,
      "leafcutter/intron_clustering/tissue_polyA_all_clusters_gtfannotated.Rds"
    )
  )

# Calculate proportions
annot_leafcutter <- clu %>% 
  as_tibble() %>% 
  dplyr::mutate(cluster_id_full = str_c(cluster_id, ":", start, "-", end)) %>% 
  dplyr::select(contains("cluster_id"), contains("leafcutter"), junc_cat) %>% 
  tidyr::gather(key = "sample_id", value = "count", -cluster_id, -cluster_id_full, -junc_cat) %>% 
  dplyr::mutate(sample_id = str_remove(sample_id, "_.*")) %>% 
  dplyr::filter(junc_cat != "ambig_gene") %>% 
  dplyr::group_by(sample_id, junc_cat) %>% 
  dplyr::summarise(n_junc = n(),
                   count = sum(count)) %>% 
  dplyr::inner_join(library_size) %>% 
  dplyr::mutate(total_junc = sum(n_junc),
                total_count = sum(count),
                prop_junc = n_junc/total_junc,
                prop_count = count/total_count,
                prop_mapped_read = count/mapped_read_depth)

plot_list <- vector(mode = "list", length = 2)

# Calculate difference between acceptor and donor for all metrics
# Add disease groups and re-level disease groups + junc_cat
annot_leafcutter <- 
  annot_leafcutter %>% 
  tidyr::gather(key = "metric", value = "value", -junc_cat, -sample_id) %>% 
  tidyr::spread(key = junc_cat, value = value) %>% 
  dplyr::mutate(diff_acceptor_donor = novel_acceptor - novel_donor,
                ratio_acceptor_donor = novel_acceptor/novel_donor) %>% 
  tidyr::gather(key = "junc_cat", value = "value", -sample_id, -metric) %>% 
  tidyr::spread(key = metric, value = value) %>% 
  dplyr::inner_join(colData(rse_jx) %>% 
                      as.data.frame() %>% 
                      rownames_to_column(var = "sample_id") %>% 
                      dplyr::select(sample_id, Disease_Group)) %>% 
  dplyr::mutate(Disease_Group = fct_relevel(Disease_Group,
                                            c("Control", "PD", "PDD", "DLB")),
                junc_cat = fct_relevel(junc_cat,
                                        c("annotated", "novel_acceptor", "novel_donor", "diff_acceptor_donor", "ratio_acceptor_donor", "novel_exon_skip", "novel_combo", "none")))

plot_list <- vector(mode = "list", length = 2)

# Plot count proportions by disease group  
plot_list[[1]] <- annot_leafcutter %>% 
  ggplot(aes(x = junc_cat, y = prop_count, fill = Disease_Group)) +
  geom_boxplot(colour = "black") + 
  facet_wrap(vars(junc_cat), scales = "free") +
  labs(x = "", y = "Proportion of total junction counts") +
  scale_fill_manual(values = pal_jco("default", alpha = 0.9)(10)[c(3,9,1,2)]) +
  theme_rhr + 
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

plot_list[[2]] <- annot_leafcutter %>% 
  ggplot(aes(x = junc_cat, y = prop_mapped_read, fill = Disease_Group)) +
  geom_boxplot(colour = "black") + 
  facet_wrap(vars(junc_cat), scales = "free") +
  labs(x = "", y = "Proportion of total mapped reads") +
  scale_fill_manual(values = pal_jco("default", alpha = 0.9)(10)[c(3,9,1,2)]) +
  theme_rhr + 
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

ggarrange(plotlist = plot_list,
          labels = c("a", "b"),
          common.legend = T, 
          nrow = 2)

Figure 6.1: Proportion of (a) total junction counts and (b) mapped reads assigned to each annotation class.

# Load PC axes and join to annot_ref_df 
annot_leafcutter <- 
  annot_leafcutter %>% 
  dplyr::inner_join(
    readRDS(
      file.path(
        path_to_results,
        "Salmon_quant/bulk_tissue/PD_tissue_polyA_DESeqDataSet_filtered_design_PCaxes_DEG.Rds"
      )
    ) %>% 
      colData() %>% 
      as_tibble())

# Model: prop ~ Disease_Group
annot_leafcutter %>%   
  dplyr::select(-prop_junc) %>% 
  tidyr::gather(key = prop_type, value = prop, starts_with("prop")) %>% 
  dplyr::group_by(prop_type, junc_cat) %>% 
  do(lm(prop ~ Disease_Group, data = .) %>% 
       broom::tidy()) %>% 
  # dplyr::filter(p.value < 0.05) %>% 
  DT::datatable(rownames = FALSE,
                options = list(scrollX = TRUE),
                class = 'white-space: nowrap')

# Model: prop ~ Disease_Group + Sex + RIN + AoD
annot_leafcutter %>%   
  dplyr::select(-prop_junc) %>% 
  tidyr::gather(key = prop_type, value = prop, starts_with("prop")) %>% 
  dplyr::group_by(prop_type, junc_cat) %>% 
  do(lm(prop ~ Disease_Group + Sex + RIN + AoD, data = .) %>% 
       broom::tidy()) %>% 
  # dplyr::filter(p.value < 0.05) %>% 
  DT::datatable(rownames = FALSE,
                options = list(scrollX = TRUE),
                class = 'white-space: nowrap')

# Model: prop ~ Disease_Group + PC1 + PC2 + PC3 + PC4
annot_leafcutter %>%   
  dplyr::select(-prop_junc) %>% 
  tidyr::gather(key = prop_type, value = prop, starts_with("prop")) %>% 
  dplyr::group_by(prop_type, junc_cat) %>% 
  do(lm(prop ~ Disease_Group + PC1 + PC2 + PC3 + PC4, data = .) %>% 
       broom::tidy()) %>% 
  # dplyr::filter(p.value < 0.05) %>% 
  DT::datatable(rownames = FALSE,
                options = list(scrollX = TRUE),
                class = 'white-space: nowrap')

Same conclusion as previously. Correcting for PC axes removes any impact of DLB on proportion of junction counts/mapped reads.

7 Distribution of intron lengths

Worth just doing a quick check of intron lengths across STAR and Leafcutter introns, and how these might vary across different types of junction category.

7.1 STAR

rse_jx %>% 
  rowRanges() %>% 
  as_tibble() %>% 
  dplyr::select(junction_id, width, junc_cat) %>% 
  dplyr::group_by(junc_cat) %>% 
  dplyr::summarise(
    median_intron_length = median(width)
  )

rse_jx %>% 
  rowRanges() %>% 
  as_tibble() %>% 
  dplyr::select(junction_id, width, junc_cat) %>% 
  dplyr::filter(junc_cat != "ambig_gene") %>% 
  dplyr::mutate(
    junc_cat = 
      fct_relevel(
        junc_cat,
        c("annotated", "novel_acceptor", "novel_donor", "diff_acceptor_donor", "novel_exon_skip", "novel_combo", "none")
      )
  ) %>% 
  ggplot(aes(x = junc_cat, y = width)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0,50000)) +
  labs(x = "Junction category", y = "Intron length") +
  theme_rhr

Intron length from all STAR-aligned junctions, except those of length < 25 and those that overlapped ENCODE blacklist regions. Y-axis has been limited to show only introns of length 0-50,000.

Figure 7.1: Intron length from all STAR-aligned junctions, except those of length < 25 and those that overlapped ENCODE blacklist regions. Y-axis has been limited to show only introns of length 0-50,000.

7.2 Leafcutter

clu %>% 
  as_tibble() %>% 
  dplyr::select(seqnames, start, end, cluster_id, width, junc_cat) %>% 
  dplyr::group_by(junc_cat) %>% 
  dplyr::summarise(
    median_intron_length = median(width)
  )

clu %>% 
  as_tibble() %>% 
  dplyr::select(seqnames, start, end, cluster_id, width, junc_cat) %>% 
  dplyr::filter(junc_cat != "ambig_gene") %>% 
  dplyr::mutate(
    junc_cat = 
      fct_relevel(
        junc_cat,
        c("annotated", "novel_acceptor", "novel_donor", "diff_acceptor_donor", "novel_exon_skip", "novel_combo", "none")
      )
  ) %>% 
  ggplot(aes(x = junc_cat, y = width)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0,50000)) +
  labs(x = "Junction category", y = "Intron length") +
  theme_rhr

Intron length following filtering to remove introns overlapping ENCODE blacklist regions and filtering by Leafcutter's intron clustering (which also filters by junction counts). Y-axis has been limited to show only introns of length 0-50,000.

Figure 7.2: Intron length following filtering to remove introns overlapping ENCODE blacklist regions and filtering by Leafcutter's intron clustering (which also filters by junction counts). Y-axis has been limited to show only introns of length 0-50,000.

8 Conclusions

Original hypothesis that problems with splicing machinery might be reflected by increase in proportion partially annotated/unannotated junctions. This does not appear to be clear from the proportion of junctions assigned to each annotation type, although perhaps more so from the proportion of junction counts attributed to each annotation type. However, following covariate correction, it would appear that disease is not a significant predictor of proportion of junctions/counts assigned.

STAR: intron annotation

Regina H. Reynolds