Aims: (i) determine effect of covariate correction on gene expression and (ii) determine whether a PC axis can be correlated with changes in the proportions of more than one cell-type.

1 File paths for workflow

source(here::here("R", "file_paths.R"))

2 Data pre-correction

2.1 Loading and filtering data

Prior to differential gene analysis, filtering was performed. This was due to an error thrown by the original analysis with all genes, where GLM can have trouble converging when there is a single row with many 0's or genes with very small counts and little power, which was resolved by filtering prior to differential expression analyses.
Given that PCA plots without filtered data already available in post-quantification QC, will instead use filtered (but not batch-corrected) data initially.

dds <- 
  readRDS(
    file.path(
      path_to_results,
      "Salmon_quant/bulk_tissue/PD_tissue_polyA_DESeqDataSet_design_SexRINAoD.Rds"
    )
  )
dds <- estimateSizeFactors(dds)

# Design formula
design(dds)

## ~Sex + RIN + AoD + Disease_Group

# Filter genes such that only genes with a minimum count of 1 across 100% of samples in each group are retained
filter_count <- DEGreport::degFilter(counts = counts(dds),
                                     metadata = as.data.frame(colData(dds)),
                                     group = "Disease_Group",
                                     min = 1, # All samples in group must have more than expr > 0
                                     minreads = 0)
cat("Genes in final count matrix: ", nrow(filter_count))

## Genes in final count matrix:  28692

filter_dds <- dds[rownames(filter_count),]

2.2 Heatmap of sample-to-sample distances

Transformed count matrix can be transposed (i.e. samples as rows and genes as columns), and the distance between samples can be computed and plotted in a heatmap. This can then provide an overview of the similarities/dissimilarities between samples.
As count data is not normally distributed,vst function from DESeq2 used to apply a variance-stabilising transformation. This transformation is roughly similar to putting the data on the log2 scale, while also dealing with the sampling variability of low counts. See Regularized logarithm transformation from DESeq2 paper for more thorough explanation. N.B. The vst function corrects for library size.
vst transformation has the option of being blinded to experimental design when estimating dispersion. By setting blind = FALSE, vst will use the design formula to calculate within-group variability, and if blind = TRUE it will calculate the across-all-sample variability. It does not use the design to remove variation in the data. Thus, it will not remove variation that can be associated with batch or other covariates.

vsd <- vst(object = filter_dds, 
           blind = TRUE # blind to experimental design   
           )

sampleDists <- vsd %>% 
  assay() %>% 
  t() %>% # Transpose transformed count matrix, such that samples now rows.
  dist(method = "euclidean") # Compute distances between rows of the matrix i.e. sample-to-sample distance
  
sampleDistMatrix <- sampleDists %>% 
  as.matrix() # Convert to matrix

rownames(sampleDistMatrix) <- str_c(vsd$sample_id, ": ", vsd$Disease_Group, ", ", vsd$Sex)
colnames(sampleDistMatrix) <- NULL
pheatmap <- pheatmap(sampleDistMatrix,
                     clustering_distance_rows=sampleDists,
                     clustering_distance_cols=sampleDists,
                     col= viridis(n = 20), 
                     main = "Heatmap: uncorrected data")

pheatmap$tree_row$labels <-str_c(vsd$sample_id, ": ", vsd$Disease_Group, ", ", vsd$Sex)
plot(pheatmap$tree_row, 
     main = "Cluster dendrogram: uncorrected data",
     xlab = "Euclidean distance")

- Based on heatmap of sample-to-sample distances, samples do not appear to convincingly cluster by disease group or sex. Note: this is uncorrected data, which may account for lack of clustering by what might be considered to be the main biological effects.

2.3 Scree plot

A scree plot shows the fraction of total variance explained by each PC.

pca_vsd <- prcomp(t(assay(vsd)))

tibble(PC = c(1:24), 
       sdev = pca_vsd$sdev) %>% 
  dplyr::mutate(d = sdev^2, 
                pev = d/sum(d), 
                cev = cumsum(d)/sum(d))

ggarrange(pcascree(pca_vsd, type = "pev"),
          pcascree(pca_vsd, type = "cev"), 
          nrow = 2,
          labels = c("a", "b"))

Plot of (a) the proportion of variance explained by each PC and (b) the cumulative proportion of variance explained with the addition of each PC.

Figure 2.1: Plot of (a) the proportion of variance explained by each PC and (b) the cumulative proportion of variance explained with the addition of each PC.

2.4 Correlation of PC axes with sample covariates

Easiest way to determine associations between PC and co-variates is to perform a correlation analysis between PCs and co-variates.
Figure and table below show that there is a significant association between PC1 and RIN, PC1 and proportions of excitatory and inhibitory neurons, PC2 and AoD (Age of death) and PC2 and pericye proportions.
Table shows FDR-corrected p-values of association.
Based on these analyses, will be important to correct any analyses for:
- RIN
- AoD
- Sex
- Several cell-type proportions.

# Source function correlate PCs
source(here::here("R", "correlatePCs.R"))

# Select col 1-20 (excludes Unclassified with NAs)
PC_corr <- correlatePCs(pcaobj = pca_vsd, coldata = colData(vsd)[,1:19], pcs = 1:24)

PC_corr_FDR_uncorrected <- PC_corr$pvalue %>% 
  as_tibble() %>% 
  dplyr::mutate(PC = str_c("PC_", c(1:24))) %>% 
  dplyr::select(-sample_id) %>% 
  tidyr::gather(key = covariate, value = p, -PC) %>% 
  dplyr::mutate(FDR = p.adjust(p, method = "fdr"))

PC_corr_FDR_mat <- PC_corr_FDR_uncorrected %>% 
  dplyr::select(-p) %>% 
  tidyr::spread(key = covariate, value = FDR) %>% 
  dplyr::mutate(PC_number = str_replace(PC, "PC_", "") %>% 
                  as.numeric()) %>% 
  dplyr::arrange(PC_number) %>% 
  dplyr::select(-PC_number, -PC) %>% 
  as.matrix()

PC_corr_FDR_mat <- PC_corr_FDR_mat[, match(colnames(PC_corr$statistic), colnames(PC_corr_FDR_mat))]
rownames(PC_corr_FDR_mat) <- PC_corr_FDR_uncorrected %>% .[["PC"]] %>% unique()

heatmap.2(x = -log10(PC_corr_FDR_mat) %>% t(), 
          Colv = NA, 
          Rowv = NA,
          dendrogram = "none",
          col = viridis(n = 50),
          trace = "none",
          density.info = "none", 
          key = TRUE, 
          key.xlab = "-log10(FDR-corrected p-value)", 
          cexRow = 0.65, cexCol = 0.65)

Heatmap of significance of (cor)relations between PCA scores and sample covariates. Significance computed using Kruskal-Wallis for categorical variables and the cor.test based on Spearman's correlation for continuous variables. P-values were FDR-corrected.

Figure 2.2: Heatmap of significance of (cor)relations between PCA scores and sample covariates. Significance computed using Kruskal-Wallis for categorical variables and the cor.test based on Spearman's correlation for continuous variables. P-values were FDR-corrected.

Worth noting that the "strongest effect" (non-significant) of disease group was with PC8. Cannot speak of correlations with disease group as testing using Kruskal-Wallis

3 Data following correction with AoD, RIN and sex

As setting blind = FALSE in vst does not remove variation due to confounding covariates, this will have to be done using limma::removeBatchEffect, which removes any shifts in the log2-scale expression data that can be explained by confounders.
This batch-corrected data can then be plotted in a similar manner to above.
Note: scale() function is used to ensure that overall mean expression of each gene (i.e. as calculated with rowMeans()) remains unaffected by limma::removeBatchEffect. In general, scale() is also recommended as it ensures variables that are measured in different scales (e.g. age of death vs RIN) are comparable.

vsd_covar <- vst(filter_dds,
                     blind = FALSE)

# Batch correcting and filtering vsd data 
design <- model.matrix(design(dds), data = colData(dds))

design

##       (Intercept) SexM RIN AoD Disease_GroupPD Disease_GroupPDD
## PD115           1    1 5.2  63               0                0
## PD163           1    1 5.8  76               0                0
## PD294           1    1 4.2  78               0                0
## PD332           1    1 5.3  77               0                0
## PD341           1    1 6.8  68               0                1
## PD366           1    1 5.2  69               0                1
## PD413           1    0 7.3  72               1                0
## PD415           1    0 6.5  78               0                1
## PD416           1    0 6.5  85               1                0
## PD523           1    0 5.4  86               1                0
## PD531           1    1 6.6  80               0                1
## PD563           1    1 6.7  75               0                1
## PD566           1    1 4.8  63               0                0
## PD666           1    0 6.6  93               1                0
## PD678           1    1 5.1  86               0                1
## PD683           1    0 7.5  83               1                0
## PD706           1    1 6.9  67               0                0
## PD732           1    1 7.8  81               1                0
## PD747           1    1 7.5  83               1                0
## PDC05           1    1 7.6  58               0                0
## PDC22           1    1 6.8  65               0                0
## PDC34           1    1 7.4  90               0                0
## PDC87           1    0 5.0  92               0                0
## PDC91           1    1 6.4  85               0                0
##       Disease_GroupDLB
## PD115                1
## PD163                1
## PD294                1
## PD332                1
## PD341                0
## PD366                0
## PD413                0
## PD415                0
## PD416                0
## PD523                0
## PD531                0
## PD563                0
## PD566                1
## PD666                0
## PD678                0
## PD683                0
## PD706                1
## PD732                0
## PD747                0
## PDC05                0
## PDC22                0
## PDC34                0
## PDC87                0
## PDC91                0
## attr(,"assign")
## [1] 0 1 2 3 4 4 4
## attr(,"contrasts")
## attr(,"contrasts")$Sex
## [1] "contr.treatment"
## 
## attr(,"contrasts")$Disease_Group
## [1] "contr.treatment"

design_treatment <- design[,c("(Intercept)", "Disease_GroupPD", "Disease_GroupPDD", "Disease_GroupDLB")]
design_batch <- design[,c("SexM", "RIN", "AoD")] %>% 
  # standardise covariates
  scale()

assay(vsd_covar) <- limma::removeBatchEffect(assay(vsd_covar), design = design_treatment, covariates = design_batch)

# Check that rowMeans are equal following removal of batch effects
identical(c(assay(vsd) %>% 
              rowMeans() %>%
              round(digits = 5)),
          c(assay(vsd_covar) %>% 
              rowMeans() %>%
              round(digits = 5)))

## [1] FALSE

3.1 Heatmap of sample-to-sample distances

If we now plot the corrected vst-transformed data we see some clustering by disease group.
- Almost all four controls now clustering together in one larger cluster (see cluster furthest to left under dotted line in cluster dendrogram)
- Some clustering of female PD cases (PD683, PD555, PD413), although other two (PD523 and PD416) cluster in another cluster.
- DLB form a tighter cluster than observed for other disease groups, with four DLB cases clustering together, while the other two (PD153 and PD566) cluster with PDD cases.

sampleDists <- vsd_covar %>% 
  assay() %>% 
  t() %>% # Transpose transformed count matrix, such that samples now rows.
  dist(method = "euclidean") # Compute distances between rows of the matrix i.e. sample-to-sample distance
  
sampleDistMatrix <- sampleDists %>% 
  as.matrix() # Convert to matrix

rownames(sampleDistMatrix) <- str_c(vsd_covar$sample_id, ": ", vsd_covar$Disease_Group, ", ", vsd_covar$Sex)
colnames(sampleDistMatrix) <- NULL
pheatmap <- pheatmap(sampleDistMatrix,
                     clustering_distance_rows=sampleDists,
                     clustering_distance_cols=sampleDists,
                     col= viridis(n = 20), 
                     main = "Heatmap: corrected by AoD, RIN, Sex")

pheatmap$tree_row$labels <- str_c(vsd_covar$sample_id, ": ", vsd_covar$Disease_Group, ", ", vsd_covar$Sex)
plot(pheatmap$tree_row, 
     main = "Cluster dendrogram: corrected by AoD, RIN, Sex",
     xlab = "Euclidean distance")
abline(h = 145, col = "black", lty = 2, lwd = 2)

3.2 Scree plot

pca_vsd_covar <- prcomp(t(assay(vsd_covar)))

ggarrange(pcascree(pca_vsd_covar, type = "pev"),
          pcascree(pca_vsd_covar, type = "cev"), 
          nrow = 2,
          labels = c("a", "b"))

Figure 3.1: Plot of (a) the proportion of variance explained by each PC and (b) the cumulative proportion of variance explained with the addition of each PC.

3.3 Correlation of PC axes with sample covariates

Significant correlation of proportions of astrocytes with PC1.
Worth noting that "strongest effect" (non-significant) of disease group is now on PC1 -- also clear from reduced FDR.

PC_corr <- 
  correlatePCs(pcaobj = pca_vsd_covar, 
               coldata = colData(vsd_covar)[!colnames(colData(vsd_covar)) %in% c("Sex", "AoD", "RIN",
                                                               str_c("PC", 1:4),
                                                               "sizeFactor")], 
               pcs = 1:24)

PC_corr_FDR_covar <- PC_corr$pvalue %>% 
  as_tibble() %>% 
  dplyr::mutate(PC = str_c("PC_", c(1:24))) %>% 
  dplyr::select(-sample_id) %>% 
  tidyr::gather(key = covariate, value = p, -PC) %>% 
  dplyr::mutate(FDR = p.adjust(p, method = "fdr"))

PC_corr_FDR_covar %>% 
  dplyr::inner_join(PC_corr$statistic %>% 
               as_tibble() %>% 
               dplyr::mutate(PC = str_c("PC_", c(1:24))) %>% 
               dplyr::select(-sample_id) %>% 
               tidyr::gather(key = covariate, value = test_statistic, -PC)) %>% 
  dplyr::arrange(FDR)

PC_corr_FDR_mat <- PC_corr_FDR_covar %>% 
  dplyr::select(-p) %>% 
  tidyr::spread(key = covariate, value = FDR) %>% 
  dplyr::mutate(PC_number = str_replace(PC, "PC_", "") %>% 
                  as.numeric()) %>% 
  dplyr::arrange(PC_number) %>% 
  dplyr::select(-PC_number, -PC) %>% 
  as.matrix()
  
PC_corr_FDR_mat <- PC_corr_FDR_mat[, match(colnames(PC_corr$statistic), colnames(PC_corr_FDR_mat))]
rownames(PC_corr_FDR_mat) <- PC_corr_FDR_covar %>% .[["PC"]] %>% unique()

heatmap.2(x = -log10(PC_corr_FDR_mat) %>% t(), 
          Colv = NA, 
          Rowv = NA,
          dendrogram = "none",
          col = viridis(n = 50),
          trace = "none",
          density.info = "none", 
          key = TRUE, 
          key.xlab = "-log10(FDR-corrected p-value)", 
          cexRow = 0.65, cexCol = 0.65)

Heatmap of significance of (cor)relations between PCA scores (from vst-transformed data corrected by AoD, RIN, and Sex) and sample covariates. Significance computed using Kruskal-Wallis for categorical variables and the cor.test based on Spearman's correlation for continuous variables. P-values were FDR-corrected.

Figure 3.2: Heatmap of significance of (cor)relations between PCA scores (from vst-transformed data corrected by AoD, RIN, and Sex) and sample covariates. Significance computed using Kruskal-Wallis for categorical variables and the cor.test based on Spearman's correlation for continuous variables. P-values were FDR-corrected.

4 Data following correction with PC axes

If we filter the uncorrected PC correlates for FDR < 0.1, we see that of those PC axes significantly correlated with covariates, only PC axes 1-4 are relevant.
It may be preferable to correct using PC axes because the number of covariates included in design is lower using PC axes (only 4), as opposed to individual covariates (where we would have to include 6), thus increasing the degrees of freedom in the final model.

We can extract these PCs from the pca object returned by prcomp() and use these to correct our vst-transformed data, as opposed to using covariates.
In DESeq2 documentation, it is advised that blind=FALSE should be used for transforming data for downstream analysis, where the full use of the design information should be made. Extract from (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html): "...blind dispersion estimation is not the appropriate choice if one expects that many or the majority of genes (rows) will have large differences in counts which are explainable by the experimental design, and one wishes to transform the data for downstream analysis. In this case, using blind dispersion estimation will lead to large estimates of dispersion, as it attributes differences due to experimental design as unwanted noise, and will result in overly shrinking the transformed values towards each other. By setting blind to FALSE, the dispersions already estimated will be used to perform transformations, or if not present, they will be estimated using the current design formula. Note that only the fitted dispersion estimates from mean-dispersion trend line are used in the transformation (the global dependence of dispersion on mean for the entire experiment). So setting blind to FALSE is still for the most part not using the information about which samples were in which experimental group in applying the transformation.".
Thus, blind = FALSE used -- design information included sex, RIN, AoD and disease group -- all of which might have an impact on dispersion information.

vsd <- vst(object = filter_dds, 
           blind = FALSE
           )

pca_vsd_temp <- prcomp(t(assay(vsd)))


# Extract coordinates of individuals on principle components 1-4
PC_axes <- pca_vsd_temp$x[, str_c("PC", 1:4)] %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column(var = "sample_id")

# Include PC axes in filter_dds colData & design
filter_dds <- DESeqDataSetFromMatrix(countData = assay(filter_dds),
                              colData = colData(filter_dds) %>% 
                                as_tibble() %>% 
                                dplyr::inner_join(PC_axes),
                              design = ~ PC1 + PC2 + PC3 + PC4 + Disease_Group
                              # design = ~ PC1 + PC2 + PC3 + Disease_Group
                              )

vsd_PCs <- vst(filter_dds,
               blind = FALSE)

# Batch correcting and filtering vsd data 
design <- model.matrix(design(filter_dds), data = colData(filter_dds))

design

##       (Intercept)         PC1         PC2          PC3         PC4
## PD115           1 -33.7044981  52.4543091 -29.87346240  47.4180149
## PD163           1 -15.6289549  42.5407558 -11.73420114   3.7483976
## PD294           1 -68.3858139  12.2262443   8.30707515   3.2666959
## PD332           1 -42.4353524   0.2549566 -31.16960757  -3.7778382
## PD341           1   8.3189876 -31.7616303 -15.92128848 -11.3993829
## PD366           1 -17.1162086  87.6354674  23.08660041 -37.8240825
## PD413           1  78.4424456   3.2497646   3.11884920   2.2874986
## PD415           1 -35.2256957 -46.7325064 -36.51937057 -18.2474939
## PD416           1 -22.9678355 -30.4400740 -16.99754425 -24.6111455
## PD523           1 -76.7055717 -14.7059928   4.18557155 -14.7063713
## PD531           1 -12.1729726  -5.5772149   1.93735880  47.2932426
## PD563           1  47.0340242  -9.9604452   1.92854852 -29.5416805
## PD566           1 -10.4034995  88.4266823  23.91318866 -16.2760322
## PD666           1   8.7735545 -37.3889648  -6.71685389 -13.9364552
## PD678           1 -76.4683970 -19.4924417  -5.23019929  -5.3715995
## PD683           1  53.0447040   9.6930507   6.78920711   5.2380875
## PD706           1  -0.2844714  15.9705954 -44.03420828  56.4641793
## PD732           1  38.1857756 -35.5831394 -10.72723061  10.7899832
## PD747           1  50.5545676 -17.2890476  -7.23761901   0.4222244
## PDC05           1  98.5452954   6.2628661  -0.02449618  30.0139473
## PDC22           1  45.7555522  13.3556441  23.97440950 -24.5683731
## PDC34           1  36.9585214  -9.8641986  17.71501302 -25.4472610
## PDC87           1 -41.6316552 -48.6639688 111.83282607  40.9933366
## PDC91           1 -12.4825017 -24.6107122 -10.60256632 -22.2278921
##       Disease_GroupPD Disease_GroupPDD Disease_GroupDLB
## PD115               0                0                1
## PD163               0                0                1
## PD294               0                0                1
## PD332               0                0                1
## PD341               0                1                0
## PD366               0                1                0
## PD413               1                0                0
## PD415               0                1                0
## PD416               1                0                0
## PD523               1                0                0
## PD531               0                1                0
## PD563               0                1                0
## PD566               0                0                1
## PD666               1                0                0
## PD678               0                1                0
## PD683               1                0                0
## PD706               0                0                1
## PD732               1                0                0
## PD747               1                0                0
## PDC05               0                0                0
## PDC22               0                0                0
## PDC34               0                0                0
## PDC87               0                0                0
## PDC91               0                0                0
## attr(,"assign")
## [1] 0 1 2 3 4 5 5 5
## attr(,"contrasts")
## attr(,"contrasts")$Disease_Group
## [1] "contr.treatment"

# Split into treatment and batch effects
design_treatment <- design[,c("(Intercept)", "Disease_GroupPD", "Disease_GroupPDD", "Disease_GroupDLB")]
design_batch <- 
  design[,str_c("PC", 1:4)] %>%
  # design[,str_c("PC", 1:3)] %>% 
  # standardise covariates
  scale()

# Correction with limma
assay(vsd_PCs) <- limma::removeBatchEffect(assay(vsd_PCs), design = design_treatment, covariates = design_batch)

4.1 Heatmap of sample-to-sample distances

If we now plot the vst-transformed data corrected by PC axes we see an even clearer clustering by disease group.
- All DLB cases cluster together in one cluster that is separate from all other disease groups.
- For the most part, controls cluster together, with the exception of PDC91 (although it is part of the bigger cluster which includes some PD cases -- see under dotted line on cluster)
- PD and PDD do not appear to cluster separately

sampleDists <- vsd_PCs %>% 
  assay() %>% 
  t() %>% # Transpose transformed count matrix, such that samples now rows.
  dist(method = "euclidean") # Compute distances between rows of the matrix i.e. sample-to-sample distance
  
sampleDistMatrix <- sampleDists %>% 
  as.matrix() # Convert to matrix

rownames(sampleDistMatrix) <- str_c(vsd_PCs$sample_id, ": ", vsd_PCs$Disease_Group, ", ", vsd_PCs$Sex)
colnames(sampleDistMatrix) <- NULL
pheatmap <- pheatmap(sampleDistMatrix,
                     clustering_distance_rows=sampleDists,
                     clustering_distance_cols=sampleDists,
                     col= viridis(n = 20), 
                     main = "Heatmap: corrected by PC axes 1-4")

pheatmap$tree_row$labels <- str_c(vsd_PCs$sample_id, ": ", vsd_PCs$Disease_Group)
plot(pheatmap$tree_row, 
     main = "Cluster dendrogram: corrected by PC axes 1-4",
     xlab = "Euclidean distance")
abline(h = 85.5, col = "black", lty = 2, lwd = 2)

4.2 Scree plot

pca_vsd_PCs <- prcomp(t(assay(vsd_PCs)))

ggarrange(pcascree(pca_vsd_PCs, type = "pev"),
          pcascree(pca_vsd_PCs, type = "cev"), 
          nrow = 2,
          labels = c("a", "b"))

Figure 4.1: Plot of (a) the proportion of variance explained by each PC and (b) the cumulative proportion of variance explained with the addition of each PC.

4.3 Correlation of PC axes with sample covariates

Significant correlation of OPC proportions and disease group with PC axes.
It is worth noting that if we plot correlations amongst covariates, then OPC proportions, vascular proportions and alpha-synuclein scoring are correlated with disease group. Further, with testing, Disease Group, alpha-synuclein and OPC proportions are all significantly correlated with PC1.
This suggests that we may want to explore further corrections for OPC proportions.

PC_corr_PCs <- correlatePCs(pcaobj = pca_vsd_PCs, coldata = colData(vsd_PCs)[1:19], pcs = 1:24)

PC_corr_FDR_PCs <- PC_corr_PCs$pvalue %>% 
  as_tibble() %>% 
  dplyr::mutate(PC = str_c("PC_", c(1:24))) %>% 
  dplyr::select(-sample_id) %>% 
  tidyr::gather(key = covariate, value = p, -PC) %>% 
  dplyr::mutate(FDR = p.adjust(p, method = "fdr"))

PC_corr_FDR_PCs %>% 
  dplyr::inner_join(PC_corr_PCs$statistic %>% 
               as_tibble() %>% 
               dplyr::mutate(PC = str_c("PC_", c(1:24))) %>% 
               dplyr::select(-sample_id) %>% 
               tidyr::gather(key = covariate, value = test_statistic, -PC)) %>% 
  dplyr::arrange(FDR)

DEGreport::degCorCov(colData(vsd))

Figure 4.2: Plot of correlations between sample covariates.

PC_corr_FDR_mat <- PC_corr_FDR_PCs %>% 
  dplyr::select(-p) %>% 
  tidyr::spread(key = covariate, value = FDR) %>% 
  dplyr::mutate(PC_number = str_replace(PC, "PC_", "") %>% 
                  as.numeric()) %>% 
  dplyr::arrange(PC_number) %>% 
  dplyr::select(-PC_number, -PC) %>% 
  as.matrix()
  
PC_corr_FDR_mat <- PC_corr_FDR_mat[, match(colnames(PC_corr_PCs$statistic), colnames(PC_corr_FDR_mat))]
rownames(PC_corr_FDR_mat) <- PC_corr_FDR_covar %>% .[["PC"]] %>% unique()

heatmap.2(x = -log10(PC_corr_FDR_mat) %>% t(), 
          Colv = NA, 
          Rowv = NA,
          dendrogram = "none",
          col = viridis(n = 50),
          trace = "none",
          density.info = "none", 
          key = TRUE, 
          key.xlab = "-log10(FDR-corrected p-value)", 
          cexRow = 0.65, cexCol = 0.65)

Heatmap of significance of (cor)relations between PCA scores (from vst-transformed data corrected by PC axes 1-4) and sample covariates. Significance computed using Kruskal-Wallis for categorical variables and the cor.test based on Spearman's correlation for continuous variables. P-values were FDR-corrected.

Figure 4.3: Heatmap of significance of (cor)relations between PCA scores (from vst-transformed data corrected by PC axes 1-4) and sample covariates. Significance computed using Kruskal-Wallis for categorical variables and the cor.test based on Spearman's correlation for continuous variables. P-values were FDR-corrected.

5 Conclusions

We can summarise the above corrections and their effect on clustering by disease group by plotting PC1 and PC2 and colouring by disease group.
Without correction, 95% confidence ellipses overlap and it is hard to observe any clustering by disease group.
With correction by AoD, Sex and RIN, some clustering by disease group occurs, although 95% confidence ellipses still overlap to some extent.
With correction by PC axes 1-4, we see a clear clustering by disease group, with no overlap between 95% confidence ellipses. Further when new PC axes are derived following correction, disease group, alpha-synuclein staining and OPC proportions are found to significantly correlate with PC1.
From this, we can conclude that we should be accounting for PC axes 1-4 in our differential expression model.

list_vsd <- list(pca_vsd, pca_vsd_covar, pca_vsd_PCs)
plot_list <- vector(mode = "list", length = 3)
titles <- c("Uncorrected", "Corrected by AoD, Sex, RIN", "Corrected by PC axes 1-4")

for(i in 1:length(list_vsd)){
  
  plot_list[[i]] <- fviz_pca_ind(list_vsd[[i]],
                                 geom.ind = "point", # show points only (nbut not "text")
                                 col.ind = colData(vsd)[,c("Disease_Group")], # color by groups
                                 addEllipses = TRUE, ellipse.type = "confidence", # Confidence ellipses
                                 palette =  pal_jco("default", alpha = 0.9)(10)[c(3,9,1,2)],
                                 mean.point = FALSE,
                                 legend.title = "Disease group", 
                                 title = titles[i]
  )
  
}

ggarrange(plotlist = plot_list,
          labels = c("a", "b", "c"),
          nrow = 3)

PCA plot of individuals grouped by disease group. Each plot represents data that is (a) uncorrected, (b) corrected by AoD, Sex, RIN and (c) corrected by PC axes 1-4. Ellipses represent the 95% confidence level. Points and ellipses are coloured by disease group.

Figure 5.1: PCA plot of individuals grouped by disease group. Each plot represents data that is (a) uncorrected, (b) corrected by AoD, Sex, RIN and (c) corrected by PC axes 1-4. Ellipses represent the 95% confidence level. Points and ellipses are coloured by disease group.

# saveRDS(list_vsd, here::here("workflows", "figures","list_vsd_correction_strategies.Rds"))

6 Session Info

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] vsn_3.54.0                  viridis_0.5.1              
##  [3] viridisLite_0.3.0           readxl_1.3.1               
##  [5] pcaExplorer_2.12.0          forcats_0.5.1              
##  [7] stringr_1.4.0               dplyr_1.0.2                
##  [9] purrr_0.3.4                 readr_1.4.0                
## [11] tidyr_1.1.1                 tibble_3.0.3               
## [13] tidyverse_1.3.0             pheatmap_1.0.12            
## [15] ggsci_2.9                   ggpubr_0.4.0               
## [17] ggrepel_0.8.2               gplots_3.0.4               
## [19] factoextra_1.0.7            ggplot2_3.3.2              
## [21] DT_0.15                     devtools_2.3.2             
## [23] usethis_1.6.3               DESeq2_1.26.0              
## [25] SummarizedExperiment_1.16.1 DelayedArray_0.12.3        
## [27] BiocParallel_1.20.1         matrixStats_0.56.0         
## [29] Biobase_2.46.0              GenomicRanges_1.38.0       
## [31] GenomeInfoDb_1.22.1         IRanges_2.20.2             
## [33] S4Vectors_0.24.4            BiocGenerics_0.32.0        
## [35] DEGreport_1.22.0           
## 
## loaded via a namespace (and not attached):
##   [1] rappdirs_0.3.1              SparseM_1.78               
##   [3] AnnotationForge_1.28.0      pkgmaker_0.31.1            
##   [5] bit64_4.0.2                 knitr_1.29                 
##   [7] data.table_1.13.0           rpart_4.1-15               
##   [9] RCurl_1.98-1.2              doParallel_1.0.15          
##  [11] generics_0.0.2              preprocessCore_1.48.0      
##  [13] callr_3.5.1                 cowplot_1.0.0              
##  [15] RSQLite_2.2.0               bit_4.0.4                  
##  [17] xml2_1.3.2                  lubridate_1.7.9            
##  [19] httpuv_1.5.4                assertthat_0.2.1           
##  [21] xfun_0.16                   hms_1.0.0                  
##  [23] evaluate_0.14               promises_1.1.1             
##  [25] progress_1.2.2              caTools_1.18.0             
##  [27] dbplyr_1.4.4                Rgraphviz_2.30.0           
##  [29] igraph_1.2.5                DBI_1.1.1                  
##  [31] geneplotter_1.64.0          tmvnsim_1.0-2              
##  [33] htmlwidgets_1.5.3           reshape_0.8.8              
##  [35] ellipsis_0.3.1              crosstalk_1.1.0.1          
##  [37] backports_1.1.8             bookdown_0.21              
##  [39] annotate_1.64.0             gridBase_0.4-7             
##  [41] biomaRt_2.42.1              vctrs_0.3.2                
##  [43] here_1.0.0                  remotes_2.2.0              
##  [45] Cairo_1.5-12.2              abind_1.4-5                
##  [47] cachem_1.0.3                withr_2.2.0                
##  [49] lasso2_1.2-20               checkmate_2.0.0            
##  [51] prettyunits_1.1.1           mnormt_2.0.2               
##  [53] cluster_2.1.0               crayon_1.4.1               
##  [55] genefilter_1.68.0           labeling_0.4.2             
##  [57] edgeR_3.28.1                pkgconfig_2.0.3            
##  [59] nlme_3.1-141                pkgload_1.1.0              
##  [61] nnet_7.3-12                 rlang_0.4.7                
##  [63] lifecycle_0.2.0             registry_0.5-1             
##  [65] affyio_1.56.0               BiocFileCache_1.10.2       
##  [67] GOstats_2.52.0              modelr_0.1.8               
##  [69] cellranger_1.1.0            rprojroot_2.0.2            
##  [71] rngtools_1.5                graph_1.64.0               
##  [73] Matrix_1.2-17               carData_3.0-4              
##  [75] reprex_1.0.0                base64enc_0.1-3            
##  [77] GlobalOptions_0.1.2         processx_3.4.5             
##  [79] png_0.1-7                   rjson_0.2.20               
##  [81] bitops_1.0-6                shinydashboard_0.7.1       
##  [83] ConsensusClusterPlus_1.50.0 KernSmooth_2.23-15         
##  [85] blob_1.2.1                  shape_1.4.4                
##  [87] shinyAce_0.4.1              jpeg_0.1-8.1               
##  [89] rstatix_0.6.0               ggsignif_0.6.0             
##  [91] scales_1.1.1                GSEABase_1.48.0            
##  [93] memoise_2.0.0               magrittr_2.0.1             
##  [95] plyr_1.8.6                  bibtex_0.4.3               
##  [97] gdata_2.18.0                zlibbioc_1.32.0            
##  [99] threejs_0.3.3               compiler_3.6.1             
## [101] RColorBrewer_1.1-2          d3heatmap_0.6.1.2          
## [103] clue_0.3-57                 affy_1.64.0                
## [105] cli_2.2.0.9000              XVector_0.26.0             
## [107] Category_2.52.1             ps_1.3.4                   
## [109] htmlTable_2.0.1             Formula_1.2-4              
## [111] MASS_7.3-51.4               tidyselect_1.1.0           
## [113] stringi_1.5.3               shinyBS_0.61               
## [115] highr_0.8                   yaml_2.2.1                 
## [117] askpass_1.1                 locfit_1.5-9.4             
## [119] latticeExtra_0.6-29         grid_3.6.1                 
## [121] tools_3.6.1                 rio_0.5.16                 
## [123] circlize_0.4.10             rstudioapi_0.13            
## [125] foreach_1.5.0               foreign_0.8-72             
## [127] logging_0.10-108            gridExtra_2.3              
## [129] farver_2.0.3                BiocManager_1.30.10        
## [131] digest_0.6.27               shiny_1.5.0                
## [133] Rcpp_1.0.5                  car_3.0-9                  
## [135] broom_0.7.0                 later_1.1.0.1              
## [137] httr_1.4.2                  Nozzle.R1_1.1-1            
## [139] ggdendro_0.1.21             AnnotationDbi_1.48.0       
## [141] ComplexHeatmap_2.7.7        psych_2.0.7                
## [143] colorspace_2.0-0            rvest_0.3.6                
## [145] XML_3.99-0.3                fs_1.5.0                   
## [147] topGO_2.38.1                splines_3.6.1              
## [149] RBGL_1.62.1                 sessioninfo_1.1.1          
## [151] xtable_1.8-4                jsonlite_1.7.1             
## [153] testthat_2.3.2              R6_2.5.0                   
## [155] Hmisc_4.4-1                 pillar_1.4.6               
## [157] htmltools_0.5.1.1           mime_0.9                   
## [159] NMF_0.23.0                  glue_1.4.2                 
## [161] fastmap_1.0.1               codetools_0.2-16           
## [163] pkgbuild_1.1.0              lattice_0.20-38            
## [165] curl_4.3                    gtools_3.8.2               
## [167] zip_2.1.0                   GO.db_3.10.0               
## [169] openxlsx_4.2.3              openssl_1.4.2              
## [171] survival_3.2-7              limma_3.42.2               
## [173] rmarkdown_2.5               desc_1.2.0                 
## [175] munsell_0.5.0               GetoptLong_1.0.2           
## [177] GenomeInfoDbData_1.2.2      iterators_1.0.12           
## [179] reshape2_1.4.4              haven_2.3.1                
## [181] gtable_0.3.0

Post-correction PCA

Regina H. Reynolds