Aim: identification of (i) differentially expressed genes and (ii) differentially spliced genes, and their associated molecular pathways, that distinguish between control, PD, PDD and DLB.

1 File paths for workflow

source(here::here("R", "file_paths.R"))

2 Background

2.1 Sample details and demographics

2.1.1 Column descriptions

2.1.2 Table of samples

2.1.3 Correlations between covariates

  • Using the package DEGreport can easily plot the correlation amongst covariates.
dds <- 
  readRDS(
    file.path(
      path_to_results,
      "Salmon_quant/bulk_tissue/PD_tissue_polyA_DESeqDataSet_design_SexRINAoD.Rds"
    )
  )

DEGreport::degCorCov(colData(dds))
Plot of correlations between sample covariates.

Figure 2.1: Plot of correlations between sample covariates.

  • As might be expected, alpha-synuclein staging was correlated with disease group. Interestingly, OPC proportions were also correlated with disease group. In addition, proportions of excitatory and inhibitory neurons were correlated.

2.2 Methodological workflow

All steps below, describe the workflow for bulk-tissue RNA-sequencing:

  1. Data pre-processing
    • Pre-alignment QC using fastp combined with fastQC.
    • Alignment using STAR.
    • Post-alignment QC using RNASeqQC.
  2. Gene-level quantification
    • Salmon used to quantify gene-level reads.
  3. Estimation of cell-type proportions (deconvolution)
    • Estimate cell-type proportions for use in differential expression/splicing models.
    • Used Scaden.
  4. Differential gene-level expression
    • DESeq2 used for differential expression analyses.
  5. Differential splicing
    • Leafcutter used. Defines intron clusters by shared junctions and calculates differential usage of introns within a cluster across case/controls.
  6. Replication with external dataset
    • Used PD case/control study available on recount2 (ctrl = 44, case = 29). Study SRP058181 on https://jhubiostatistics.shinyapps.io/recount/.
    • Used this dataset for validation of Scaden-predicted cell-type proportions & differential splicing.
  7. Integration with GWAS
    • Heritability enrichment analyses using stratified LD score regression.
    • Genetic association analyses using H-MAGMA. H-MAGMA advances MAGMA by incorporating chromatin interaction profiles from human brain tissue.
  8. Gene set enrichment

3 Data pre-processing

3.1 File transfer

3.2 Pre-alignment QC

  • Fastq files were de-multiplexed by ICR sequencing centre, using bcl2fastq and the following flags.
    • --mask-short-adapter-reads 0
    • --use-bases-mask Y*,I8Y7,I8,Y*
    • --no-lane-splitting
    • --output-dir
    • --runfolder-dir
    • --sample-sheet
  • See the following files for a nohup log of this provided by the ICR: /data/RNAseq_PD/tissue_polyA_samples/raw_data/bcl2fastq_demultiplex_nohup_1091.e.
  • Samples were trimmed using fastp (v 0.20.0).
# prefix to file names always NM followed by 4 digits, therefore "NM...._", will remove anything between 'NM' and '_' 
nohup Rscript \
/home/rreynolds/packages/RNAseqProcessing/QC/prealignmentQC_fastp_PEadapters.R \
/data/RNAseq_PD/tissue_polyA_samples/raw_data \
/data/RNAseq_PD/tissue_polyA_samples/QC \
PD_tissue_polyA \
NM...._ \
_.* \
&>/home/rreynolds/projects/Aim2_PDsequencing_wd/LBD-seq-bulk-analyses/nohup_logs/PD_tissue_polyA_fastp.log&

3.2.1 Results

3.3 Alignment

  • Using STAR (v2.7.0a)
  • Read groups generated and added to BAMs during STAR alignment to allow for future de-duplication, if necessary. Generation of read group headings requires information regarding flow cell, lane and adaptors for each sample, which can be found within the raw files from the sequencing provider. For this project this is summarised here: download flowcell information
  • Aligned to genome build GRCh38, and ensembl v97 .gtf file used for annotation.
  • Multi-sample 2-pass star was run. This is performed in three steps:
    1. Run 1st mapping pass for all samples with "usual" parameters.
    2. Merge all SJ.out.tab files from all samples and remove duplicated junctions.
    3. Run 2nd mapping pass for all samples with "usual" parameters, and the addition of the merged SJ.out.tab file in the --sjdbFileChrStartEnd flag.
  • Undetermined fastq (i.e. reads that weren't assigned to a sample) not aligned.

3.3.1 STAR 1st pass

# STAR 1st pass
nohup Rscript \
/home/rreynolds/packages/RNAseqProcessing/alignment/STAR_alignment_withReadGroups_multi2pass.R \
/data/RNAseq_PD/tissue_polyA_samples/QC/fastp \
/data/STAR_data/genome_index_hg38_ens_v97/sjdbOverhang_99 \
/data/RNAseq_PD/tissue_polyA_samples/STAR \
--sample_prefix=NM...._ \
--sample_suffix=_S.* \
--read_groups=/home/rreynolds/projects/Aim2_PDsequencing_wd/raw_data/sample_details/Flowcell_info.txt \
&>/home/rreynolds/projects/Aim2_PDsequencing_wd/LBD-seq-bulk-analyses/nohup_logs/PD_tissue_polyA_STAR_1pass.log&

# SJ.out.tab files moved to SJ_out_1pass folder
mv /data/RNAseq_PD/tissue_polyA_samples/STAR/*.SJ.out.tab /data/RNAseq_PD/tissue_polyA_samples/STAR/SJ_out_1pass/

3.3.2 Merging and filtering splice junctions

nohup Rscript \
/home/rreynolds/packages/RNAseqProcessing/alignment/STAR_splice_junction_merge.R \
/data/RNAseq_PD/tissue_polyA_samples/STAR/SJ_out_1pass \
-o /data/RNAseq_PD/tissue_polyA_samples/STAR/ \
&>/home/rreynolds/projects/Aim2_PDsequencing_wd/LBD-seq-bulk-analyses/nohup_logs/PD_tissue_polyA_SJ_filtering.log&

3.3.3 STAR 2nd pass

# STAR 2nd pass
nohup Rscript \
/home/rreynolds/packages/RNAseqProcessing/alignment/STAR_alignment_withReadGroups_multi2pass.R \
/data/RNAseq_PD/tissue_polyA_samples/QC/fastp \
/data/STAR_data/genome_index_hg38_ens_v97/sjdbOverhang_99 \
/data/RNAseq_PD/tissue_polyA_samples/STAR \
--sample_prefix=NM...._ \
--sample_suffix=_S.* \
--read_groups=/home/rreynolds/projects/Aim2_PDsequencing_wd/raw_data/sample_details/Flowcell_info.txt \
--sj_file=/data/RNAseq_PD/tissue_polyA_samples/STAR/merged_junctions.SJ.out.tab \
&>/home/rreynolds/projects/Aim2_PDsequencing_wd/LBD-seq-bulk-analyses/nohup_logs/PD_tissue_polyA_STAR_2pass.log&

3.4 Post-alignment QC

  • Script includes sorting and indexing of .bam files with samtools, as most downstream programs require sorted and indexed bam files.
# Post-alignment QC script
nohup Rscript \
/home/rreynolds/packages/RNAseqProcessing/QC/post_alignment_QC_RSeQC.R \
/data/RNAseq_PD/tissue_polyA_samples/STAR \
/data/RNAseq_PD/tissue_polyA_samples/QC/ \
/data/references/ensembl/bed/v97/ensembl_GRCh38_v97.bed \
100 \
--sample_suffix=_Aligned.sortedByCoord.out.bam \
&>/home/rreynolds/projects/Aim2_PDsequencing_wd/LBD-seq-bulk-analyses/nohup_logs/PD_tissue_polyA_postalignment_QC.log&

# MultiQC for full report
multiqc /data/RNAseq_PD/tissue_polyA_samples/ \
-o /data/RNAseq_PD/tissue_polyA_samples/QC/multiqc/ \
--ignore /data/RNAseq_PD/tissue_polyA_samples/raw_data/ \
-n PD_tissue_polyA_full_report

3.4.1 Results

  • QC report: download multiqc report
  • In general, QC looks fine.
    • Observe a 3'-bias over gene body coverage, which is unsurprising given library construction protocol. Important to use count models that account for this (e.g. Salmon).
    • Interesting to note the junction saturation profile for most samples -- novel junctions do not appear to saturate at 100% of the reads. Also, increasing the number of reads disproportionately affects the number of observed novel junctions vs observed known junctions i.e. the rate of increase in the number observed is higher for novel junctions than known junctions.

4 Gene-level quantification

  • Salmon used as:
    1. It is faster (due to quasi-mapping)
    2. It corrects for a lot of biases in data.
    3. It's output can be combined with DESeq2.
  • For details of workflow, see: Salmon_quantification.html
  • Note: For all downstream analyses (gene-level and splicing), ENCODE blacklist regions were removed from the analysis. It is worth noting that these blacklisted regions do overlap some known PD-associated genes. For details, see: PD_overlap_with_blacklist.html

4.1 Post-quantification QC

  • As might be expected, distribution of counts (log2 counts per million, log2CPM) demonstrates that samples with a median log2CPM lower than the median across all samples typically also have a lower RIN.
  • All samples' gene expression of XIST and DDX3Y agree with sex assigned in sample information.
  • Using uncorrected counts (i.e. not corrected by sex, RIN, etc.), samples appear to be clustering by RIN. RIN is also highly correlated to PC1, while PC2 is correlated with AoD. Based on these analyses, have decided to correct for RIN, sex and AoD in downstream analyses (in addition to cell proportions, once these are estimated).
  • For details, see: Post_quantification_QC.html

5 Deconvolution

6 Differential gene expression analyses with DESeq2

  • DESeq2 takes as input un-normalised counts from a variety of sources (e.g. Salmon, HTSeq)(see here)
  • For details of workflow, see: DESeq2_DEGanalysis.html

7 Differential splicing with LeafCutter

  • For details of leafcutter results and pathway analysis, see: Leafcutter.html
  • For details of intron annotation of leafcutter results, see: Leafcutter_intron_annotation.html
  • Also used STAR junction output (i.e. SJ.out.tab files) to determine proportions of annotated/partially annotated/unannotated junctions across groups. This analysis used all junctions, regardless of whether they were found to be differentially spliced across groups. For details, see: STAR_intron_annotation.html

7.1 Enrichment of U2/U12-type introns

  • Used introns/clusters derived from Leafcutter to interrogate enrichment of U2/U12-type introns. For details, see: Leafcutter_spliceosome.html

7.2 RNA-binding protein motif enrichment

8 Replication with external dataset

  • Replication with recount study, SRP058181.
  • For further details and overview, see: SRP058181.html

9 Integration with GWAS

10 Scripts and logs

  • Scripts used in the pre-processing of this data are available in RNASeqProcessing.
  • Additional ad-hoc scripts used within the .html files linked above can be found in the misc_scripts directory of this project. Many processes were run with logs, which can be found here in the nohug_logs directory of this project

11 Manuscript preparation

11.1 Figures and tables

11.2 Data submission

12 Session info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] MarkerGenes_0.0.0.9000      EWCE_0.99.2                
##  [3] rjson_0.2.20                readxl_1.3.1               
##  [5] forcats_0.5.1               stringr_1.4.0              
##  [7] dplyr_1.0.2                 purrr_0.3.4                
##  [9] readr_1.4.0                 tidyr_1.1.1                
## [11] tibble_3.0.3                ggplot2_3.3.2              
## [13] tidyverse_1.3.0             ggsci_2.9                  
## [15] DT_0.15                     devtools_2.3.2             
## [17] usethis_1.6.3               DEGreport_1.22.0           
## [19] DESeq2_1.26.0               SummarizedExperiment_1.16.1
## [21] DelayedArray_0.12.3         BiocParallel_1.20.1        
## [23] matrixStats_0.56.0          Biobase_2.46.0             
## [25] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
## [27] IRanges_2.20.2              S4Vectors_0.24.4           
## [29] BiocGenerics_0.32.0        
## 
## loaded via a namespace (and not attached):
##   [1] tidyselect_1.1.0            RSQLite_2.2.0              
##   [3] AnnotationDbi_1.48.0        htmlwidgets_1.5.3          
##   [5] grid_3.6.1                  munsell_0.5.0              
##   [7] codetools_0.2-16            withr_2.2.0                
##   [9] colorspace_2.0-0            highr_0.8                  
##  [11] knitr_1.29                  rstudioapi_0.13            
##  [13] labeling_0.4.2              lasso2_1.2-20              
##  [15] GenomeInfoDbData_1.2.2      mnormt_2.0.2               
##  [17] farver_2.0.3                bit64_4.0.2                
##  [19] rprojroot_2.0.2             vctrs_0.3.2                
##  [21] generics_0.0.2              xfun_0.16                  
##  [23] BiocFileCache_1.10.2        R6_2.5.0                   
##  [25] doParallel_1.0.15           clue_0.3-57                
##  [27] locfit_1.5-9.4              bitops_1.0-6               
##  [29] cachem_1.0.3                reshape_0.8.8              
##  [31] assertthat_0.2.1            scales_1.1.1               
##  [33] nnet_7.3-12                 gtable_0.3.0               
##  [35] Cairo_1.5-12.2              processx_3.4.5             
##  [37] rlang_0.4.7                 genefilter_1.68.0          
##  [39] GlobalOptions_0.1.2         splines_3.6.1              
##  [41] broom_0.7.0                 checkmate_2.0.0            
##  [43] yaml_2.2.1                  reshape2_1.4.4             
##  [45] modelr_0.1.8                crosstalk_1.1.0.1          
##  [47] backports_1.1.8             Hmisc_4.4-1                
##  [49] tools_3.6.1                 bookdown_0.21              
##  [51] psych_2.0.7                 logging_0.10-108           
##  [53] ellipsis_0.3.1              RColorBrewer_1.1-2         
##  [55] ggdendro_0.1.21             sessioninfo_1.1.1          
##  [57] Rcpp_1.0.5                  plyr_1.8.6                 
##  [59] base64enc_0.1-3             progress_1.2.2             
##  [61] zlibbioc_1.32.0             RCurl_1.98-1.2             
##  [63] ps_1.3.4                    prettyunits_1.1.1          
##  [65] rpart_4.1-15                openssl_1.4.2              
##  [67] GetoptLong_1.0.2            cowplot_1.0.0              
##  [69] haven_2.3.1                 ggrepel_0.8.2              
##  [71] cluster_2.1.0               fs_1.5.0                   
##  [73] here_1.0.0                  magrittr_2.0.1             
##  [75] data.table_1.13.0           circlize_0.4.10            
##  [77] reprex_1.0.0                tmvnsim_1.0-2              
##  [79] pkgload_1.1.0               hms_1.0.0                  
##  [81] mime_0.9                    evaluate_0.14              
##  [83] xtable_1.8-4                XML_3.99-0.3               
##  [85] jpeg_0.1-8.1                gridExtra_2.3              
##  [87] shape_1.4.4                 testthat_2.3.2             
##  [89] compiler_3.6.1              biomaRt_2.42.1             
##  [91] crayon_1.4.1                htmltools_0.5.1.1          
##  [93] Formula_1.2-4               geneplotter_1.64.0         
##  [95] lubridate_1.7.9             DBI_1.1.1                  
##  [97] dbplyr_1.4.4                ComplexHeatmap_2.7.7       
##  [99] MASS_7.3-51.4               rappdirs_0.3.1             
## [101] Matrix_1.2-17               cli_2.2.0.9000             
## [103] pkgconfig_2.0.3             foreign_0.8-72             
## [105] xml2_1.3.2                  foreach_1.5.0              
## [107] annotate_1.64.0             XVector_0.26.0             
## [109] rvest_0.3.6                 callr_3.5.1                
## [111] digest_0.6.27               ConsensusClusterPlus_1.50.0
## [113] rmarkdown_2.5               cellranger_1.1.0           
## [115] HGNChelper_0.8.1            htmlTable_2.0.1            
## [117] edgeR_3.28.1                curl_4.3                   
## [119] lifecycle_0.2.0             nlme_3.1-141               
## [121] jsonlite_1.7.1              desc_1.2.0                 
## [123] askpass_1.1                 limma_3.42.2               
## [125] pillar_1.4.6                lattice_0.20-38            
## [127] Nozzle.R1_1.1-1             fastmap_1.0.1              
## [129] httr_1.4.2                  pkgbuild_1.1.0             
## [131] survival_3.2-7              glue_1.4.2                 
## [133] remotes_2.2.0               png_0.1-7                  
## [135] iterators_1.0.12            bit_4.0.4                  
## [137] stringi_1.5.3               blob_1.2.1                 
## [139] latticeExtra_0.6-29         memoise_2.0.0