Aim: identification of (i) differentially expressed genes and (ii) differentially spliced genes, and their associated molecular pathways, that distinguish between control, PD, PDD and DLB.

1 File paths for workflow

source(here::here("R", "file_paths.R"))

2 Background

2.1 Sample details and demographics

2.1.1 Column descriptions

2.1.2 Table of samples

2.1.3 Correlations between covariates

Using the package DEGreport can easily plot the correlation amongst covariates.

dds <- 
  readRDS(
    file.path(
      path_to_results,
      "Salmon_quant/bulk_tissue/PD_tissue_polyA_DESeqDataSet_design_SexRINAoD.Rds"
    )
  )

DEGreport::degCorCov(colData(dds))

Figure 2.1: Plot of correlations between sample covariates.

As might be expected, alpha-synuclein staging was correlated with disease group. Interestingly, OPC proportions were also correlated with disease group. In addition, proportions of excitatory and inhibitory neurons were correlated.

2.2 Methodological workflow

All steps below, describe the workflow for bulk-tissue RNA-sequencing:

Data pre-processing
- Pre-alignment QC using fastp combined with fastQC.
- Alignment using STAR.
- Post-alignment QC using RNASeqQC.
Gene-level quantification
- Salmon used to quantify gene-level reads.
Estimation of cell-type proportions (deconvolution)
- Estimate cell-type proportions for use in differential expression/splicing models.
- Used Scaden.
Differential gene-level expression
- DESeq2 used for differential expression analyses.
Differential splicing
- Leafcutter used. Defines intron clusters by shared junctions and calculates differential usage of introns within a cluster across case/controls.
Replication with external dataset
- Used PD case/control study available on recount2 (ctrl = 44, case = 29). Study SRP058181 on https://jhubiostatistics.shinyapps.io/recount/.
- Used this dataset for validation of Scaden-predicted cell-type proportions & differential splicing.
Integration with GWAS
- Heritability enrichment analyses using stratified LD score regression.
- Genetic association analyses using H-MAGMA. H-MAGMA advances MAGMA by incorporating chromatin interaction profiles from human brain tissue.
Gene set enrichment
- Used clusterProfiler.

3 Data pre-processing

3.1 File transfer

For details of transfer and appropriate checks refer to: RNAseq_data_transfer.html

3.2 Pre-alignment QC

Fastq files were de-multiplexed by ICR sequencing centre, using bcl2fastq and the following flags.
- --mask-short-adapter-reads 0
- --use-bases-mask Y*,I8Y7,I8,Y*
- --no-lane-splitting
- --output-dir
- --runfolder-dir
- --sample-sheet
See the following files for a nohup log of this provided by the ICR: /data/RNAseq_PD/tissue_polyA_samples/raw_data/bcl2fastq_demultiplex_nohup_1091.e.
Samples were trimmed using fastp (v 0.20.0).

# prefix to file names always NM followed by 4 digits, therefore "NM...._", will remove anything between 'NM' and '_' 
nohup Rscript \
/home/rreynolds/packages/RNAseqProcessing/QC/prealignmentQC_fastp_PEadapters.R \
/data/RNAseq_PD/tissue_polyA_samples/raw_data \
/data/RNAseq_PD/tissue_polyA_samples/QC \
PD_tissue_polyA \
NM...._ \
_.* \
&>/home/rreynolds/projects/Aim2_PDsequencing_wd/LBD-seq-bulk-analyses/nohup_logs/PD_tissue_polyA_fastp.log&

3.2.1 Results

QC report: download multiqc report
Per base sequence content: 24 samples failed and 26 samples with a warning.
- This warning is often triggered due to the biases of random hexamer priming (https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/).
- What is more concerning is that ACGT are not evenly called across read pairs. Instead, G/C appear to be underrepresented/overrepresented depending on which of the two reads in a read pair is being analysed. The NovaSeq is a two-channel system, which is known to overcall Gs (https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/). Speaking to Sabri at the UCL Institute for Cancer Research, the per base sequencing content can have quite varied distribution of nucleotide content in NovaSeq runs and if GC content is normal then there is no cause for concern.

3.3 Alignment

Using STAR (v2.7.0a)
Read groups generated and added to BAMs during STAR alignment to allow for future de-duplication, if necessary. Generation of read group headings requires information regarding flow cell, lane and adaptors for each sample, which can be found within the raw files from the sequencing provider. For this project this is summarised here: download flowcell information
Aligned to genome build GRCh38, and ensembl v97 .gtf file used for annotation.
Multi-sample 2-pass star was run. This is performed in three steps:
1. Run 1st mapping pass for all samples with "usual" parameters.
2. Merge all SJ.out.tab files from all samples and remove duplicated junctions.
3. Run 2nd mapping pass for all samples with "usual" parameters, and the addition of the merged SJ.out.tab file in the --sjdbFileChrStartEnd flag.
Undetermined fastq (i.e. reads that weren't assigned to a sample) not aligned.

3.3.1 STAR 1st pass

# STAR 1st pass
nohup Rscript \
/home/rreynolds/packages/RNAseqProcessing/alignment/STAR_alignment_withReadGroups_multi2pass.R \
/data/RNAseq_PD/tissue_polyA_samples/QC/fastp \
/data/STAR_data/genome_index_hg38_ens_v97/sjdbOverhang_99 \
/data/RNAseq_PD/tissue_polyA_samples/STAR \
--sample_prefix=NM...._ \
--sample_suffix=_S.* \
--read_groups=/home/rreynolds/projects/Aim2_PDsequencing_wd/raw_data/sample_details/Flowcell_info.txt \
&>/home/rreynolds/projects/Aim2_PDsequencing_wd/LBD-seq-bulk-analyses/nohup_logs/PD_tissue_polyA_STAR_1pass.log&

# SJ.out.tab files moved to SJ_out_1pass folder
mv /data/RNAseq_PD/tissue_polyA_samples/STAR/*.SJ.out.tab /data/RNAseq_PD/tissue_polyA_samples/STAR/SJ_out_1pass/

3.3.2 Merging and filtering splice junctions

For details, refer to html report: STAR_splice_junction_filtering.html

nohup Rscript \
/home/rreynolds/packages/RNAseqProcessing/alignment/STAR_splice_junction_merge.R \
/data/RNAseq_PD/tissue_polyA_samples/STAR/SJ_out_1pass \
-o /data/RNAseq_PD/tissue_polyA_samples/STAR/ \
&>/home/rreynolds/projects/Aim2_PDsequencing_wd/LBD-seq-bulk-analyses/nohup_logs/PD_tissue_polyA_SJ_filtering.log&

3.3.3 STAR 2nd pass

# STAR 2nd pass
nohup Rscript \
/home/rreynolds/packages/RNAseqProcessing/alignment/STAR_alignment_withReadGroups_multi2pass.R \
/data/RNAseq_PD/tissue_polyA_samples/QC/fastp \
/data/STAR_data/genome_index_hg38_ens_v97/sjdbOverhang_99 \
/data/RNAseq_PD/tissue_polyA_samples/STAR \
--sample_prefix=NM...._ \
--sample_suffix=_S.* \
--read_groups=/home/rreynolds/projects/Aim2_PDsequencing_wd/raw_data/sample_details/Flowcell_info.txt \
--sj_file=/data/RNAseq_PD/tissue_polyA_samples/STAR/merged_junctions.SJ.out.tab \
&>/home/rreynolds/projects/Aim2_PDsequencing_wd/LBD-seq-bulk-analyses/nohup_logs/PD_tissue_polyA_STAR_2pass.log&

3.4 Post-alignment QC

Script includes sorting and indexing of .bam files with samtools, as most downstream programs require sorted and indexed bam files.

# Post-alignment QC script
nohup Rscript \
/home/rreynolds/packages/RNAseqProcessing/QC/post_alignment_QC_RSeQC.R \
/data/RNAseq_PD/tissue_polyA_samples/STAR \
/data/RNAseq_PD/tissue_polyA_samples/QC/ \
/data/references/ensembl/bed/v97/ensembl_GRCh38_v97.bed \
100 \
--sample_suffix=_Aligned.sortedByCoord.out.bam \
&>/home/rreynolds/projects/Aim2_PDsequencing_wd/LBD-seq-bulk-analyses/nohup_logs/PD_tissue_polyA_postalignment_QC.log&

# MultiQC for full report
multiqc /data/RNAseq_PD/tissue_polyA_samples/ \
-o /data/RNAseq_PD/tissue_polyA_samples/QC/multiqc/ \
--ignore /data/RNAseq_PD/tissue_polyA_samples/raw_data/ \
-n PD_tissue_polyA_full_report

3.4.1 Results

QC report: download multiqc report
In general, QC looks fine.
- Observe a 3'-bias over gene body coverage, which is unsurprising given library construction protocol. Important to use count models that account for this (e.g. Salmon).
- Interesting to note the junction saturation profile for most samples -- novel junctions do not appear to saturate at 100% of the reads. Also, increasing the number of reads disproportionately affects the number of observed novel junctions vs observed known junctions i.e. the rate of increase in the number observed is higher for novel junctions than known junctions.

4 Gene-level quantification

Salmon used as:
1. It is faster (due to quasi-mapping)
2. It corrects for a lot of biases in data.
3. It's output can be combined with DESeq2.
For details of workflow, see: Salmon_quantification.html
Note: For all downstream analyses (gene-level and splicing), ENCODE blacklist regions were removed from the analysis. It is worth noting that these blacklisted regions do overlap some known PD-associated genes. For details, see: PD_overlap_with_blacklist.html

4.1 Post-quantification QC

As might be expected, distribution of counts (log2 counts per million, log2CPM) demonstrates that samples with a median log2CPM lower than the median across all samples typically also have a lower RIN.
All samples' gene expression of XIST and DDX3Y agree with sex assigned in sample information.
Using uncorrected counts (i.e. not corrected by sex, RIN, etc.), samples appear to be clustering by RIN. RIN is also highly correlated to PC1, while PC2 is correlated with AoD. Based on these analyses, have decided to correct for RIN, sex and AoD in downstream analyses (in addition to cell proportions, once these are estimated).
For details, see: Post_quantification_QC.html

5 Deconvolution

Used Scaden.
Summary of workflows:
- deconvolution_scaden.html
- deconvolution_scaden_rerun.html - following decision to cluster endothelial cells and pericytes together into one vascular cell cluster, deconvolution was re-run.
Worth noting that when we correct gene expression for cell-type, disease groups cluster separately on plot of PC1 and PC2 axes. For details, see: Post_correction_PCA.html

6 Differential gene expression analyses with DESeq2

DESeq2 takes as input un-normalised counts from a variety of sources (e.g. Salmon, HTSeq)(see here)
For details of workflow, see: DESeq2_DEGanalysis.html

7 Differential splicing with LeafCutter

For details of leafcutter results and pathway analysis, see: Leafcutter.html
For details of intron annotation of leafcutter results, see: Leafcutter_intron_annotation.html
Also used STAR junction output (i.e. SJ.out.tab files) to determine proportions of annotated/partially annotated/unannotated junctions across groups. This analysis used all junctions, regardless of whether they were found to be differentially spliced across groups. For details, see: STAR_intron_annotation.html

7.1 Enrichment of U2/U12-type introns

Used introns/clusters derived from Leafcutter to interrogate enrichment of U2/U12-type introns. For details, see: Leafcutter_spliceosome.html

7.2 RNA-binding protein motif enrichment

Used introns/clusters derived from Leafcutter to interrogate RNA-binding protein motif enrichment. For details, see: rbp_differential_splicing.html

8 Replication with external dataset

Replication with recount study, SRP058181.
For further details and overview, see: SRP058181.html

9 Integration with GWAS

sLDSC
- GWAS formatted ready for sLDSC analyses. See: FormattingGWASforLDSC.html
- sLDSC applied to cell-type specificity deciles: LDSC_cell_type_specificity_deciles.html
- sLDSC applied to cell-type differential expression: LDSC_cell_type_DEG.html

10 Scripts and logs

Scripts used in the pre-processing of this data are available in RNASeqProcessing.
Additional ad-hoc scripts used within the .html files linked above can be found in the misc_scripts directory of this project. Many processes were run with logs, which can be found here in the nohug_logs directory of this project

11 Manuscript preparation

11.1 Figures and tables

The workflow used to produce figures and tables for the manuscript is available here: manuscript_figures.html

11.2 Data submission

For details of EGA submission, refer to: RNAseq_EGA_submission.html

12 Session info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] MarkerGenes_0.0.0.9000      EWCE_0.99.2                
##  [3] rjson_0.2.20                readxl_1.3.1               
##  [5] forcats_0.5.1               stringr_1.4.0              
##  [7] dplyr_1.0.2                 purrr_0.3.4                
##  [9] readr_1.4.0                 tidyr_1.1.1                
## [11] tibble_3.0.3                ggplot2_3.3.2              
## [13] tidyverse_1.3.0             ggsci_2.9                  
## [15] DT_0.15                     devtools_2.3.2             
## [17] usethis_1.6.3               DEGreport_1.22.0           
## [19] DESeq2_1.26.0               SummarizedExperiment_1.16.1
## [21] DelayedArray_0.12.3         BiocParallel_1.20.1        
## [23] matrixStats_0.56.0          Biobase_2.46.0             
## [25] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
## [27] IRanges_2.20.2              S4Vectors_0.24.4           
## [29] BiocGenerics_0.32.0        
## 
## loaded via a namespace (and not attached):
##   [1] tidyselect_1.1.0            RSQLite_2.2.0              
##   [3] AnnotationDbi_1.48.0        htmlwidgets_1.5.3          
##   [5] grid_3.6.1                  munsell_0.5.0              
##   [7] codetools_0.2-16            withr_2.2.0                
##   [9] colorspace_2.0-0            highr_0.8                  
##  [11] knitr_1.29                  rstudioapi_0.13            
##  [13] labeling_0.4.2              lasso2_1.2-20              
##  [15] GenomeInfoDbData_1.2.2      mnormt_2.0.2               
##  [17] farver_2.0.3                bit64_4.0.2                
##  [19] rprojroot_2.0.2             vctrs_0.3.2                
##  [21] generics_0.0.2              xfun_0.16                  
##  [23] BiocFileCache_1.10.2        R6_2.5.0                   
##  [25] doParallel_1.0.15           clue_0.3-57                
##  [27] locfit_1.5-9.4              bitops_1.0-6               
##  [29] cachem_1.0.3                reshape_0.8.8              
##  [31] assertthat_0.2.1            scales_1.1.1               
##  [33] nnet_7.3-12                 gtable_0.3.0               
##  [35] Cairo_1.5-12.2              processx_3.4.5             
##  [37] rlang_0.4.7                 genefilter_1.68.0          
##  [39] GlobalOptions_0.1.2         splines_3.6.1              
##  [41] broom_0.7.0                 checkmate_2.0.0            
##  [43] yaml_2.2.1                  reshape2_1.4.4             
##  [45] modelr_0.1.8                crosstalk_1.1.0.1          
##  [47] backports_1.1.8             Hmisc_4.4-1                
##  [49] tools_3.6.1                 bookdown_0.21              
##  [51] psych_2.0.7                 logging_0.10-108           
##  [53] ellipsis_0.3.1              RColorBrewer_1.1-2         
##  [55] ggdendro_0.1.21             sessioninfo_1.1.1          
##  [57] Rcpp_1.0.5                  plyr_1.8.6                 
##  [59] base64enc_0.1-3             progress_1.2.2             
##  [61] zlibbioc_1.32.0             RCurl_1.98-1.2             
##  [63] ps_1.3.4                    prettyunits_1.1.1          
##  [65] rpart_4.1-15                openssl_1.4.2              
##  [67] GetoptLong_1.0.2            cowplot_1.0.0              
##  [69] haven_2.3.1                 ggrepel_0.8.2              
##  [71] cluster_2.1.0               fs_1.5.0                   
##  [73] here_1.0.0                  magrittr_2.0.1             
##  [75] data.table_1.13.0           circlize_0.4.10            
##  [77] reprex_1.0.0                tmvnsim_1.0-2              
##  [79] pkgload_1.1.0               hms_1.0.0                  
##  [81] mime_0.9                    evaluate_0.14              
##  [83] xtable_1.8-4                XML_3.99-0.3               
##  [85] jpeg_0.1-8.1                gridExtra_2.3              
##  [87] shape_1.4.4                 testthat_2.3.2             
##  [89] compiler_3.6.1              biomaRt_2.42.1             
##  [91] crayon_1.4.1                htmltools_0.5.1.1          
##  [93] Formula_1.2-4               geneplotter_1.64.0         
##  [95] lubridate_1.7.9             DBI_1.1.1                  
##  [97] dbplyr_1.4.4                ComplexHeatmap_2.7.7       
##  [99] MASS_7.3-51.4               rappdirs_0.3.1             
## [101] Matrix_1.2-17               cli_2.2.0.9000             
## [103] pkgconfig_2.0.3             foreign_0.8-72             
## [105] xml2_1.3.2                  foreach_1.5.0              
## [107] annotate_1.64.0             XVector_0.26.0             
## [109] rvest_0.3.6                 callr_3.5.1                
## [111] digest_0.6.27               ConsensusClusterPlus_1.50.0
## [113] rmarkdown_2.5               cellranger_1.1.0           
## [115] HGNChelper_0.8.1            htmlTable_2.0.1            
## [117] edgeR_3.28.1                curl_4.3                   
## [119] lifecycle_0.2.0             nlme_3.1-141               
## [121] jsonlite_1.7.1              desc_1.2.0                 
## [123] askpass_1.1                 limma_3.42.2               
## [125] pillar_1.4.6                lattice_0.20-38            
## [127] Nozzle.R1_1.1-1             fastmap_1.0.1              
## [129] httr_1.4.2                  pkgbuild_1.1.0             
## [131] survival_3.2-7              glue_1.4.2                 
## [133] remotes_2.2.0               png_0.1-7                  
## [135] iterators_1.0.12            bit_4.0.4                  
## [137] stringi_1.5.3               blob_1.2.1                 
## [139] latticeExtra_0.6-29         memoise_2.0.0

Overview for bulk tissue samples

Regina H. Reynolds

1 File paths for workflow

2 Background

2.1 Sample details and demographics

2.1.1 Column descriptions

2.1.2 Table of samples

2.1.3 Correlations between covariates

2.2 Methodological workflow

3 Data pre-processing

3.1 File transfer

3.2 Pre-alignment QC

3.2.1 Results

3.3 Alignment

3.3.1 STAR 1st pass

3.3.2 Merging and filtering splice junctions

3.3.3 STAR 2nd pass

3.4 Post-alignment QC

3.4.1 Results

4 Gene-level quantification

4.1 Post-quantification QC

5 Deconvolution

6 Differential gene expression analyses with DESeq2

7 Differential splicing with LeafCutter

7.1 Enrichment of U2/U12-type introns

7.2 RNA-binding protein motif enrichment

8 Replication with external dataset

9 Integration with GWAS

10 Scripts and logs

11 Manuscript preparation

11.1 Figures and tables

11.2 Data submission

12 Session info