Aim: to quantify reads across aligned samples.

1 Background

  • Quantified reads will be used for:
    1. Other QC, including:
      1. Sex checks
      2. PCA analysis for outliers
    2. Deconvolution
    3. Gene-level differential expression analyses
  • Salmon is a better choice re. quantification (compared to HTSeq-count traditionally used in the DESeq2 package) as it is faster (due to quasi-mapping) and corrects for a lot of biases in data, including: - Sequence-specific biases - GC-biases - Positional biases

2 Running Salmon

  • Using v0.14.1
  • While Salmon can be run using previously aligned .bam files, it requires that these were aligned directly to the transcriptome rather than to the genome. There are ways of converting genome-aligned .bam files; however, given the speed of Salmon mapping, opted to perform quasi-mapping and quantification (otherwise known as mapping-based mode in Salmon documentation - https://salmon.readthedocs.io/en/latest/index.html).
  • This can be performed in two steps:
    1. Create a salmon index for the transcriptome
    2. Quantification

2.1 Creating a salmon index

To create a decoy-aware salmon index requires: - genome fasta - transcriptome fasta - annotation file (GTF) A decoy-aware transcriptome is recomended for selective alignment in Salmon from release 0.13.0 onwards. When a sequenced fragment originates from an unannotated genomic locus bearing sequence-similarity to an annotated transcript, it can be falsely mapped to the annotated transcript since the relevant genomic sequence is not available to the method. Using of MashMap, such sequence-similar decoy regions can be identified and extracted from the genome. The normal Salmon index is then augmented with these decoy sequences, which are handled in a special manner during mapping and alignment scoring, leading to a reduction in false mappings of such a nature.

  1. Transcriptome fasta was created by concatenating the cDNA and ncRNA fastas from ensembl v97.
cd /data/references/fasta/transcriptome/

wget ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz

# Renamed to include version number
gunzip Homo_sapiens.GRCh38.97.cdna.all.fa.gz
gunzip Homo_sapiens.GRCh38.97.ncrna.fa.gz

# Concatenated to one file
cat Homo_sapiens.GRCh38.97.cdna.all.fa Homo_sapiens.GRCh38.97.ncrna.fa > Homo_sapiens.GRCh38.97.cdna.all.ncrna.fa

# Used wc -l on all three files to check number of lines in concatenate file = number of lines in cdna + ncrna
  1. Creating a decoy transcriptome file.
cd /tools/salmon/salmonReferences/ensembl_v97

bash /tools/salmon/SalmonTools/scripts/generateDecoyTranscriptome.sh \
-j 10 \ 
-b /tools/bedtools2/bin/bedtools \
-m /tools/MashMap/MashMap-2.0/mashmap \
-a /data/references/ensembl/gtf_gff3/v97/Homo_sapiens.GRCh38.97.gtf \
-g /data/references/fasta/Homo_sapiens.GRCh38.97.dna.primary_assembly.fa \
-t /data/references/fasta/transcriptome/Homo_sapiens.GRCh38.97.cdna.all.ncrna.fa \
-o /tools/salmon/salmonReferences/ensembl_v97/
  1. Generating the decoy-aware salmon index.
/tools/salmon/salmon-latest_linux_x86_64/bin/salmon index \
--transcripts /tools/salmon/salmonReferences/ensembl_v97/gentrome.fa \
--index /tools/salmon/salmonReferences/ensembl_v97/Homo_sapiens.GRCh38.97.cdna.all.ncrna_index \
--decoys /tools/salmon/salmonReferences/ensembl_v97/decoys.txt \
--kmerLen 31 # Recommended for > 75 bp reads

2.2 Quantification

2.2.1 Determining the library type

  • Should be ISF, based on the TruSeq protocol, which is often refered to as a FR Second Strand protocol.
  • To ensure this is the case, used the infer_experiment.py script from RSeQC to determine strandedness. Used the following command on the 5 control samples:
/tools/RSeQC-2.6.4/scripts/infer_experiment.py -i PDC87_A1B4_GM-T_Aligned.sortedBysamtools.out.bam -r /data/references/ensembl/bed/v97/ensembl_GRCh38_v97.bed -s 10000000

2.2.2 Creating a gene map

  • For gene quantification summarised at gene level, Salmon requires a transcript-to-gene map.
  • Tried using .gtf file, however, "transcript_id" does not include version number, which is provided in a separate entry ("transcript_version"). As Salmon uses transcript IDs that include the version number, needed to construct a simple .txt file with transcript ids (in the format ENST000000.x) and their corresponding gene IDs (in the format ENS0000000).
# Load .gtf and filter for unique transcript ids
transcripts <- 
  readGFF(filepath = "/data/references/ensembl/gtf_gff3/v97/Homo_sapiens.GRCh38.97.gtf") %>% 
  dplyr::filter(!is.na(transcript_id) & !is.na(transcript_version)) %>% 
  tidyr::unite("combined_transcript_id", 
               c("transcript_id", "transcript_version"), 
               sep = ".", 
               remove = FALSE) %>% 
  dplyr::distinct(combined_transcript_id, .keep_all = TRUE)

# Create map and save
map <- transcripts %>% 
  dplyr::select(combined_transcript_id, gene_id)

write_delim(map, 
            path = "/tools/salmon/salmonReferences/ensembl_v97/transcript_to_gene_map.txt", 
            delim = "\t", 
            col_names = FALSE)

2.2.3 Running Salmon quantification

nohup Rscript \
/home/rreynolds/packages/RNAseqProcessing/quantification/quantification_Salmon.R \
/data/RNAseq_PD/tissue_polyA_samples/QC/fastp \
/tools/salmon/salmonReferences/ensembl_v97/Homo_sapiens.GRCh38.97.cdna.all.ncrna_index \
/data/RNAseq_PD/tissue_polyA_samples/salmon_quant \
--sample_prefix=NM...._ \
--sample_suffix=_S.* \
--library_type=ISF \
&>/home/rreynolds/projects/Aim2_PDsequencing_wd/LBD-seq-bulk-analyses/nohup_logs/PD_tissue_polyA_salmon_quant.log&

3 Session Info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] rtracklayer_1.46.0   GenomicRanges_1.38.0 GenomeInfoDb_1.22.1 
##  [4] IRanges_2.20.2       S4Vectors_0.24.4     BiocGenerics_0.32.0 
##  [7] forcats_0.5.1        stringr_1.4.0        dplyr_1.0.2         
## [10] purrr_0.3.4          readr_1.4.0          tidyr_1.1.1         
## [13] tibble_3.0.3         ggplot2_3.3.2        tidyverse_1.3.0     
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.46.0              httr_1.4.2                 
##  [3] jsonlite_1.7.1              modelr_0.1.8               
##  [5] assertthat_0.2.1            blob_1.2.1                 
##  [7] GenomeInfoDbData_1.2.2      cellranger_1.1.0           
##  [9] Rsamtools_2.2.3             yaml_2.2.1                 
## [11] pillar_1.4.6                backports_1.1.8            
## [13] lattice_0.20-38             glue_1.4.2                 
## [15] digest_0.6.27               XVector_0.26.0             
## [17] rvest_0.3.6                 colorspace_2.0-0           
## [19] htmltools_0.5.1.1           Matrix_1.2-17              
## [21] XML_3.99-0.3                pkgconfig_2.0.3            
## [23] broom_0.7.0                 haven_2.3.1                
## [25] bookdown_0.21               zlibbioc_1.32.0            
## [27] scales_1.1.1                BiocParallel_1.20.1        
## [29] generics_0.0.2              ellipsis_0.3.1             
## [31] withr_2.2.0                 SummarizedExperiment_1.16.1
## [33] cli_2.2.0.9000              magrittr_2.0.1             
## [35] crayon_1.4.1                readxl_1.3.1               
## [37] evaluate_0.14               fs_1.5.0                   
## [39] xml2_1.3.2                  tools_3.6.1                
## [41] hms_1.0.0                   matrixStats_0.56.0         
## [43] lifecycle_0.2.0             munsell_0.5.0              
## [45] reprex_1.0.0                DelayedArray_0.12.3        
## [47] Biostrings_2.54.0           compiler_3.6.1             
## [49] rlang_0.4.7                 grid_3.6.1                 
## [51] RCurl_1.98-1.2              rstudioapi_0.13            
## [53] bitops_1.0-6                rmarkdown_2.5              
## [55] gtable_0.3.0                DBI_1.1.1                  
## [57] R6_2.5.0                    GenomicAlignments_1.22.1   
## [59] lubridate_1.7.9             knitr_1.29                 
## [61] stringi_1.5.3               Rcpp_1.0.5                 
## [63] vctrs_0.3.2                 dbplyr_1.4.4               
## [65] tidyselect_1.1.0            xfun_0.16