Aim: to quantify reads across aligned samples.
Salmon
is a better choice re. quantification (compared to HTSeq-count
traditionally used in the DESeq2
package) as it is faster (due to quasi-mapping) and corrects for a lot of biases in data, including: - Sequence-specific biases - GC-biases - Positional biasesTo create a decoy-aware salmon index requires: - genome fasta - transcriptome fasta - annotation file (GTF) A decoy-aware transcriptome is recomended for selective alignment in Salmon from release 0.13.0 onwards. When a sequenced fragment originates from an unannotated genomic locus bearing sequence-similarity to an annotated transcript, it can be falsely mapped to the annotated transcript since the relevant genomic sequence is not available to the method. Using of MashMap, such sequence-similar decoy regions can be identified and extracted from the genome. The normal Salmon index is then augmented with these decoy sequences, which are handled in a special manner during mapping and alignment scoring, leading to a reduction in false mappings of such a nature.
cd /data/references/fasta/transcriptome/
wget ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
# Renamed to include version number
gunzip Homo_sapiens.GRCh38.97.cdna.all.fa.gz
gunzip Homo_sapiens.GRCh38.97.ncrna.fa.gz
# Concatenated to one file
cat Homo_sapiens.GRCh38.97.cdna.all.fa Homo_sapiens.GRCh38.97.ncrna.fa > Homo_sapiens.GRCh38.97.cdna.all.ncrna.fa
# Used wc -l on all three files to check number of lines in concatenate file = number of lines in cdna + ncrna
cd /tools/salmon/salmonReferences/ensembl_v97
bash /tools/salmon/SalmonTools/scripts/generateDecoyTranscriptome.sh \
-j 10 \
-b /tools/bedtools2/bin/bedtools \
-m /tools/MashMap/MashMap-2.0/mashmap \
-a /data/references/ensembl/gtf_gff3/v97/Homo_sapiens.GRCh38.97.gtf \
-g /data/references/fasta/Homo_sapiens.GRCh38.97.dna.primary_assembly.fa \
-t /data/references/fasta/transcriptome/Homo_sapiens.GRCh38.97.cdna.all.ncrna.fa \
-o /tools/salmon/salmonReferences/ensembl_v97/
/tools/salmon/salmon-latest_linux_x86_64/bin/salmon index \
--transcripts /tools/salmon/salmonReferences/ensembl_v97/gentrome.fa \
--index /tools/salmon/salmonReferences/ensembl_v97/Homo_sapiens.GRCh38.97.cdna.all.ncrna_index \
--decoys /tools/salmon/salmonReferences/ensembl_v97/decoys.txt \
--kmerLen 31 # Recommended for > 75 bp reads
infer_experiment.py
script from RSeQC
to determine strandedness. Used the following command on the 5 control samples:/tools/RSeQC-2.6.4/scripts/infer_experiment.py -i PDC87_A1B4_GM-T_Aligned.sortedBysamtools.out.bam -r /data/references/ensembl/bed/v97/ensembl_GRCh38_v97.bed -s 10000000
# Load .gtf and filter for unique transcript ids
transcripts <-
readGFF(filepath = "/data/references/ensembl/gtf_gff3/v97/Homo_sapiens.GRCh38.97.gtf") %>%
dplyr::filter(!is.na(transcript_id) & !is.na(transcript_version)) %>%
tidyr::unite("combined_transcript_id",
c("transcript_id", "transcript_version"),
sep = ".",
remove = FALSE) %>%
dplyr::distinct(combined_transcript_id, .keep_all = TRUE)
# Create map and save
map <- transcripts %>%
dplyr::select(combined_transcript_id, gene_id)
write_delim(map,
path = "/tools/salmon/salmonReferences/ensembl_v97/transcript_to_gene_map.txt",
delim = "\t",
col_names = FALSE)
nohup Rscript \
/home/rreynolds/packages/RNAseqProcessing/quantification/quantification_Salmon.R \
/data/RNAseq_PD/tissue_polyA_samples/QC/fastp \
/tools/salmon/salmonReferences/ensembl_v97/Homo_sapiens.GRCh38.97.cdna.all.ncrna_index \
/data/RNAseq_PD/tissue_polyA_samples/salmon_quant \
--sample_prefix=NM...._ \
--sample_suffix=_S.* \
--library_type=ISF \
&>/home/rreynolds/projects/Aim2_PDsequencing_wd/LBD-seq-bulk-analyses/nohup_logs/PD_tissue_polyA_salmon_quant.log&
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] rtracklayer_1.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
## [4] IRanges_2.20.2 S4Vectors_0.24.4 BiocGenerics_0.32.0
## [7] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.2
## [10] purrr_0.3.4 readr_1.4.0 tidyr_1.1.1
## [13] tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.46.0 httr_1.4.2
## [3] jsonlite_1.7.1 modelr_0.1.8
## [5] assertthat_0.2.1 blob_1.2.1
## [7] GenomeInfoDbData_1.2.2 cellranger_1.1.0
## [9] Rsamtools_2.2.3 yaml_2.2.1
## [11] pillar_1.4.6 backports_1.1.8
## [13] lattice_0.20-38 glue_1.4.2
## [15] digest_0.6.27 XVector_0.26.0
## [17] rvest_0.3.6 colorspace_2.0-0
## [19] htmltools_0.5.1.1 Matrix_1.2-17
## [21] XML_3.99-0.3 pkgconfig_2.0.3
## [23] broom_0.7.0 haven_2.3.1
## [25] bookdown_0.21 zlibbioc_1.32.0
## [27] scales_1.1.1 BiocParallel_1.20.1
## [29] generics_0.0.2 ellipsis_0.3.1
## [31] withr_2.2.0 SummarizedExperiment_1.16.1
## [33] cli_2.2.0.9000 magrittr_2.0.1
## [35] crayon_1.4.1 readxl_1.3.1
## [37] evaluate_0.14 fs_1.5.0
## [39] xml2_1.3.2 tools_3.6.1
## [41] hms_1.0.0 matrixStats_0.56.0
## [43] lifecycle_0.2.0 munsell_0.5.0
## [45] reprex_1.0.0 DelayedArray_0.12.3
## [47] Biostrings_2.54.0 compiler_3.6.1
## [49] rlang_0.4.7 grid_3.6.1
## [51] RCurl_1.98-1.2 rstudioapi_0.13
## [53] bitops_1.0-6 rmarkdown_2.5
## [55] gtable_0.3.0 DBI_1.1.1
## [57] R6_2.5.0 GenomicAlignments_1.22.1
## [59] lubridate_1.7.9 knitr_1.29
## [61] stringi_1.5.3 Rcpp_1.0.5
## [63] vctrs_0.3.2 dbplyr_1.4.4
## [65] tidyselect_1.1.0 xfun_0.16