Aim: transfer of data to EGA

1 File paths for workflow

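library(tidyverse) # provides %>%, tibble(), case_when(), str_*() and read_lines() used below

# file_paths.R is expected to define path_to_wd, path_to_bulk_seq_data and path_to_results
source(here::here("R", "file_paths.R"))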

path_for_tables <- file.path(path_to_wd, "paper_draft/tables")
path_to_encrypted_files <- file.path(path_to_bulk_seq_data, "QC/fastp_encrypted")

2 File preparation and upload

2.1 Accepted file formats

We will upload FASTQ files (https://ega-archive.org/submission/sequence) that have had adapters removed and have passed QC (as per this script).

Requirements include:

  • Data files should be de-multiplexed prior to submission so that each run is submitted with files containing data for a single sample only.
  • Quality scores must be in Phred scale. For example, quality scores from early Solexa pipelines must be converted to use this scale. Both ASCII and space-delimited decimal encoding of quality scores are supported. We will automatically detect the Phred quality offset of either 33 or 64.
  • No technical reads (adapters, linkers, barcodes) are allowed.
  • Single reads must be submitted using a single Fastq file and can be submitted with or without read names.
  • Paired reads must be split and submitted using either one or two Fastq files. The read names must have a suffix identifying the first and second read from the pair, for example '/1' and '/2'.
  • The first line for each read must start with '@'.
  • The base calls and quality scores must be separated by a line starting with '+'.
  • The Fastq files must be compressed using gzip or bzip2.
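
Some of these requirements can be checked mechanically before encryption. The following is a minimal sketch and not part of the original workflow: check_fastq_gz is a hypothetical helper name, and it assumes gzip compression and standard four-line FASTQ records. It only tests the '@' and '+' line rules and that each quality string is as long as its sequence; it does not check Phred encoding or read-name suffixes.

```shell
# Hypothetical helper: check basic FASTQ structure of a gzip-compressed file.
# Assumes standard four-line records (header, bases, separator, qualities).
# Returns 0 if every check passes, 1 otherwise.
check_fastq_gz() {
  gzip -cd "$1" | awk '
    BEGIN { bad = 0 }
    NR % 4 == 1 && substr($0, 1, 1) != "@" {
      print "line " NR ": header does not start with @"; bad = 1
    }
    NR % 4 == 2 { n = length($0) }
    NR % 4 == 3 && substr($0, 1, 1) != "+" {
      print "line " NR ": separator does not start with +"; bad = 1
    }
    NR % 4 == 0 && length($0) != n {
      print "line " NR ": quality length differs from sequence length"; bad = 1
    }
    END {
      if (NR == 0 || NR % 4 != 0) {
        print "file is empty or does not contain whole 4-line records"; bad = 1
      }
      exit bad
    }
  '
}
```

The function could be called on each file in turn, for example check_fastq_gz sample_R1.fastq.gz (file name illustrative), and any reported line inspected by hand.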

2.2 Encryption

  1. Download EGACryptor and unzip.
cd /home/rreynolds/tools/
wget https://ega-archive.org/files/EgaCryptor.zip
unzip EgaCryptor.zip
  2. Navigate to the folder containing the files for upload and create a new directory for the encrypted files. Copy the files to be encrypted into the new directory.
cd /data/RNAseq_PD/tissue_polyA_samples/QC
mkdir fastp_encrypted
cp /data/RNAseq_PD/tissue_polyA_samples/QC/fastp/NM*.fastq.gz /data/RNAseq_PD/tissue_polyA_samples/QC/fastp_encrypted/
  3. Encrypt the files with the EGACryptor .jar file (the following command encrypts all files in the specified folder). Note that the paired reads are named R1 and R3, while R2 contains the UMIs; this will need to be stated in a readme.txt (or in the sample metadata).
# -i flag specifies input, while -o flag specifies output folder
java -jar /home/rreynolds/tools/EGA-Cryptor-2.0.0/ega-cryptor-2.0.0.jar -i /data/RNAseq_PD/tissue_polyA_samples/QC/fastp_encrypted/ -o /data/RNAseq_PD/tissue_polyA_samples/QC/fastp_encrypted/
  4. Remove the copied fastq.gz files from the new directory.
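
No command was recorded for the final clean-up step above. One way to do it is sketched here, under the assumption that each encrypted file is named after its input with a .gpg suffix (as the .gz.gpg file names used later in this document suggest). remove_encrypted_copies is a hypothetical helper name; it deletes a copied .fastq.gz only when a non-empty encrypted counterpart sits next to it.

```shell
# Hypothetical helper: delete each copied *.fastq.gz in a directory, but only
# once its encrypted counterpart (<name>.fastq.gz.gpg) is present and non-empty.
remove_encrypted_copies() {
  dir="$1"
  for f in "$dir"/*.fastq.gz; do
    [ -e "$f" ] || continue   # the glob matched nothing
    if [ -s "$f.gpg" ]; then
      rm -- "$f"
    else
      echo "keeping $f: no encrypted file found" >&2
    fi
  done
}

# Example call, using the directory created in the steps above:
# remove_encrypted_copies /data/RNAseq_PD/tissue_polyA_samples/QC/fastp_encrypted
```

Running this only inside fastp_encrypted leaves the original files in the fastp folder untouched.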

2.3 Upload files

  1. Open a terminal and connect to EGA using ftp. It is important to enable passive mode with the -p flag (https://stackoverflow.com/questions/19516263/200-port-command-successful-consider-using-pasv-425-failed-to-establish-connec).
# Enable passive mode with -p
ftp -p ftp.ega.ebi.ac.uk
  2. Enter your submission username and submission password.

  3. Type binary to enter binary mode for transfer (to see a list of available ftp commands, type help). Type prompt to switch off confirmation for each file uploaded. Use the mput command to upload files.

binary
prompt
mput *
  4. Use the bye command to exit the ftp client.

3 Submitting metadata

3.1 Instructions

We followed the instructions at https://ega-archive.org/submission/tools/submitter-portal.

3.2 Sample registration

Samples must be registered with the following information:

  • title
  • alias*
  • description
  • subjectId*
  • bioSample
  • caseOrControl
  • gender*
  • organismPart
  • cellLine
  • region
  • phenotype*

Fields marked with an asterisk are mandatory. All fields can be populated using a .csv, so we used our existing sample information to create it.

sample_info <- 
  readr::read_csv(
    file.path(path_for_tables, "sample_info_bulk_seq_metrics.csv")
  )

ega_sample_info <- 
  sample_info %>% 
  dplyr::filter(sent_to_bulk_seq == "yes") %>% 
  dplyr::mutate(
    title = "",
    alias = sample_id,
    description = "",
    subjectId = sample_id,
    bioSample = "",
    caseOrControl =
      case_when(
        disease_group == "Control" ~ "control",
        TRUE ~ "case"
      ),
    gender = 
      case_when(
        Sex == "F" ~ "female",
        Sex == "M" ~ "male"
      ),
    organismPart = "anterior cingulate cortex",
    cellLine = "",
    region = "",
    phenotype = 
      case_when(
        disease_group == "PD" ~ "Parkinson's disease",
        disease_group == "PDD" ~ "Parkinson's disease with dementia",
        disease_group == "DLB" ~ "Dementia with Lewy bodies",
        TRUE ~ disease_group
      )
  ) %>% 
  dplyr::select(
    title, alias, description, subjectId, bioSample, caseOrControl, gender, organismPart, cellLine, region, phenotype
  )

ega_sample_info
readr::write_csv(
  ega_sample_info,
  file.path(path_to_results, "ega_sample_info.csv")
)

3.3 Sample and file linkage

Samples and files have to be linked. Fields can be populated using a .csv with the following column names:

  • Sample alias
  • First Fastq File
  • First Checksum
  • First Unencrypted checksum
  • Second Fastq File
  • Second Checksum
  • Second Unencrypted checksum

# List the encrypted directory once and reuse the result
encrypted_files <- list.files(path_to_encrypted_files)

file_df <- 
  tibble(
    file_name = encrypted_files,
    # File type = everything after ".fastq." (i.e. gz.gpg, gz.md5, gz.gpg.md5)
    file_type = str_remove(encrypted_files, ".*\\.fastq\\."),
    # R3 holds the second read of each pair (R2 = UMIs), so it is relabelled R2
    read = case_when(
      str_detect(encrypted_files, "R1") ~ "R1",
      str_detect(encrypted_files, "R3") ~ "R2"
    ),
    sample_id = encrypted_files %>% 
      str_remove("NM...._") %>% 
      str_remove("_.*")
  ) %>% 
  # One row per sample, one column per file type and read (e.g. gz.gpg_R1)
  tidyr::pivot_wider(
    names_from = c(file_type, read),
    values_from = file_name
  )

sample_file_link <- 
  tibble(.rows = nrow(file_df)) %>% 
  dplyr::mutate(
    `Sample alias` = file_df$sample_id,
    `First Fastq File` = file_df$gz.gpg_R1, 
    `First Checksum` = "",  
    `First Unencrypted checksum` = "",  
    `Second Fastq File` = file_df$gz.gpg_R2,    
    `Second Checksum` = "", 
    `Second Unencrypted checksum` = ""
  )
  
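# Fill in the checksums by reading each sample's .md5 files;
# seq_len() avoids iterating over c(1, 0) if file_df has no rows
for(i in seq_len(nrow(file_df))){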
 
  sample_file_link[i, ] <-
    sample_file_link %>% 
    dplyr::slice(i) %>% 
    dplyr::mutate(
      `First Checksum` = 
        read_lines(
          file.path(path_to_encrypted_files, file_df$gz.gpg.md5_R1[i])
        ),  
      `First Unencrypted checksum` = 
        read_lines(
          file.path(path_to_encrypted_files, file_df$gz.md5_R1[i])
        ),  
      `Second Checksum` = 
        read_lines(
          file.path(path_to_encrypted_files, file_df$gz.gpg.md5_R2[i])
        ),  
      `Second Unencrypted checksum` = 
        read_lines(
          file.path(path_to_encrypted_files, file_df$gz.md5_R2[i])
        )
    )
  
}

sample_file_link
readr::write_csv(
  sample_file_link,
  file.path(path_to_results, "ega_sample_file_linkage.csv")
)
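
Before relying on the checksums in the linkage table, it may be worth confirming that each stored checksum still matches its encrypted file. This is a minimal sketch and not part of the original workflow: verify_encrypted_md5 is a hypothetical helper name, and it assumes GNU md5sum is available and that each <name>.gpg.md5 file holds the hex digest as its first whitespace-delimited field.

```shell
# Hypothetical helper: compare the stored MD5 of each encrypted file with a
# freshly computed one. Assumes GNU md5sum and that <name>.gpg.md5 holds the
# hex digest as its first whitespace-delimited field.
# Returns 0 if all files match, 1 if any file mismatches or lacks a checksum.
verify_encrypted_md5() {
  dir="$1"
  status=0
  for f in "$dir"/*.gpg; do
    [ -e "$f" ] || continue   # the glob matched nothing
    expected=$(awk 'NR == 1 { print tolower($1) }' "$f.md5" 2> /dev/null)
    actual=$(md5sum "$f" | awk '{ print $1 }')
    if [ "$expected" = "$actual" ]; then
      echo "OK       $f"
    else
      echo "MISMATCH $f"
      status=1
    fi
  done
  return "$status"
}

# Example call, using the encrypted directory from section 2.2:
# verify_encrypted_md5 /data/RNAseq_PD/tissue_polyA_samples/QC/fastp_encrypted
```

Any line reported as MISMATCH would point to a file that changed after encryption or to a missing .gpg.md5 file.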

4 Session info

## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.1 (2019-07-05)
##  os       Ubuntu 16.04.6 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_GB.UTF-8                 
##  ctype    en_GB.UTF-8                 
##  tz       Europe/London               
##  date     2021-05-14                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version    date       lib source                    
##  assertthat    0.2.1      2019-03-21 [2] CRAN (R 3.6.1)            
##  backports     1.1.8      2020-06-17 [1] CRAN (R 3.6.1)            
##  blob          1.2.1      2020-01-20 [1] CRAN (R 3.6.1)            
##  bookdown      0.21       2020-10-13 [1] CRAN (R 3.6.1)            
##  broom         0.7.0      2020-07-09 [1] CRAN (R 3.6.1)            
##  cellranger    1.1.0      2016-07-27 [2] CRAN (R 3.6.1)            
##  cli           2.2.0.9000 2021-01-22 [1] Github (r-lib/cli@4b41c51)
##  colorspace    2.0-0      2020-11-11 [2] CRAN (R 3.6.1)            
##  crayon        1.4.1      2021-02-08 [2] CRAN (R 3.6.1)            
##  DBI           1.1.1      2021-01-15 [2] CRAN (R 3.6.1)            
##  dbplyr        1.4.4      2020-05-27 [1] CRAN (R 3.6.1)            
##  digest        0.6.27     2020-10-24 [2] CRAN (R 3.6.1)            
##  dplyr       * 1.0.2      2020-08-18 [1] CRAN (R 3.6.1)            
##  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 3.6.1)            
##  evaluate      0.14       2019-05-28 [2] CRAN (R 3.6.1)            
##  forcats     * 0.5.1      2021-01-27 [2] CRAN (R 3.6.1)            
##  fs            1.5.0      2020-07-31 [1] CRAN (R 3.6.1)            
##  generics      0.0.2      2018-11-29 [1] CRAN (R 3.6.1)            
##  ggplot2     * 3.3.2      2020-06-19 [1] CRAN (R 3.6.1)            
##  glue          1.4.2      2020-08-27 [1] CRAN (R 3.6.1)            
##  gtable        0.3.0      2019-03-25 [2] CRAN (R 3.6.1)            
##  haven         2.3.1      2020-06-01 [1] CRAN (R 3.6.1)            
##  here          1.0.0      2020-11-15 [1] CRAN (R 3.6.1)            
##  hms           1.0.0      2021-01-13 [2] CRAN (R 3.6.1)            
##  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 3.6.1)            
##  httr          1.4.2      2020-07-20 [1] CRAN (R 3.6.1)            
##  jsonlite      1.7.1      2020-09-07 [1] CRAN (R 3.6.1)            
##  knitr         1.29       2020-06-23 [1] CRAN (R 3.6.1)            
##  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 3.6.1)            
##  lubridate     1.7.9      2020-06-08 [1] CRAN (R 3.6.1)            
##  magrittr      2.0.1      2020-11-17 [2] CRAN (R 3.6.1)            
##  modelr        0.1.8      2020-05-19 [1] CRAN (R 3.6.1)            
##  munsell       0.5.0      2018-06-12 [2] CRAN (R 3.6.1)            
##  pillar        1.4.6      2020-07-10 [1] CRAN (R 3.6.1)            
##  pkgconfig     2.0.3      2019-09-22 [2] CRAN (R 3.6.1)            
##  purrr       * 0.3.4      2020-04-17 [1] CRAN (R 3.6.1)            
##  R6            2.5.0      2020-10-28 [2] CRAN (R 3.6.1)            
##  Rcpp          1.0.5      2020-07-06 [1] CRAN (R 3.6.1)            
##  readr       * 1.4.0      2020-10-05 [2] CRAN (R 3.6.1)            
##  readxl        1.3.1      2019-03-13 [2] CRAN (R 3.6.1)            
##  reprex        1.0.0      2021-01-27 [2] CRAN (R 3.6.1)            
##  rlang         0.4.7      2020-07-09 [1] CRAN (R 3.6.1)            
##  rmarkdown     2.5        2020-10-21 [1] CRAN (R 3.6.1)            
##  rprojroot     2.0.2      2020-11-15 [1] CRAN (R 3.6.1)            
##  rstudioapi    0.13       2020-11-12 [2] CRAN (R 3.6.1)            
##  rvest         0.3.6      2020-07-25 [1] CRAN (R 3.6.1)            
##  scales        1.1.1      2020-05-11 [1] CRAN (R 3.6.1)            
##  sessioninfo * 1.1.1      2018-11-05 [2] CRAN (R 3.6.1)            
##  stringi       1.5.3      2020-09-09 [2] CRAN (R 3.6.1)            
##  stringr     * 1.4.0      2019-02-10 [2] CRAN (R 3.6.1)            
##  tibble      * 3.0.3      2020-07-10 [1] CRAN (R 3.6.1)            
##  tidyr       * 1.1.1      2020-07-31 [1] CRAN (R 3.6.1)            
##  tidyselect    1.1.0      2020-05-11 [1] CRAN (R 3.6.1)            
##  tidyverse   * 1.3.0      2019-11-21 [1] CRAN (R 3.6.1)            
##  vctrs         0.3.2      2020-07-15 [1] CRAN (R 3.6.1)            
##  withr         2.2.0      2020-04-20 [1] CRAN (R 3.6.1)            
##  xfun          0.16       2020-07-24 [1] CRAN (R 3.6.1)            
##  xml2          1.3.2      2020-04-23 [1] CRAN (R 3.6.1)            
##  yaml          2.2.1      2020-02-01 [1] CRAN (R 3.6.1)            
## 
## [1] /home/rreynolds/R/x86_64-pc-linux-gnu-library/3.6
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library