1 Aims

  1. Transfer of data from hard drive to server.
  2. MD5 checksum to ensure proper transfer.
  3. Check file names to ensure that all samples sent to sequencing have been returned.

2 File paths for workflow

source(here::here("R", "file_paths.R"))

3 Transferring data

  • Used rsync to transfer data, as it allows transfers to be paused and resumed. This was necessary because the transfers took a long time and required the computer to stay on and the hard drive to stay plugged in throughout.
  • Transfer was performed in two chunks:
    1. Transfer of all nuclear samples (~200 GB)
    2. Transfer of all tissue samples (~540 GB)
  • Commands were entered using the MobaXterm local terminal.
  • Details on the use of rsync can be found at: https://ss64.com/bash/rsync.html
# Nuclear samples
rsync -avz --progress -e ssh /drives/d/PR0199/ username@mr_server:/home/rreynolds/data/PD_bulkRNAseq/nuclear_totalRNA_samples/ | tee /home/rhrey/Desktop/20190606_bulkNuc_data_transfer.txt

# Tissue samples 
rsync -avz --progress -e ssh /drives/d/PR0198/ username@mr_server:/home/rreynolds/data/PD_bulkRNAseq/tissue_polyA_samples/ | tee /home/rhrey/Desktop/20190606_tissue_data_transfer.txt
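The pause-and-resume behaviour mentioned above can be sketched as follows. This is a minimal illustration, not the commands that were actually run: the helper names resume_transfer and list_differences are hypothetical, and the flags used (--partial, -n, -c, -i) are standard rsync options. Re-running rsync on the same source and destination skips files that have already arrived, and a checksum-based dry run afterwards lists anything that still differs.

```shell
# Hypothetical helpers, shown as a sketch only.

resume_transfer() {
  local src="$1" dest="$2"
  # Files already present at the destination are skipped on a re-run.
  # --partial keeps a partially transferred file after an interruption,
  # so the data already sent for a large file is not thrown away.
  rsync -av --partial --progress "$src" "$dest"
}

list_differences() {
  local src="$1" dest="$2"
  # -n: dry run, nothing is copied.
  # -c: compare file contents by checksum instead of size and mtime.
  # -i: itemize changes; lines starting with ">f" or "<f" are files
  #     that would still need to be transferred.
  rsync -rnci "$src" "$dest" | grep '^[<>]f' || true
}
```

For a remote destination, the same arguments as in the commands above would apply, for example resume_transfer /drives/d/PR0199/ username@mr_server:/home/rreynolds/data/PD_bulkRNAseq/nuclear_totalRNA_samples/. An empty result from list_differences suggests that source and destination hold the same file contents.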

4 Transferring data 2.0

  • Original files from the hard drive were not trimmed correctly, so de-multiplexing was run again at ICR without trimming enabled.
  • Files were re-downloaded from ftp.icr.ac.uk via FTP.

5 File locations

  • Files were originally transferred to a personal folder, but were later moved to the common data folder: /data/RNAseq_PD/. They can be found within their respective project folders, tissue_polyA_samples/raw_data and nuclear_totalRNA_samples/raw_data.
  • Note that R2 files relate to UMIs, and were therefore moved to a UMI folder within each raw_data folder.
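The separation of R2 files described above could be done along these lines. This is a sketch under assumptions, not a record of what was run: the helper name move_umi_reads is hypothetical, the subfolder is assumed to be called UMI, and R2 files are assumed to carry "_R2_" in their names.

```shell
# Hypothetical helper: move every FASTQ whose name contains "_R2_" into a
# UMI subfolder of the given raw_data directory.
move_umi_reads() {
  local raw_dir="$1"
  mkdir -p "$raw_dir/UMI"
  # -maxdepth 1 restricts the search to raw_dir itself, so files already
  # inside UMI/ are not matched again. mv -n never overwrites an existing
  # file of the same name.
  find "$raw_dir" -maxdepth 1 -type f -name '*_R2_*.fastq.gz' \
    -exec mv -n {} "$raw_dir/UMI/" \;
}
```

It would be called once per project folder, for example move_umi_reads /data/RNAseq_PD/tissue_polyA_samples/raw_data.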

6 MD5 checksum

source(here::here("R", "md5_checksum.R"))

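# List every FASTQ file under the common data folder.
# The dots are escaped and the pattern anchored so that only names ending in ".fastq.gz" match.
file_paths <- list.files(path = "/data/RNAseq_PD", full.names = TRUE, pattern = "\\.fastq\\.gz$", recursive = TRUE)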

original_md5 <- read_delim(file = "/data/RNAseq_PD/nuclear_totalRNA_samples/raw_data/md5sums_davros.txt", delim = " ", col_names = FALSE) %>% 
  dplyr::mutate(original_md5 = X1, file_name = X2) %>% 
  dplyr::select(-X1, -X2) %>% 
  bind_rows(read_delim(file = "/data/RNAseq_PD/tissue_polyA_samples/raw_data/md5sums_davros.txt", delim = " ", col_names = FALSE) %>% 
              dplyr::mutate(original_md5 = X1, file_name = X2) %>% 
              dplyr::select(-X1, -X2)) %>% 
  dplyr::mutate(file_name = str_replace(file_name, ".*/", ""),
                file_name = str_replace(file_name, "/.*/", ""),
                file_name = str_replace(file_name, " ", ""))
  
md5 <- md5_checksum(file_paths, original_md5, column_to_join_by = "file_name")

write_csv(md5, path = "/home/rreynolds/projects/Aim2_PDsequencing_wd/results/md5_check.csv")
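The md5_checksum() function itself is sourced from R/md5_checksum.R and is not shown here. As a rough command-line cross-check of the same idea, the sketch below recomputes checksums with md5sum and compares them to a manifest. It assumes the manifest is in standard md5sum format (hash, then path), that file names contain no spaces, and that the files sit directly in the directory being checked; the helper name check_md5 is hypothetical.

```shell
# Hypothetical helper: verify the files in a directory against an md5sum
# manifest whose recorded paths may differ from the local layout.
check_md5() {
  local manifest="$1" dir="$2"
  # Reduce each manifest entry to "<hash>  <basename>" (dropping any
  # binary-mode "*" marker), then let md5sum recompute and compare.
  # --quiet prints only the files that fail; the exit status is non-zero
  # if any file fails or is missing.
  awk '{ name = $2; sub(/^\*/, "", name);
         n = split(name, parts, "/");
         print $1 "  " parts[n] }' "$manifest" |
    (cd "$dir" && md5sum -c --quiet -)
}
```

For example, check_md5 /data/RNAseq_PD/nuclear_totalRNA_samples/raw_data/md5sums_davros.txt /data/RNAseq_PD/nuclear_totalRNA_samples/raw_data would print nothing and return 0 if every listed file matched.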
  • All files were transferred without evidence of corruption, as shown in the table below.

7 File name check

# Sequencing files
file_paths <- list.files(
  path = file.path(path_to_bulk_seq_data, "QC/fastp"), 
  full.names = TRUE, 
  pattern = "\\.fastq\\.gz$", 
  recursive = TRUE
  )

# Filter out "Undetermined" files and extract unique file names
samples_received <- 
  file_paths %>% 
  .[!str_detect(.,"Undetermined")] %>% 
  str_remove("/.*/") %>% 
  str_remove("-T.*") %>% 
  str_replace("^[:alnum:]*", "") %>% 
  str_remove("^_") %>% 
  unique()

# Does this match sample names sent to sequencing?
samples_sent <-
  read_delim(
    file = file.path(
      path_to_raw_data, "sample_details/SamplesSentToSequencing.txt"), 
    delim = "\t", 
    col_names = FALSE
    ) %>% 
  .[["X1"]]

samples_sent[!str_detect(samples_sent, "BulkNuc")] %in% samples_received
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
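The logical vector above confirms that every sample is present, but it would not name a sample if one were missing. A minimal sketch of a check that prints the missing names is given below. It assumes the received sample names have been written one per line to a plain text file, which is not something the workflow above does; the helper name missing_samples is hypothetical.

```shell
# Hypothetical helper: print sample names from column 1 of a tab-delimited
# sheet that are absent from a list of received names (one per line).
missing_samples() {
  local sent_sheet="$1" received_list="$2"
  # cut -f1 takes the first tab-delimited column.
  # grep -v 'BulkNuc' drops the BulkNuc entries, as in the R check above.
  # grep -vxFf prints lines that match none of the received names exactly
  # (-x whole line, -F fixed strings, -f patterns read from a file).
  cut -f1 "$sent_sheet" | grep -v 'BulkNuc' | grep -vxFf "$received_list"
}
```

Empty output means that every sample in the sheet has a matching received name.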

8 Conclusions

  • Data were transferred with no corrupt files.
  • All samples sent for sequencing were received back from the sequencing centre.