1 Aims

Transfer of data from hard drive to server.
MD5 checksum to ensure proper transfer.
Check file names to ensure that all samples sent to sequencing have been returned.

2 File paths for workflow

source(here::here("R", "file_paths.R"))

3 Transferring data

Used rsync to transfer data, which allows transfers to be paused and resumed (necessary due to time required for transfers, in addition to the necessity of computer being turned on + hard drive being plugged in).
Transfered was performed in chunks, .
1. Transfer of all nuclear samples (~200Gb)
2. Transfer of all tissue samples (~540Gb)
Commands entered using Mobaxterm local terminal.
Details for use of rsync can be found: https://ss64.com/bash/rsync.html

# Nuclear samples
rsync -avz --progress -e ssh /drives/d/PR0199/ username@mr_server:/home/rreynolds/data/PD_bulkRNAseq/nuclear_totalRNA_samples/ | tee /home/rhrey/Desktop/20190606_bulkNuc_data_transfer.txt

# Tissue samples 
rsync -avz --progress -e ssh /drives/d/PR0198/ username@mr_server:/home/rreynolds/data/PD_bulkRNAseq/tissue_polyA_samples/ | tee /home/rhrey/Desktop/20190606_tissue_data_transfer.txt

4 Transferring data 2.0

Original files from hard drive were not trimmed correctly, so de-multiplexing run again without trimming enabled at ICR.
Files were re-downloaded from ftp.icr.ac.uk, using ftp.

5 Files locations

Originally transferred to personal folder, but later moved to common data folder: /data/RNAseq_PD/. Files can be found within their respective project files tissue_polyA_samples/raw_daw and nuclear_totalRNA_samples/raw_daw
Note that R2 files relate to UMIs, therefore were moved to their UMIfolder within the raw_data folders.

6 MD5 checksum

When transferring files, it is important to check that the transferred files match the original files i.e. no files have been corrupted during the file transfer. One way to do this is to perform an md5 checksum (https://www.lifewire.com/validate-md5-checksum-file-4037391).
R has a function that also allows users to check md5 sums for a list of files (https://stat.ethz.ch/R-manual/R-devel/library/tools/html/md5sum.html). This can be adapted within a custom function that will allow comparison of md5 checksums prior to and post transfer.

source(here::here("R", "md5_checksum.R"))

file_paths <- list.files(path = "/data/RNAseq_PD", full.names = TRUE, pattern = ".fastq.gz", recursive = T)

original_md5 <- read_delim(file = "/data/RNAseq_PD/nuclear_totalRNA_samples/raw_data/md5sums_davros.txt", delim = " ", col_names = FALSE) %>% 
  dplyr::mutate(original_md5 = X1, file_name = X2) %>% 
  dplyr::select(-X1, -X2) %>% 
  bind_rows(read_delim(file = "/data/RNAseq_PD/tissue_polyA_samples/raw_data/md5sums_davros.txt", delim = " ", col_names = FALSE) %>% 
              dplyr::mutate(original_md5 = X1, file_name = X2) %>% 
              dplyr::select(-X1, -X2)) %>% 
  dplyr::mutate(file_name = str_replace(file_name, ".*/", ""),
                file_name = str_replace(file_name, "/.*/", ""),
                file_name = str_replace(file_name, " ", ""))
  
md5 <- md5_checksum(file_paths, original_md5, column_to_join_by = "file_name")

write_csv(md5, path = "/home/rreynolds/projects/Aim2_PDsequencing_wd/results/md5_check.csv")

All files were transferred without evidence of corruption, as is visible from table below.

7 File name check

# Sequencing files
file_paths <- list.files(
  path = file.path(path_to_bulk_seq_data, "QC/fastp"), 
  full.names = TRUE, 
  pattern = ".fastq.gz", 
  recursive = T
  )

# Filter out "Undetermined" files and extract unique file names
samples_received <- 
  file_paths %>% 
  .[!str_detect(.,"Undetermined")] %>% 
  str_remove("/.*/") %>% 
  str_remove("-T.*") %>% 
  str_replace("^[:alnum:]*", "") %>% 
  str_remove("^_") %>% 
  unique()

# Does this match sample names sent to sequencing?
samples_sent <-
  read_delim(
    file = file.path(
      path_to_raw_data, "sample_details/SamplesSentToSequencing.txt"), 
    delim = "\t", 
    col_names = FALSE
    ) %>% 
  .[["X1"]]

samples_sent[!str_detect(samples_sent, "BulkNuc")] %in% samples_received

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

8 Conclusions

Data transferred with no corrupt files.
All samples sent for sequencing received from sequencing centre.

Transferring bulk RNA-seq

Regina H. Reynolds

28/6/2019

1 Aims

2 File paths for workflow

3 Transferring data

4 Transferring data 2.0

5 Files locations

6 MD5 checksum

7 File name check

8 Conclusions