Aim: transfer of data to EGA
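library(tidyverse) # provides %>%, tibble(), case_when(), str_*() and read_lines() used below
# file_paths.R is expected to define path_to_wd, path_to_bulk_seq_data and path_to_results
source(here::here("R", "file_paths.R"))
path_for_tables <- file.path(path_to_wd, "paper_draft/tables")
path_to_encrypted_files <- file.path(path_to_bulk_seq_data, "QC/fastp_encrypted")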
We will be uploading fastqs (https://ega-archive.org/submission/sequence) that have had adapters removed and have been through QC (as per this script).
Requirements include:
cd /home/rreynolds/tools/
wget https://ega-archive.org/files/EgaCryptor.zip
unzip EgaCryptor.zip
cd /data/RNAseq_PD/tissue_polyA_samples/QC
mkdir fastp_encrypted
cp /data/RNAseq_PD/tissue_polyA_samples/QC/fastp/NM*.fastq.gz /data/RNAseq_PD/tissue_polyA_samples/QC/fastp_encrypted/
# -i flag specifies input, while -o flag specifies output folder
java -jar /home/rreynolds/tools/EGA-Cryptor-2.0.0/ega-cryptor-2.0.0.jar -i /data/RNAseq_PD/tissue_polyA_samples/QC/fastp_encrypted/ -o /data/RNAseq_PD/tissue_polyA_samples/QC/fastp_encrypted/
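Before uploading, it may be worth confirming that the encryption step produced everything expected. The sketch below assumes that EGACryptor writes three companion files next to each input (.gpg, .md5 and .gpg.md5, the file types read back later in this document), that the unencrypted copies are still in the folder, and that each .md5 file starts with the hex digest of the unencrypted file. check_encrypted is a hypothetical helper name, not part of EGACryptor.

```shell
# Hypothetical helper: report FASTQ files in a folder whose EGACryptor
# companion files are missing or whose unencrypted MD5 does not match.
check_encrypted() {
  local dir="$1" status=0 fq ext expected actual
  for fq in "$dir"/*.fastq.gz; do
    [ -e "$fq" ] || continue              # glob matched nothing
    for ext in gpg md5 gpg.md5; do
      if [ ! -s "$fq.$ext" ]; then        # companion absent or empty
        echo "MISSING $(basename "$fq").$ext"
        status=1
      fi
    done
    if [ -s "$fq.md5" ]; then
      # Assumption: the first token of the .md5 file is the digest
      expected=$(awk 'NR == 1 { print tolower($1) }' "$fq.md5")
      actual=$(md5sum "$fq" | awk '{ print $1 }')
      if [ "$expected" != "$actual" ]; then
        echo "MISMATCH $(basename "$fq")"
        status=1
      fi
    fi
  done
  return "$status"
}
```

Calling check_encrypted /data/RNAseq_PD/tissue_polyA_samples/QC/fastp_encrypted prints one line per problem and returns a non-zero status if any were found, so it can gate the upload step.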
Passive mode has to be enabled for the ftp connection using the -p flag (https://stackoverflow.com/questions/19516263/200-port-command-successful-consider-using-pasv-425-failed-to-establish-connec).
# Enable passive mode
ftp -p ftp.ega.ebi.ac.uk
Enter your submission username and submission password. Type binary to enter binary mode for transfer (to see a list of available ftp commands, type help). Type prompt to switch off confirmation for each file uploaded. Use the mput command to upload files.
binary
prompt
mput *
bye
The bye command above exits the ftp client.
Followed all instructions here: https://ega-archive.org/submission/tools/submitter-portal.
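Samples must be registered with the following information: title, alias, description, subjectId, bioSample, caseOrControl, gender, organismPart, cellLine, region and phenotype.
Fields marked with an asterisk must be filled in. These fields can be populated using a .csv, so we used the sample info we have to create this .csv.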
sample_info <-
  readr::read_csv(
    file.path(path_for_tables, "sample_info_bulk_seq_metrics.csv")
  )
ega_sample_info <-
  sample_info %>%
  dplyr::filter(sent_to_bulk_seq == "yes") %>%
  dplyr::mutate(
    title = "",
    alias = sample_id,
    description = "",
    subjectId = sample_id,
    bioSample = "",
    caseOrControl =
      case_when(
        disease_group == "Control" ~ "control",
        TRUE ~ "case"
      ),
    gender =
      case_when(
        Sex == "F" ~ "female",
        Sex == "M" ~ "male"
      ),
    organismPart = "anterior cingulate cortex",
    cellLine = "",
    region = "",
    phenotype =
      case_when(
        disease_group == "PD" ~ "Parkinson's disease",
        disease_group == "PDD" ~ "Parkinson's disease with dementia",
        disease_group == "DLB" ~ "Dementia with Lewy bodies",
        TRUE ~ disease_group
      )
  ) %>%
  dplyr::select(
    title, alias, description, subjectId, bioSample, caseOrControl, gender, organismPart, cellLine, region, phenotype
  )
ega_sample_info
readr::write_csv(
  ega_sample_info,
  file.path(path_to_results, "ega_sample_info.csv")
)
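Samples and files have to be linked. Fields can be populated using a .csv with the following column names: Sample alias, First Fastq File, First Checksum, First Unencrypted checksum, Second Fastq File, Second Checksum and Second Unencrypted checksum.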
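# One row per file in the encrypted folder, reshaped to one row per sample with
# a column for each file type/read combination (e.g. gz.gpg_R1, gz.md5_R2)
file_df <-
  tibble(
    file_name = list.files(path_to_encrypted_files),
    # Everything after ".fastq." in the file name, e.g. gz.gpg or gz.gpg.md5
    file_type = list.files(path_to_encrypted_files) %>%
      str_remove(".*.fastq."),
    # Note that file names containing "R3" are labelled as read "R2"
    read = case_when(
      str_detect(list.files(path_to_encrypted_files), "R1") ~ "R1",
      str_detect(list.files(path_to_encrypted_files), "R3") ~ "R2"
    ),
    sample_id = list.files(path_to_encrypted_files) %>%
      str_remove("NM...._") %>%
      str_remove("_.*")
  ) %>%
  tidyr::pivot_wider(
    names_from = c(file_type, read),
    values_from = file_name
  )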
sample_file_link <-
  tibble(.rows = nrow(file_df)) %>%
  dplyr::mutate(
    `Sample alias` = file_df$sample_id,
    `First Fastq File` = file_df$gz.gpg_R1,
    `First Checksum` = "",
    `First Unencrypted checksum` = "",
    `Second Fastq File` = file_df$gz.gpg_R2,
    `Second Checksum` = "",
    `Second Unencrypted checksum` = ""
  )
# Fill in the checksums row by row from each sample's .md5 files.
# seq_len() avoids iterating over c(1, 0) if file_df has no rows.
for (i in seq_len(nrow(file_df))) {
  sample_file_link[i, ] <-
    sample_file_link %>%
    dplyr::slice(i) %>%
    dplyr::mutate(
      `First Checksum` =
        read_lines(
          file.path(path_to_encrypted_files, file_df$gz.gpg.md5_R1[i])
        ),
      `First Unencrypted checksum` =
        read_lines(
          file.path(path_to_encrypted_files, file_df$gz.md5_R1[i])
        ),
      `Second Checksum` =
        read_lines(
          file.path(path_to_encrypted_files, file_df$gz.gpg.md5_R2[i])
        ),
      `Second Unencrypted checksum` =
        read_lines(
          file.path(path_to_encrypted_files, file_df$gz.md5_R2[i])
        )
    )
}
sample_file_link
readr::write_csv(
  sample_file_link,
  file.path(path_to_results, "ega_sample_file_linkage.csv")
)
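Because the checksum columns start out as empty strings and are filled in row by row, a quick check of the written .csv can catch rows that were left empty or picked up something other than a digest. The sketch below assumes the column order used above (checksums in columns 3, 4, 6 and 7), that no field contains a comma or quote, and that the checksums are MD5 digests, i.e. 32 hexadecimal characters. check_linkage_csv is a hypothetical helper name.

```shell
# Hypothetical helper: flag rows of the linkage .csv whose checksum columns
# are not 32-character hexadecimal strings (the length of an MD5 digest).
check_linkage_csv() {
  awk -F',' '
    NR == 1 { next }                      # skip the header row
    {
      for (col = 3; col <= 7; col++) {
        if (col == 5) continue            # column 5 is the second file name
        if ($col !~ /^[0-9A-Fa-f]+$/ || length($col) != 32) {
          printf "row %d, column %d: unexpected checksum \"%s\"\n", NR - 1, col, $col
          bad = 1
        }
      }
    }
    END { exit bad }
  ' "$1"
}
```

Run against ega_sample_file_linkage.csv in the results folder, it prints nothing and returns 0 when every checksum looks like an MD5 digest.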
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 3.6.1 (2019-07-05)
## os Ubuntu 16.04.6 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_GB.UTF-8
## ctype en_GB.UTF-8
## tz Europe/London
## date 2021-05-14
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [2] CRAN (R 3.6.1)
## backports 1.1.8 2020-06-17 [1] CRAN (R 3.6.1)
## blob 1.2.1 2020-01-20 [1] CRAN (R 3.6.1)
## bookdown 0.21 2020-10-13 [1] CRAN (R 3.6.1)
## broom 0.7.0 2020-07-09 [1] CRAN (R 3.6.1)
## cellranger 1.1.0 2016-07-27 [2] CRAN (R 3.6.1)
## cli 2.2.0.9000 2021-01-22 [1] Github (r-lib/cli@4b41c51)
## colorspace 2.0-0 2020-11-11 [2] CRAN (R 3.6.1)
## crayon 1.4.1 2021-02-08 [2] CRAN (R 3.6.1)
## DBI 1.1.1 2021-01-15 [2] CRAN (R 3.6.1)
## dbplyr 1.4.4 2020-05-27 [1] CRAN (R 3.6.1)
## digest 0.6.27 2020-10-24 [2] CRAN (R 3.6.1)
## dplyr * 1.0.2 2020-08-18 [1] CRAN (R 3.6.1)
## ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.1)
## evaluate 0.14 2019-05-28 [2] CRAN (R 3.6.1)
## forcats * 0.5.1 2021-01-27 [2] CRAN (R 3.6.1)
## fs 1.5.0 2020-07-31 [1] CRAN (R 3.6.1)
## generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.1)
## ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 3.6.1)
## glue 1.4.2 2020-08-27 [1] CRAN (R 3.6.1)
## gtable 0.3.0 2019-03-25 [2] CRAN (R 3.6.1)
## haven 2.3.1 2020-06-01 [1] CRAN (R 3.6.1)
## here 1.0.0 2020-11-15 [1] CRAN (R 3.6.1)
## hms 1.0.0 2021-01-13 [2] CRAN (R 3.6.1)
## htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 3.6.1)
## httr 1.4.2 2020-07-20 [1] CRAN (R 3.6.1)
## jsonlite 1.7.1 2020-09-07 [1] CRAN (R 3.6.1)
## knitr 1.29 2020-06-23 [1] CRAN (R 3.6.1)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.1)
## lubridate 1.7.9 2020-06-08 [1] CRAN (R 3.6.1)
## magrittr 2.0.1 2020-11-17 [2] CRAN (R 3.6.1)
## modelr 0.1.8 2020-05-19 [1] CRAN (R 3.6.1)
## munsell 0.5.0 2018-06-12 [2] CRAN (R 3.6.1)
## pillar 1.4.6 2020-07-10 [1] CRAN (R 3.6.1)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 3.6.1)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.6.1)
## R6 2.5.0 2020-10-28 [2] CRAN (R 3.6.1)
## Rcpp 1.0.5 2020-07-06 [1] CRAN (R 3.6.1)
## readr * 1.4.0 2020-10-05 [2] CRAN (R 3.6.1)
## readxl 1.3.1 2019-03-13 [2] CRAN (R 3.6.1)
## reprex 1.0.0 2021-01-27 [2] CRAN (R 3.6.1)
## rlang 0.4.7 2020-07-09 [1] CRAN (R 3.6.1)
## rmarkdown 2.5 2020-10-21 [1] CRAN (R 3.6.1)
## rprojroot 2.0.2 2020-11-15 [1] CRAN (R 3.6.1)
## rstudioapi 0.13 2020-11-12 [2] CRAN (R 3.6.1)
## rvest 0.3.6 2020-07-25 [1] CRAN (R 3.6.1)
## scales 1.1.1 2020-05-11 [1] CRAN (R 3.6.1)
## sessioninfo * 1.1.1 2018-11-05 [2] CRAN (R 3.6.1)
## stringi 1.5.3 2020-09-09 [2] CRAN (R 3.6.1)
## stringr * 1.4.0 2019-02-10 [2] CRAN (R 3.6.1)
## tibble * 3.0.3 2020-07-10 [1] CRAN (R 3.6.1)
## tidyr * 1.1.1 2020-07-31 [1] CRAN (R 3.6.1)
## tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.1)
## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.6.1)
## vctrs 0.3.2 2020-07-15 [1] CRAN (R 3.6.1)
## withr 2.2.0 2020-04-20 [1] CRAN (R 3.6.1)
## xfun 0.16 2020-07-24 [1] CRAN (R 3.6.1)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 3.6.1)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.1)
##
## [1] /home/rreynolds/R/x86_64-pc-linux-gnu-library/3.6
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library