Data transfer

Important

In this document, data transfer means downloading files from, or uploading files to, the Discoverer login node, where those files contain data required or generated by the hosted projects.

Important information regarding the file transfer speed

The Discoverer Petascale Supercomputer is connected to the Internet through two independent high-speed fibre optic lines. The theoretical maximum upload/download speed to or from the Discoverer login node is about 5-6 Gbps. The actual speed, however, may fluctuate due to factors beyond the control of the Discoverer network operations team. For instance, if your local Internet connection is heavily used, the file transfers initiated from your computer will slow down. Your transfers may also slow down, even when your local connection is idle, if too many people transfer files to or from the Discoverer login node at the same time.
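
As a rough illustration of what these figures mean in practice, the sketch below estimates the minimum time a transfer would need at a given line rate. The function name estimate_seconds and the sample values are assumptions made for this example only. Real transfers will take longer because of protocol overhead, disk speed, and the network load described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Discoverer tooling): print the
# lower bound, in whole seconds, for moving SIZE_GIB gibibytes over a
# link running at RATE_GBPS gigabits per second.
# $1 = size in GiB, $2 = rate in Gbps
estimate_seconds() {
    awk -v s="$1" -v r="$2" \
        'BEGIN { printf "%.0f\n", (s * 1024 * 1024 * 1024 * 8) / (r * 1000000000) }'
}

# 100 GiB at the 5 Gbps peak quoted above
estimate_seconds 100 5   # prints 172
```

At a tenth of that rate the same data set would need roughly half an hour, which is why it is worth checking the load on your own connection before starting a large transfer.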

Rsync as a preferred tool for file transfer

Note

Rsync is a file transfer manager, native to most Linux distributions, UNIX, and macOS. It was originally developed by the Samba project to facilitate enhanced file/folder transfer based on synchronisation of file lists. Rsync keeps the target up to date by transferring only the files that are new or have changed at the source. It lets you choose whether to use a remote shell, such as SSH, as a pipe for the file transfer. Its --bwlimit option can limit the speed of the file transfers.
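
The speed control mentioned above is the --bwlimit option of rsync. What follows is a minimal sketch, assuming a hypothetical wrapper function named limited_sync and a cap of 10240, which rsync reads in units of 1024 bytes per second when no suffix is given (about 10 MiB/s). The account name and folder in the commented example are placeholders.

```shell
#!/bin/sh
# Hypothetical wrapper: copy SRC to DST, using ssh as the remote shell,
# with a bandwidth cap.
# $1 = cap in KiB/s, $2 = source, $3 = destination
limited_sync() {
    rsync -e "ssh" -a --bwlimit="$1" "$2" "$3"
}

# Example (placeholders: username, folder):
# limited_sync 10240 username@login.discoverer.bg:/discofs/username/folder ~/
```

A cap like this can be useful when the transfer shares a connection with other users or services.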

Installation

Rsync is natively available for Linux, UNIX, and macOS only. If you intend to run Rsync under Microsoft Windows, the easiest way is to install cwRsync:

https://itefix.net/cwrsync

An alternative method to bring Rsync functionality to Windows is to install your favourite Linux distribution on Windows using WSL:

https://learn.microsoft.com/en-us/windows/wsl/install

or VirtualBox:

https://www.virtualbox.org/wiki/Downloads

and then install the rsync package in that Linux distribution.
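
Once the installation is done, a quick check along the following lines can confirm that both rsync and the ssh client are on the PATH. This is only a sketch; the helper name have_tool is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named program is on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report on the two programs the transfers in this document rely on.
for tool in rsync ssh; do
    if have_tool "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing"
    fi
done
```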

Note

In case you do not want to install and use Rsync, try WinSCP:

https://winscp.net/eng/docs/task_synchronize

Preconditions

Since Rsync uses SSH to carry the file transfer to and from the Discoverer login node, a VPN tunnel to the Discoverer IP network has to be set up and running. Users whose computer systems are directly connected to the networks of the Bulgarian academic organizations do not need to establish a VPN tunnel (see doc:ssh_logging_in_bg_acad).

Common cases

  • Content synchronisation

Transferring the content of a folder from the Discoverer login node onto the local disk:

rsync -e "ssh" -varl --progress --append username@login.discoverer.bg:/discofs/username/folder ~/

rsync -e "ssh -p 2222" -varl --progress --append username@login.bg.discoverer.bg:/discofs/username/folder ~/

Transferring the content of a folder from the local disk onto the Discoverer login node:

rsync -e "ssh" -vrtl --progress --append ~/folder username@login.discoverer.bg:/discofs/username/

rsync -e "ssh -p 2222" -vrtl --progress --append ~/folder username@login.bg.discoverer.bg:/discofs/username/

  • Role of the trailing slashes during synchronisation

It is essential that you consider the role of the “trailing slash” separators in the source and target paths passed to the rsync command line.

Rsync assumes that if the source path ends with ‘/’, then all files under that path will take part in the synchronisation process, but the source directory name will not be created at the target. For instance, this command line:

rsync -e "ssh" -varl --progress --append username@login.discoverer.bg:/discofs/username/directory/ /home/username/

will transfer all files and directories, including the hidden files, located under /discofs/username/directory/ on the Discoverer login node to the /home/username/ folder on the local system. Thus, although this can be described as a 1:1 replication, the name of the source directory, namely “directory”, will not appear under /home/username/. Only the content of “directory” will be transferred. If you want the name of the source directory to appear on the target system, omit the trailing slash after the source directory name (mind the missing trailing slash at the end of the source path):

rsync -e "ssh" -varl --progress --append username@login.discoverer.bg:/discofs/username/directory /home/username/

As a result, the /discofs/username/directory, along with its content, will be transferred/synchronised to the local system (the target) as /home/username/directory.

  • Clean up the removed content during synchronisation

Whenever files or directories are removed at the source, the removal should be propagated to the target at some point. To achieve that, add the --delete-after option to the command line that invokes rsync:

rsync -e "ssh" -varl --progress --append --delete-after username@login.discoverer.bg:/discofs/username/directory /home/username/
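
Because --delete-after removes files at the target, it may be worth previewing the effect first. The sketch below is a local illustration under assumed names (temporary directories from mktemp, files keep.txt and old.txt). It uses the standard -n (--dry-run) option of rsync to list what would be deleted before running the real synchronisation.

```shell
#!/bin/sh
# Local demonstration: temporary directories play the roles of the
# source and the target.
work=$(mktemp -d)
mkdir -p "$work/src/directory" "$work/dst"
echo "keep" > "$work/src/directory/keep.txt"
echo "old"  > "$work/src/directory/old.txt"

# First synchronisation: both files reach the target.
rsync -a "$work/src/directory" "$work/dst/"

# A file is then removed at the source.
rm "$work/src/directory/old.txt"

# Dry run (-n): report what would be deleted, change nothing.
rsync -a -n -v --delete-after "$work/src/directory" "$work/dst/"

# Real run: the removal is propagated to the target.
rsync -a --delete-after "$work/src/directory" "$work/dst/"

ls "$work/dst/directory"   # keep.txt
```

The same -n option can be added to the remote command lines shown in this document to check them before any data is touched.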

Avoid using scp and sftp

Warning

Since scp cannot resume a transfer that has been interrupted, and sftp can do so only through manually issued commands (reget, reput), we do not recommend their use for data transfer. Furthermore, unlike rsync, they do not support incremental file transfer. Our support team will not give priority to any issues with those tools.