Moving Source Code and Data#

Write and develop your code, and work with smaller datasets, on your local machine. HPC clusters are the “big guns” for holding your large datasets and running your code at scale on high-performance hardware and accelerators.

Note

Your ~ (home) directory is your primary workspace, intended for small, temporary, experimental scripts.
Your ~/projects directory is for long-term research projects.
$SCRATCH is an extra workspace that offers additional, but temporary, storage.
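A quick way to check where these locations live once you're logged in to the cluster (the project name below is a placeholder):

    # Inspect your main workspaces from a login node
    echo $HOME                      # home directory: small, experimental scripts
    echo $SCRATCH                   # scratch space: temporary storage
    ls ~/projects/<project_name>    # long-term project storage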

Moving Source Code#

There are several ways to transfer your work to the cluster. For source code, we recommend using GitHub.
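As a rough sketch, a GitHub-based workflow might look like the following (the repository URL and paths are placeholders):

    # On your local machine: commit and push your latest changes
    git add .
    git commit -m "Update training scripts"
    git push

    # On the cluster (after logging in via SSH): clone the repository...
    git clone https://github.com/<your_username>/<your_repo>.git ~/projects/<project_name>/<your_repo>

    # ...or, if it is already cloned, pull the latest changes
    cd ~/projects/<project_name>/<your_repo> && git pull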

SCP (Secure Copy)#

For one-off scripts, you can simply use the scp command to copy files directly from your local machine to the cluster.

From your local terminal, run:

scp -r /path/to/local/folder <your_username>@vulcan.alliancecan.ca:~/<destination_folder>

Note

This method can be very slow if your project contains thousands of small files. In those cases, using GitHub or archiving the folder into a .tar.gz file before transferring is highly recommended.
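For example, to bundle a project folder and copy it in a single transfer (the archive name and paths are placeholders):

    # Bundle the project into one archive, then copy it over a single connection
    tar -czvf code_archive.tar.gz /path/to/local/folder
    scp code_archive.tar.gz <your_username>@vulcan.alliancecan.ca:~/<destination_folder>

    # On the cluster, unpack it in the destination folder
    cd ~/<destination_folder> && tar -xzvf code_archive.tar.gz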

Moving Data#

AI/ML/RL workloads consume a lot of data. Let’s get your data onto the cluster so your models can access it. There are a couple of options: scp / rsync, or Globus. For this documentation, we’ll assume you’re transferring data from your local machine to Vulcan.

  • Server: vulcan.alliancecan.ca

  • Path: ~/projects/<project_name>/data
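If the destination directory doesn’t exist yet, you can create it on the cluster first (the project name is a placeholder):

    # On the cluster
    mkdir -p ~/projects/<project_name>/data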

Using SCP (Secure Copy)#

scp is a straightforward way to copy files over an SSH connection.

scp /path/to/local/file.txt <username>@vulcan.alliancecan.ca:~/projects/<project_name>/data

Include the -r flag to copy directories recursively.
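For example, to copy an entire local dataset directory (paths are placeholders):

    # -r copies the directory and everything inside it
    scp -r /path/to/local/dataset <username>@vulcan.alliancecan.ca:~/projects/<project_name>/data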

Using Rsync (Remote Sync)#

rsync is the recommended tool for large transfers with many files. It can resume interrupted transfers, preserve file permissions, and compress data in transit, making it faster and more robust than scp for large datasets. It also checks for files that have already been transferred and only sends the differences.

  • To sync a directory:

    rsync -avz /path/to/local/folder/ <username>@vulcan.alliancecan.ca:~/projects/<project_name>/data
    

Note

Adding a trailing slash to the source folder copies the contents of the folder rather than the folder itself. If you’re copying compressed files (.tar.gz, .zip, etc.), remove the -z flag, since the source is already compressed.
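To illustrate the trailing-slash behaviour (paths are placeholders):

    # With a trailing slash: copies the *contents* of folder into data/
    rsync -avz /path/to/local/folder/ <username>@vulcan.alliancecan.ca:~/projects/<project_name>/data

    # Without a trailing slash: copies the folder itself, creating data/folder/ on the cluster
    rsync -avz /path/to/local/folder <username>@vulcan.alliancecan.ca:~/projects/<project_name>/data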

Handling Many Small Files#

If you have thousands of small files (such as a dataset of small images), transferring them individually is extremely slow due to the overhead of opening and closing a connection for each file. It is much more efficient to bundle and compress them first.

  1. Compress locally & transfer:

    tar -czvf data_archive.tar.gz /path/to/local/folder
    rsync -avP data_archive.tar.gz <username>@vulcan.alliancecan.ca:~/projects/<project_name>/data
    
  2. Uncompress on the cluster:

    cd ~/projects/<project_name>/data && tar -xzvf data_archive.tar.gz
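Optionally, once you’ve confirmed the extraction succeeded, you can delete the archive on the cluster to free up space (the path matches the steps above):

    # Remove the archive after verifying the extracted files are intact
    rm ~/projects/<project_name>/data/data_archive.tar.gz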
    

Using Globus#

Globus is a service, designed with researchers in mind, for transferring data from your personal machine to a cluster or between clusters. It offers a web interface, a command-line tool, and a desktop application for transferring local files to a cluster. You can also use Globus to share files with other researchers.

If you already have a CCDB account on an Alliance cluster, you have built-in access to Globus.

Below is a sneak peek of the Globus file transfer interface:

../_images/globus.png