Moving Source Code and Data#

Write and develop your code, and work with smaller datasets, on your local machine. HPC clusters are the “big guns” for holding your large datasets and running your code at scale on high-performance hardware and accelerators.

Note

Your ~ (home) directory is your primary workspace, intended for small, temporary, experimental scripts.
Your ~/projects directory is for long-term research projects.
$SCRATCH is an extra workspace that offers additional, but temporary, storage.
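A quick way to check where these locations live once you're logged in to the cluster (the project name below is a placeholder):

    # Inspect your main workspaces from a login node
    echo $HOME                      # home directory: small, experimental scripts
    echo $SCRATCH                   # scratch space: temporary storage
    ls ~/projects/<project_name>    # long-term project storage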

Moving Source Code#

There are several ways to transfer your work to the cluster. For source code, we recommend using GitHub.
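As a rough sketch, a GitHub-based workflow might look like the following (the repository URL and paths are placeholders):

    # On your local machine: commit and push your latest changes
    git add .
    git commit -m "Update training scripts"
    git push

    # On the cluster (after logging in via SSH): clone the repository...
    git clone https://github.com/<your_username>/<your_repo>.git ~/projects/<project_name>/<your_repo>

    # ...or, if it is already cloned, pull the latest changes
    cd ~/projects/<project_name>/<your_repo> && git pull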

SCP (Secure Copy)#

For one-off scripts, you can simply use the scp command to copy files directly from your local machine to the cluster.

From your local terminal, run:

scp -r /path/to/local/folder <your_username>@vulcan.alliancecan.ca:~/<destination_folder>

Note

This method can be very slow if your project contains thousands of small files. In those cases, using GitHub or archiving the folder into a .tar.gz file before transferring is highly recommended.
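For example, to bundle a project folder and copy it in a single transfer (the archive name and paths are placeholders):

    # Bundle the project into one archive, then copy it over a single connection
    tar -czvf code_archive.tar.gz /path/to/local/folder
    scp code_archive.tar.gz <your_username>@vulcan.alliancecan.ca:~/<destination_folder>

    # On the cluster, unpack it in the destination folder
    cd ~/<destination_folder> && tar -xzvf code_archive.tar.gz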

Moving Data#

AI/ML/RL workloads consume a lot of data. Let’s get your data onto the cluster so your models can access it. There are a couple of options: scp / rsync, or Globus. For this documentation, we’ll assume you’re transferring data from your local machine to Vulcan.

  • Server: vulcan.alliancecan.ca

  • Path: ~/projects/<project_name>/data
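If the destination directory doesn’t exist yet, you can create it on the cluster first (the project name is a placeholder):

    # On the cluster
    mkdir -p ~/projects/<project_name>/data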

Using SCP (Secure Copy)#

scp is a straightforward way to copy files over an SSH connection.

scp /path/to/local/file.txt <username>@vulcan.alliancecan.ca:~/projects/<project_name>/data

Include the -r flag to copy directories recursively.
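For example, to copy an entire local dataset directory (paths are placeholders):

    # -r copies the directory and everything inside it
    scp -r /path/to/local/dataset <username>@vulcan.alliancecan.ca:~/projects/<project_name>/data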

Using Rsync (Remote Sync)#

rsync is the recommended tool for large transfers with many files. It can resume interrupted transfers, preserve file permissions, and compress data in transit, making it faster and more robust than scp for large datasets. It also checks for files that have already been transferred and only sends the differences.

  • To sync a directory:

    rsync -avz /path/to/local/folder/ <username>@vulcan.alliancecan.ca:~/projects/<project_name>/data
    

Note

Adding a trailing slash to the source folder copies the contents of the folder rather than the folder itself. If you’re copying compressed files (.tar.gz, .zip, etc.), remove the -z flag, since the source is already compressed.
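To illustrate the trailing-slash behaviour (paths are placeholders):

    # With a trailing slash: copies the *contents* of folder into data/
    rsync -avz /path/to/local/folder/ <username>@vulcan.alliancecan.ca:~/projects/<project_name>/data

    # Without a trailing slash: copies the folder itself, creating data/folder/ on the cluster
    rsync -avz /path/to/local/folder <username>@vulcan.alliancecan.ca:~/projects/<project_name>/data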

Handling Many Small Files#

If you have thousands of small files (such as a dataset of small images), transferring them individually is extremely slow due to the overhead of opening and closing a connection for each file. It is much more efficient to bundle and compress them first.

  1. Compress locally & transfer:

    tar -czvf data_archive.tar.gz /path/to/local/folder
    rsync -avP data_archive.tar.gz <username>@vulcan.alliancecan.ca:~/projects/<project_name>/data
    
  2. Uncompress on the cluster:

    cd ~/projects/<project_name>/data && tar -xzvf data_archive.tar.gz
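Optionally, once you’ve confirmed the extraction succeeded, you can delete the archive on the cluster to free up space (the path matches the steps above):

    # Remove the archive after verifying the extracted files are intact
    rm ~/projects/<project_name>/data/data_archive.tar.gz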
    

Using Globus#

Globus is a service, designed with researchers in mind, for transferring data from your personal machine to a cluster or between clusters. It offers a web interface, a command-line tool, and a desktop application for transferring local files to a cluster. You can also use Globus to share files with other researchers.

If you already have a CCDB account on an Alliance cluster, you have built-in access to Globus.

Below is a sneak peek of the Globus file transfer interface:

../_images/globus.png