Introduction to HPC#

To utilize cluster resources effectively, it is essential to understand the fundamentals of High-Performance Computing (HPC). Blindly copying and pasting scripts—whether generated by AI or shared by peers—often leads to inefficient resource usage, which can result in:

  • Longer queue wait times for you and the community.

  • Increased job execution times due to resource bottlenecks.

  • Reduced availability of shared hardware.

  • Lower overall cluster utilization.

  • Missed paper deadlines (a situation best avoided!).

By investing a small amount of time to learn these basics, you will significantly improve your workflow and save countless hours throughout your research journey.

Do not be intimidated by the term HPC. At its simplest, an HPC cluster is a collection of hundreds or thousands of individual computers (called nodes) interconnected via a high-speed network.

Each node is fundamentally similar to a high-end personal computer. If you connected multiple PCs together and installed a “Workload Manager” (like Slurm), you would have a basic cluster.

The picture below shows the conceptual architecture of an HPC cluster.

../_images/hpc-arch.jpg

Login Nodes#

When you connect to the cluster via SSH, you land on a Login Node. This environment is designed for administrative tasks: managing files, editing source code, and submitting jobs to the scheduler.

Important

Never run heavy computations on a login node. Because these nodes are shared by all active users, running a resource-intensive script is like blocking the front door to a building. It slows down the entire system for everyone.
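In practice, the workflow looks roughly like this (the hostname, username, and script name below are placeholders; use the connection details provided for your cluster):

    # Connect to a login node (address and username are placeholders)
    ssh your_username@cluster-login.example.org

    # On the login node: light tasks only, such as editing and submitting
    nano my_job.sh                      # edit a job script
    sbatch my_job.sh                    # hand the heavy work to the scheduler
    salloc --time=1:00:00 --mem=4G      # or request a short interactive allocation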

Compute Nodes#

These are the “workhorses” where your actual code executes. They are typically “headless” (no monitors or mice) to maximize efficiency and density. Clusters often categorize compute nodes by their hardware strengths:

  • CPU Nodes: Optimized for general logic, serial processing, and heavy numerical work that is not dominated by matrix multiplication (which is the core of most AI training).

  • GPU Nodes: Specialized for massive parallel tasks, such as AI/Deep Learning training or molecular dynamics simulations.

  • High-Memory Nodes: Designed for data-intensive tasks requiring hundreds of gigabytes (or terabytes) of RAM.

Note

The Vulcan cluster consists exclusively of GPU nodes, making it highly optimized for AI training. If your program does not utilize GPU acceleration, a CPU-only or heterogeneous cluster like Fir, Nibi, or Rorqual may be more appropriate for your needs.
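The node type you land on is determined by what your job script asks for. As a rough sketch, a Slurm batch header requesting a single GPU might look like this (the resource values and script name are illustrative; consult your cluster's documentation for its actual partitions and limits):

    #!/bin/bash
    #SBATCH --job-name=train-model
    #SBATCH --gres=gpu:1            # request one GPU (lands on a GPU node)
    #SBATCH --cpus-per-task=8       # CPU cores to keep the GPU fed
    #SBATCH --mem=32G               # RAM; very large requests belong on high-memory nodes
    #SBATCH --time=02:00:00         # walltime limit

    python train.py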

Shared Storage#

HPC systems utilize a Parallel File System. Unlike your laptop’s internal drive, this storage is mounted across every node simultaneously.

On clusters, choosing the right directory is critical for both performance and data safety:

Home Directory (~)#

  • Purpose: Storing source code, configuration files, and job scripts.

  • Characteristics: Relatively small capacity; backed up via a tape system. It is NOT intended for large-scale data I/O.

Project Directory#

  • Purpose: Sharing data among research group members.

  • Characteristics: Significantly larger than home; also backed up via tape. Because backups are expensive, data stored here should remain relatively static.

Scratch Directory#

  • Purpose: High-performance I/O for large datasets and active job files.

  • Characteristics: Large and fast. However, it is NOT backed up, and old files are regularly purged to clear space.

Important

Do not treat scratch as long-term storage. Move important results to your project or home directory promptly to prevent data loss.
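One simple way to do this is with rsync from a login node (the paths below are placeholders; your actual project path depends on your allocation):

    # Copy finished results from scratch into the backed-up project space
    rsync -av ~/scratch/experiment_42/results/ ~/projects/my-group/experiment_42/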

SLURM_TMPDIR#

  • Purpose: The fastest possible storage for temporary files needed only during a job’s execution.

  • Characteristics: This environment variable points to a local drive physically located inside each compute node. Because it is local (not networked), it offers the lowest latency and highest speed.

  • Lifecycle: The directory and its contents are automatically deleted when the job ends.

Warning

The space in SLURM_TMPDIR is shared with all other jobs running on that specific node. Because it is currently not subject to a quota, writes may fail if the physical disk fills up.
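A common pattern is to stage input data into SLURM_TMPDIR at the start of a job and copy results back to scratch before the job ends. A minimal sketch of the relevant job-script lines (dataset, script, and output names are placeholders):

    # Inside a job script: stage data onto the node-local disk for fast I/O
    cp ~/scratch/dataset.tar "$SLURM_TMPDIR/"
    tar -xf "$SLURM_TMPDIR/dataset.tar" -C "$SLURM_TMPDIR"

    # Run the workload against the local copy
    python train.py --data "$SLURM_TMPDIR/dataset"

    # Copy results back before the job ends; SLURM_TMPDIR is wiped afterwards
    cp -r "$SLURM_TMPDIR/results" ~/scratch/experiment_42/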

Storage quotas for Vulcan and other clusters are detailed in the Filesystem quotas and policies section of this page.

Note

A note on Vulcan Storage Performance: To simplify your workflow, home, project, and scratch on Vulcan all reside on high-speed NVMe disks. However, to maintain cost-efficiency, files not accessed for over 2 weeks are automatically moved to slower spinning disks. The first time you access “cold” data, you may experience a slight delay while the system moves the files back to the NVMe disks.

The Workload Manager#

Beyond physical hardware, the workload manager is the most critical software component of a high-performance cluster.

A workload manager (also known as a job scheduler) is a specialized system designed to orchestrate the shared use of a cluster’s computing resources. It acts as the “brain” of the cluster, ensuring that thousands of individual nodes function as a single, cohesive unit.

The manager performs several core functions:

Resource Allocation#

The manager maintains a real-time inventory of every available CPU core, GPU, and byte of RAM across the cluster. When you submit a job request, it ensures the specific resources you require are physically available and reserved exclusively for your use.
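On a Slurm-managed cluster (Vulcan uses Slurm, as noted below), you can glimpse this inventory yourself from a login node with standard commands (output columns depend on the chosen format string, and the node name is a placeholder):

    # Partition, node count, generic resources (e.g. GPUs), and memory per node
    sinfo -o "%P %D %G %m"

    # Full hardware details and current state of one node
    scontrol show node <nodename>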

Job Scheduling#

When you submit a task, it enters a global queue. Rather than a simple “first-come, first-served” approach, the workload manager uses scheduling policies (such as Fair Share) to prioritize jobs. It balances several factors, including:

  • How long the job has been waiting in the queue.

  • The size of the request (number of nodes/GPUs).

  • Your recent resource usage compared to other users.
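On Slurm-based systems, these factors are not hidden from you; standard commands expose them (shown with common flags, though the output details vary by site configuration):

    squeue -u $USER        # your jobs and their current state in the queue
    sprio -u $USER         # priority factors (age, fair share, job size) of your pending jobs
    sshare -u $USER        # your recent usage and fair-share standing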

Execution & Reporting#

Once resources are available, the manager “launches” your script onto the assigned compute nodes. During execution, it:

  • Monitors the health and status of your job.

  • Enforces “Walltime”, terminating jobs that exceed their requested time limit so that hung or runaway jobs cannot hold resources indefinitely.

  • Logs detailed accounting data, such as CPU time, peak memory usage, and energy consumption, for later optimization and reporting.
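On Slurm-based clusters such as Vulcan, this accounting data can be queried after a job finishes with sacct. A brief sketch (the job ID is a placeholder, and the available fields depend on the site's accounting setup):

    # Runtime, peak memory, and exit state of a specific job
    sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State,ExitCode

    # Your recent jobs since a given date
    sacct -u $USER --starttime=2025-01-01 --format=JobID,Elapsed,MaxRSS,State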

Most modern HPC clusters, including Vulcan, utilize Slurm.

Developed by SchedMD (now owned by NVIDIA), Slurm is the industry-standard, open-source workload manager. It is widely used among the world’s TOP500 supercomputers thanks to its massive scalability and efficiency.

In the next section, we will dive into practical Slurm usage to help you efficiently harness the full power of the Vulcan cluster!