Amii HPC Documentation

Welcome to Amii’s HPC documentation. This resource covers the core concepts, procedures, and best practices needed to run compute-intensive workloads efficiently on High-Performance Computing (HPC) clusters, specifically those managed by the Alberta Machine Intelligence Institute (Amii) and its partners.

What this documentation covers

Access & Setup: Navigating account configuration with the Digital Research Alliance of Canada (DRAC).
Architecture: An overview of HPC cluster design and infrastructure.
Workload Management: Mastering the Slurm scheduler for job submission and resource allocation.
Optimization: Techniques for monitoring and tuning performance on HPC systems.
Practical Resources: Reusable templates, proven workflows, and examples developed by Amii’s Engineering and Performance team.

Throughout these guides, we use Amii’s Vulcan cluster as the primary reference and example environment.

Target Audience

While primarily designed for researchers and students with access to Amii-managed clusters, this documentation serves as a broad knowledge base. Anyone interested in the following areas will find these resources valuable:

Distributed and parallel computing
Slurm workload management
GPU optimization for AI workflows
DevOps and MLOps integration

Note

A significant portion of this content covers general HPC concepts applicable to systems beyond those managed by Amii.

Table of Contents

Getting Started with Clusters

Basics of HPC and Slurm

Slurm Examples
- Introduction to PyTorch DDP
- Introduction to PyTorch FSDP

Useful Resources

Hardware Specifications
- Vulcan