Launching Parallel Job Steps on Clusters#

Note

This section contains multiple example job scripts. You can download the companion examples from GitHub to your home (~) directory and run:

git clone https://github.com/Amii-Engineering/amii-doc-examples.git
cd amii-doc-examples/src/srun

srun#

While sbatch is used to submit jobs to a queue for later execution, srun is the primary tool for launching and managing parallel tasks on a cluster. The complete srun manual can be found in the official Slurm documentation.

The core concept introduced by srun is the Job Step. In Slurm, a job step runs parallel tasks on reserved nodes within an existing job resource allocation.

  • A Job reserves resources from the cluster (Nodes, CPUs, GPUs, Memory, Time).

  • A Job Step (invoked via srun) puts those allocated resources to work.

The figure below demonstrates the relationship between a job and its constituent job steps.

Diagram showing the hierarchy of a Slurm Job containing multiple Job Steps.
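As a minimal sketch of this relationship (the script below is illustrative and not part of the companion repository), a job script reserves resources with #SBATCH directives and then launches job steps with srun:

#!/bin/bash
#SBATCH --nodes=2             # the Job: reserve 2 nodes...
#SBATCH --ntasks=4            # ...and 4 tasks in total
#SBATCH --mem-per-cpu=500M
#SBATCH --time=00:05:00

# Each srun call below creates a separate Job Step that runs
# inside the resources reserved above.
srun --ntasks=4 hostname      # step 0: four parallel tasks
srun --ntasks=1 hostname      # step 1: a single task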

Key Characteristics of Job Steps#

  1. Direct Invocation: Every time you call srun, you initiate a new job step.

  2. Resource Subsetting: A step can utilize all resources in your job or just a small portion, but it cannot exceed the total job allocation. This allows for complex workflows where you might run one step on nodes 1–5 and a simultaneous step on nodes 6–10 (a sketch of this pattern appears after this list).

    Important

    The job allocation must also account for the resources used by the job script itself. You can verify this behavior by running the following command:

    sbatch test-script-memory-usage.sh
    

    In that example, the memory consumed by the job script itself, plus the task running on the same node, exceeds the job’s memory limit, so the job fails with an OOM (Out Of Memory) error. For this reason, avoid running memory-heavy programs directly in the job script.

  3. Independent Tracking: Slurm tracks each step individually. When using monitoring tools like sacct, you will see the main Job ID followed by a step suffix (e.g., 12345.0, 12345.batch).
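As a rough sketch of resource subsetting (the node counts and the hostname command are placeholders), two steps can each claim part of a four-node allocation and run concurrently by backgrounding the srun calls:

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks=4

# Each step uses only 2 of the 4 allocated nodes. The trailing '&'
# puts the step in the background so both can run at the same time;
# 'wait' keeps the job script alive until both steps finish.
srun --nodes=2 --ntasks=2 hostname &
srun --nodes=2 --ntasks=2 hostname &
wait

Whether the two steps actually overlap depends on each step’s request fitting into the resources still free in the allocation; this is revisited in the pitfalls section below.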

Common Types of Steps#

When reviewing job history with sacct (covered in a later section), you will encounter steps generated both manually and automatically:

Step Suffix     Description
.batch          The execution of the job script itself (created by sbatch).
.extern         A special step for administrative and monitoring purposes (e.g., job cleanup).
.0, .1, etc.    Standard job steps created every time you call srun.

srun Options#

srun and sbatch share most resource-requesting options. However, their execution models differ: sbatch runs your job script once on the first allocated node, while srun launches multiple instances (tasks) in parallel across the allocation.

Option                  Description
--nodes / -N            Number of physical nodes to use for this step.
--ntasks / -n           Total number of parallel tasks (processes) to launch.
--ntasks-per-node       Maximum number of tasks to be invoked on each allocated node.
--cpus-per-task / -c    CPU cores allocated to each task (useful for multithreading).
--mem                   Total memory requested per node.
--mem-per-cpu           Memory requested per allocated CPU.
--gpus / -G             Total GPUs requested for the specific job step.
--time / -t             Maximum runtime limit (D-HH:MM:SS) for the job step.

Tip

Inside a job script, you rarely need to repeat options in srun. It automatically inherits the environment from the #SBATCH directives. Use srun options only when you need to override the job-level defaults for a specific step.
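For instance, here is a hedged sketch of overriding the job-level defaults for one specific step (the executables parallel_app and threaded_app are illustrative placeholders):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1

# This step inherits the job-level settings:
# 8 parallel tasks with 1 CPU each.
srun ./parallel_app

# This step overrides the defaults: a single task using all 8 CPUs,
# e.g. one multithreaded process. The override still fits inside the
# job allocation (1 node, 8 CPUs).
srun --ntasks=1 --cpus-per-task=8 ./threaded_app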

Common srun Pitfalls#

  1. Resource Sharing on the Same Node: Using per-node options like --mem with srun can be confusing, as resources are allocated differently depending on how tasks are distributed across nodes. Because --mem is enforced on a per-node basis, it can create inconsistencies in the actual resources available to each task.

    For example, consider the command srun -n 2 --mem=500M my_script.sh:

    • Same Node: If both tasks run on the same node, they share a single 500M memory pool.

    • Different Nodes: If the tasks are launched on different nodes, Slurm reserves 500M on each node, giving each task exclusive access to 500M.

    This behavior can make it difficult to determine the exact resource allocation for each task and may lead to unexpected failures. You can test this behavior using:

    sbatch test-mem.sh
    

    How can you modify the script to avoid the OOM error?

    To ensure each task receives a guaranteed amount of memory regardless of node distribution, use the --mem-per-cpu or --mem-per-gpu flags instead of --mem. A sketch of this fix appears after this list of pitfalls.

  2. Resource Contention in Parallel Steps: When executing multiple steps simultaneously (using the bash background operator &), you must ensure the sum of resources requested by all background steps fits within the total job allocation. If the combined request exceeds the allocation, subsequent srun calls will either block (wait for available resources) or fail.

    Hands-on Exercises:

    1. Observe Resource Blocking: Run the memory contention script:

      sbatch test-parallel-step-mem-contention.sh
      

      Examine the output timestamps. You will notice the steps ran sequentially because the job’s memory limit prevented them from starting at the same time.

    2. Expand the Resource “Envelope”: Open the script and change the allocation from #SBATCH --mem=500M to #SBATCH --mem=1G, then re-submit the job. Note how the timestamps change: with a larger memory “envelope”, Slurm can now launch the steps in parallel.

    3. CPU and GPU Binding: The same logic applies to CPUs and GPUs. Run the following scripts:

      sbatch test-cpu-binding.sh
      sbatch test-parallel-step-gpu-contention.sh
      

      Challenge: Can you identify which specific resource is being over-subscribed in these scripts? Try to modify the srun options so that both steps execute simultaneously.

  3. Shell Redirection and Pipes: Standard piping (e.g., cat data.txt | srun ./app) is often problematic. By default, only Task 0 receives the input. For parallel input, use the --input flag or handle file I/O within your application code.

  4. The “Task Multiplier” Effect: Remember that srun launches N copies of your executable. If your code is not designed for MPI or parallel coordination (e.g., a standard Python script), running srun -n 40 python script.py will result in 40 independent processes likely overwriting each other’s files.

  5. Submitting srun within a Loop: Using a bash loop to call srun thousands of times creates massive overhead for the Slurm controller (slurmctld) and thus slows down the entire cluster.
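Tying pitfalls 1 and 2 together, the sketch below (the executables task_a and task_b are illustrative placeholders) requests memory per CPU rather than per node and keeps the sum of the concurrent step requests inside the job allocation:

#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500M    # job envelope: 2 CPUs x 500M

# Each step requests half of the allocation. Because memory is
# requested per CPU, every task is guaranteed 500M no matter which
# node it lands on.
srun --ntasks=1 --mem-per-cpu=500M ./task_a &
srun --ntasks=1 --mem-per-cpu=500M ./task_b &

# Wait for both background steps; otherwise the job script exits
# and Slurm terminates any steps that are still running.
wait

Depending on the cluster’s Slurm version and configuration, you may also need srun’s --exact flag so that each concurrent step claims only the CPUs it requests rather than every CPU on its node.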