Monitoring and Managing Jobs

Besides sbatch and srun, Slurm provides a set of commands to monitor and manage your submitted jobs.

squeue

The squeue command is used to view the status of jobs currently managed by Slurm. It provides a real-time snapshot of:

  • Which jobs are currently running.

  • Which jobs are waiting in the queue (pending).

  • Why a specific job might be delayed (reason codes).

squeue Options

| Option | Description | Example |
|---|---|---|
| --user / -u | View jobs for a specific user. | squeue -u jsmith |
| --jobs / -j | Inspect a specific job ID. | squeue -j 123 |
| --states / -t | Filter by state (e.g., RUNNING, PENDING). | squeue -t R,PD |
| --long / -l | Report more of the available information for the selected jobs or job steps. | squeue -l |

Note

If no user is specified, jobs from all users will be listed. On Vulcan, you can use the alias sq to quickly list your own jobs.

Example Output:

Example of squeue command output
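For a quick summary of your queue, squeue output can be piped through standard text tools. The sketch below assumes squeue's -h (no header) and -o (custom format) options, with %T printing the full state name. count_states is a hypothetical helper for illustration, not part of Slurm.

```shell
# Hypothetical helper (not part of Slurm): reads one job state per line
# on stdin and prints "STATE COUNT" for each distinct state.
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Intended use on a cluster (assumes squeue is on PATH):
#   squeue -u "$USER" -h -o "%T" | count_states
```

Keeping the helper separate from the squeue call means it can be tried on saved output without touching the scheduler.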

Important squeue Output Columns

  • ST: Shows the current lifecycle phase of your job:
    • R (Running): The job is currently executing on compute nodes.

    • PD (Pending): The job is waiting for resources (nodes, CPUs, or GPUs).

    • CG (Completing): The job is finishing up and releasing resources.

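  • NODELIST(REASON): For running jobs, lists the allocated nodes. For pending jobs, shows in parentheses why the job has not started:
    • (None): No reason is set. Running jobs have no pending reason, so this column shows their node list instead.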

    • Priority: Higher-priority jobs are queued ahead of this one in the same partition.

    • Resources: Waiting for requested nodes/CPUs to become available.

    • Dependency: Waiting for a prerequisite job to finish.

    • BeginTime: The scheduled start time has not yet been reached.
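The reason codes above can be turned into a small lookup for annotating pending jobs. This is a minimal sketch: explain_reason and annotate_pending are hypothetical helpers, and the input layout assumes a custom squeue format of "%i|%t|%r" (job ID, compact state, reason).

```shell
# Hypothetical helper: maps a squeue reason code to the explanation
# given in the list above. Unknown codes are passed through unchanged.
explain_reason() {
    case "$1" in
        Priority)   echo "higher-priority jobs are ahead in this partition" ;;
        Resources)  echo "waiting for requested nodes/CPUs to become available" ;;
        Dependency) echo "waiting for a prerequisite job to finish" ;;
        BeginTime)  echo "scheduled start time not yet reached" ;;
        None)       echo "no reason set" ;;
        *)          echo "$1 (see the squeue manual)" ;;
    esac
}

# Reads "jobid|state|reason" lines on stdin and annotates pending (PD) jobs.
annotate_pending() {
    while IFS='|' read -r jobid state reason; do
        [ "$state" = "PD" ] || continue
        echo "$jobid: $(explain_reason "$reason")"
    done
}

# Intended use on a cluster (assumes squeue is on PATH):
#   squeue -u "$USER" -h -o "%i|%t|%r" | annotate_pending
```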

Full documentation: squeue manual

scancel

If sbatch is the “submit” button and squeue is the “monitor,” then scancel is your “stop” or “undo” button. It is used to terminate jobs or send signals to specific job steps.

  • For Pending Jobs: Removes the job from the queue immediately.

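  • For Running Jobs: Terminates the job. By default, Slurm sends SIGTERM to the job's processes on the compute nodes, followed by SIGKILL if they have not exited after the site-configured grace period. Use --signal to send a different signal instead.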

scancel Options

| Option | Description | Example |
|---|---|---|
| --state / -t | Restrict scancel to jobs in a specific state. | scancel -t PENDING |
| --signal / -s | Send a specific signal (name or number). | scancel -s TERM 123.1 |

In most cases, simply running scancel <job_id> is sufficient to kill a job.
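When a process receives a catchable signal such as TERM (for example via scancel -s TERM), it can trap the signal and clean up before exiting. The following is a minimal sketch under stated assumptions: graceful_worker and its checkpoint message are illustrative placeholders, and which processes actually receive the signal depends on how scancel is invoked and on site configuration.

```shell
# Hypothetical worker: loops until it receives SIGTERM, then reports
# that it is cleaning up. The "checkpoint" step is only a placeholder.
graceful_worker() {
    stop_requested=0
    # The handler only sets a flag; the loop notices it on its next pass.
    trap 'stop_requested=1' TERM
    while [ "$stop_requested" -eq 0 ]; do
        sleep 0.1   # stand-in for one unit of real work
    done
    echo "caught SIGTERM, saving checkpoint"
}

# Local try-out without Slurm:
#   graceful_worker & pid=$!; sleep 1; kill -TERM "$pid"; wait "$pid"
```

SIGKILL cannot be trapped, so this pattern only helps for signals the process is allowed to handle.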

Full documentation: scancel manual

sacct

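The sacct command displays accounting data for jobs and job steps recorded by Slurm, including jobs that have already finished. By default it reports only your own jobs since midnight of the current day; use --starttime to look further back. It is useful for: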

  • Finding exit codes and reasons for job failure.

  • Checking actual resource usage to tune future requests.

sacct Options

| Option | Description | Example |
|---|---|---|
| --jobs / -j | Display info about specific job IDs or steps. | sacct -j 123 |
| --name | Display jobs matching specific names. | sacct --name=test-cpu-binding |
| --starttime / -S | Select jobs active after a specific time. | sacct -S 2026-01-01 |
| --endtime / -E | Select jobs active before a specific time. | sacct -S 2026-01-01 -E 2026-01-02 |
| --format | Customize output columns (e.g., --format=JobID,JobName,State,MaxRSS). | sacct --format=Jobid,Elapsed,MaxRSS,ExitCode,State |

Example Output:

Example of sacct command output
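To pick out jobs that did not finish cleanly, sacct's pipe-delimited output can be filtered with awk. The sketch below assumes the -X (one line per job), -n (no header), and -P (pipe-delimited) options. list_unsuccessful is a hypothetical helper, not a Slurm command.

```shell
# Hypothetical helper: reads "JobID|State|ExitCode" lines on stdin and
# prints every job whose state is not COMPLETED, RUNNING, or PENDING.
list_unsuccessful() {
    awk -F'|' '$2 != "COMPLETED" && $2 != "RUNNING" && $2 != "PENDING" {
        print $1 ": " $2 " (exit " $3 ")"
    }'
}

# Intended use on a cluster (assumes sacct is on PATH):
#   sacct -X -n -P -S 2026-01-01 --format=JobID,State,ExitCode | list_unsuccessful
```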

Important sacct Output Columns

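  • CPUTime: CPU time allocated to the job (Elapsed time × number of allocated CPUs). For example, a job that ran for 1 hour on 4 CPUs reports 04:00:00. This is not the CPU time actually consumed; see the TotalCPU field for that.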

  • Elapsed: The actual wall-clock time the job ran.

  • MaxRSS: The maximum resident set size (memory) used by any task in the job.

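  • ExitCode: The exit status, written as two numbers separated by a colon (e.g., 0:0). The number before the colon is the exit code returned by the job; the number after the colon is the signal that terminated it, or 0 if it was not killed by a signal.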

  • State:
    • CA (CANCELLED): Job was manually cancelled.

    • CD (COMPLETED): Finished successfully (exit code 0).

    • F (FAILED): Terminated with a non-zero exit code.

    • OOM (OUT_OF_MEMORY): Job was killed for exceeding memory limits.

    • TO (TIMEOUT): Job reached its allocated time limit.
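The ExitCode field can be split into its two parts with plain shell parameter expansion. This is a minimal sketch; decode_exit_code is a hypothetical helper name and the wording of the messages is illustrative.

```shell
# Hypothetical helper: splits an sacct ExitCode value of the form
# "code:signal" and prints a short human-readable summary.
decode_exit_code() {
    code=${1%%:*}     # part before the colon: exit code
    signal=${1##*:}   # part after the colon: terminating signal, 0 if none
    if [ "$signal" -ne 0 ]; then
        echo "terminated by signal $signal"
    elif [ "$code" -ne 0 ]; then
        echo "exited with code $code"
    else
        echo "exited normally"
    fi
}

# Example: decode_exit_code "0:9"
```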

Tip

Pay close attention to MaxRSS. Use this value to adjust your future memory requests (--mem) to ensure your jobs have enough resources without wasting system capacity.
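One way to act on this tip is to convert a reported MaxRSS value into a --mem request with some headroom. The sketch below makes several assumptions: MaxRSS carries a K, M, G, or T suffix (a bare number is treated as bytes), the 20% default headroom is an arbitrary choice, and maxrss_to_mb and suggest_mem are hypothetical helpers.

```shell
# Hypothetical helper: converts a MaxRSS value such as "524288K" or "1.5G"
# to whole megabytes, rounded up. A bare number is assumed to be bytes.
maxrss_to_mb() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)
        num = v + 0
        if      (unit == "K") mb = num / 1024
        else if (unit == "M") mb = num
        else if (unit == "G") mb = num * 1024
        else if (unit == "T") mb = num * 1024 * 1024
        else                  mb = num / (1024 * 1024)
        r = int(mb); if (r < mb) r++
        print r
    }'
}

# Suggests a --mem value: MaxRSS plus a headroom percentage
# (second argument, default 20, which is an arbitrary assumption).
suggest_mem() {
    mb=$(maxrss_to_mb "$1")
    pct=${2:-20}
    awk -v mb="$mb" -v pct="$pct" 'BEGIN {
        need = mb * (100 + pct) / 100
        r = int(need); if (r < need) r++
        print "--mem=" r "M"
    }'
}

# Example: suggest_mem 524288K
```

The right amount of headroom depends on how much memory use varies between runs, so treat the suggested value as a starting point.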

Full documentation: sacct manual