
Slurm Basics

This page introduces basic Slurm commands to help new users get started with job scheduling in high-performance computing (HPC) environments. Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager designed for Linux clusters of all sizes. It is used to allocate compute resources, schedule jobs, and manage workloads efficiently on shared systems.

Dartmouth Discovery Cluster

Check out Dartmouth Discovery's documentation on the Slurm environment.

Basic Shell Commands

In this section, we'll explore essential command-line interface (CLI) commands that help us interact with the Slurm workload manager.

| Command | Description | Example |
|---------|-------------|---------|
| squeue | View the status of jobs in the Slurm job queue. | squeue |
| squeue --start | Show an estimate of when pending jobs are expected to start. | squeue --start |
| squeue -u | Filter the queue by user (replace USER_ID with your username). | squeue -u USER_ID |
| watch | Continuously rerun a command or query (here, every second). | watch -n1 squeue -u USER_ID |
| sbatch | Submit a job script (specify its file name) to the Slurm queue. | sbatch JOB_SCRIPT |
| scancel | Cancel a job in the Slurm queue. | scancel JOB_ID |

Placeholders such as USER_ID, JOB_SCRIPT, and JOB_ID must be replaced with your own username, job script name, and job ID.
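As a minimal sketch of how these commands fit together (using the same placeholders as the table above), a typical submit-and-monitor workflow looks like this:

```bash
# Submit the job script to the Slurm queue (prints the assigned job ID).
sbatch JOB_SCRIPT

# Watch the status of your own jobs, refreshing every second.
watch -n1 squeue -u USER_ID

# Ask Slurm for an estimated start time of your pending jobs.
squeue --start -u USER_ID

# Cancel the job if it is no longer needed.
scancel JOB_ID
```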

Understanding Queue Output

The output of squeue provides key details about jobs in the queue. This is what each column means:

| Column | Description |
|--------|-------------|
| JOBID | Unique identifier for the job. |
| PARTITION | Queue/partition where the job is submitted. |
| NAME | Name of the job (from the job script or --job-name). |
| USER | User who submitted the job. |
| ST | Current job status (e.g., R for running, PD for pending, CG for completing). |
| TIME | Time the job has been running (or pending). |
| NODES | Number of nodes requested or assigned. |
| NODELIST(REASON) | Node(s) the job is running on, or the reason why it is pending. |
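For illustration only (the job IDs, names, users, and node names below are made up), a squeue listing with these columns might look like this:

```
JOBID   PARTITION  NAME     USER     ST  TIME     NODES  NODELIST(REASON)
123456  gpuq       my_job   USER_ID  R   1:02:13  1      node01
123457  gpuq       my_job2  USER_ID  PD  0:00     1      (Priority)
```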

Dartmouth Discovery Cluster

Typical reasons for wait times in the queue include:

  • QOSGrpGRES: The group limit on generic resources (e.g., GPUs) for the requested Quality of Service has been reached; the job waits until those resources free up.
  • PartitionTimeLimit: The job's requested time limit exceeds the maximum allowed for the partition.
  • JobArrayTaskLimit: Reached the limit on the number of concurrent jobs in a job array.
  • Priority: Jobs with lower priority wait behind those with higher priority based on scheduling policies.
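To see the full reason and the requested resources for a specific pending job, one option (a small sketch; JOB_ID is a placeholder) is scontrol:

```bash
# Show detailed information for one job, including the Reason field,
# the requested time limit, partition, and generic resources (GRES).
scontrol show job JOB_ID
```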

Commonly Used sbatch Submit Script Directives

The table below explains the most frequently used #SBATCH directives. They are placed at the top of a Slurm job script to configure resource requests and job parameters; adjusting these options ensures your job runs efficiently on the cluster.

| Directive | Description | Example |
|-----------|-------------|---------|
| #SBATCH -J JOB_NAME | Sets the job's name to JOB_NAME, useful for identification and tracking. | #SBATCH -J my_job |
| #SBATCH --partition JOB_PARTITION | Specifies the partition JOB_PARTITION where the job should run. | #SBATCH --partition gpuq |
| #SBATCH --gres=gpu:N | Requests N GPUs on each node. | #SBATCH --gres=gpu:1 |
| #SBATCH -o FILE_NAME | Redirects standard output to FILE_NAME. | #SBATCH -o ./log.out |
| #SBATCH -e FILE_NAME | Redirects standard error to FILE_NAME. | #SBATCH -e ./log.err |
| #SBATCH --nodes=N | Requests N compute nodes for the job. | #SBATCH --nodes=1 |
| #SBATCH --ntasks-per-node=N | Specifies N tasks (MPI processes) per node; set to one for serial jobs. | #SBATCH --ntasks-per-node=1 |
| #SBATCH --cpus-per-task=N | Requests N CPU cores per task. | #SBATCH --cpus-per-task=1 |
| #SBATCH --time=HH:MM:SS | Sets a time limit for the job. | #SBATCH --time=00:20:00 |
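
Putting these directives together, a minimal submit script might look like the sketch below. The partition name gpuq and the command being run are assumptions taken from the examples above; substitute values that match your cluster and workload.

```bash
#!/bin/bash
#SBATCH -J my_job                  # job name shown in squeue
#SBATCH --partition gpuq           # partition to submit to (cluster-specific)
#SBATCH --gres=gpu:1               # one GPU per node
#SBATCH --nodes=1                  # single compute node
#SBATCH --ntasks-per-node=1        # one task (serial job)
#SBATCH --cpus-per-task=1          # one CPU core for that task
#SBATCH --time=00:20:00            # wall-clock limit of 20 minutes
#SBATCH -o ./log.out               # standard output file
#SBATCH -e ./log.err               # standard error file

# Replace the line below with the actual work your job performs.
echo "Running on $(hostname)"
```

Submit the script with sbatch JOB_SCRIPT and monitor it with squeue -u USER_ID, as described above.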