
Slurm Basics

This page introduces basic Slurm commands to help new users get started with job scheduling in high-performance computing (HPC) environments. Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager designed for Linux clusters of all sizes. It is used to allocate compute resources, schedule jobs, and manage workloads efficiently on shared systems.

Dartmouth Discovery Cluster

Check out Dartmouth Discovery's documentation on the Slurm environment.

Basic Shell Commands

In this section, we'll explore essential command-line interface (CLI) commands that help us interact with the Slurm workload manager.

| Command | Description | Example |
|---------|-------------|---------|
| squeue | View the status of jobs in the Slurm job queue. | squeue |
| squeue --start | Show an estimate of when pending jobs are expected to start. | squeue --start |
| squeue -u | Filter the queue by user (replace USER_ID with your username). | squeue -u USER_ID |
| watch | Continuously rerun a command or query (here, every second). | watch -n1 squeue -u USER_ID |
| sbatch | Submit a job script (specify its file name) to the Slurm queue. | sbatch JOB_SCRIPT |
| scancel | Cancel a job in the Slurm queue. | scancel JOB_ID |

Placeholders such as USER_ID, JOB_SCRIPT, and JOB_ID must be replaced with your own username, job script name, and job ID.
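As a minimal sketch of how these commands fit together (using the same placeholders as the table above), a typical submit-and-monitor workflow looks like this:

```bash
# Submit the job script to the Slurm queue (prints the assigned job ID).
sbatch JOB_SCRIPT

# Watch the status of your own jobs, refreshing every second.
watch -n1 squeue -u USER_ID

# Ask Slurm for an estimated start time of your pending jobs.
squeue --start -u USER_ID

# Cancel the job if it is no longer needed.
scancel JOB_ID
```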

Understanding Queue Output

The output of squeue provides key details about jobs in the queue. This is what each column means:

| Column | Description |
|--------|-------------|
| JOBID | Unique identifier for the job. |
| PARTITION | Queue/partition where the job is submitted. |
| NAME | Name of the job (from the job script or --job-name). |
| USER | User who submitted the job. |
| ST | Current job status (e.g., R for running, PD for pending, CG for completing). |
| TIME | Time the job has been running (or pending). |
| NODES | Number of nodes requested or assigned. |
| NODELIST(REASON) | Node(s) the job is running on, or the reason why it is pending. |
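For illustration only (the job IDs, names, users, and node names below are made up), a squeue listing with these columns might look like this:

```
JOBID   PARTITION  NAME     USER     ST  TIME     NODES  NODELIST(REASON)
123456  gpuq       my_job   USER_ID  R   1:02:13  1      node01
123457  gpuq       my_job2  USER_ID  PD  0:00     1      (Priority)
```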

Dartmouth Discovery Cluster

Typical reasons for wait times in the queue include:

  • QOSGrpGRES: The group limit on generic resources (e.g., GPUs) for the requested Quality of Service has been reached; the job waits until those resources free up.
  • PartitionTimeLimit: The job's requested time limit exceeds the maximum allowed for the partition.
  • JobArrayTaskLimit: Reached the limit on the number of concurrent jobs in a job array.
  • Priority: Jobs with lower priority wait behind those with higher priority based on scheduling policies.
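To see the full reason and the requested resources for a specific pending job, one option (a small sketch; JOB_ID is a placeholder) is scontrol:

```bash
# Show detailed information for one job, including the Reason field,
# the requested time limit, partition, and generic resources (GRES).
scontrol show job JOB_ID
```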

Commonly Used sbatch Submit Script Directives

The table below explains the most frequently used #SBATCH directives. They are placed at the top of a Slurm job script to configure resource requests and job parameters; adjusting these options ensures your job runs efficiently on the cluster.

| Directive | Description | Example |
|-----------|-------------|---------|
| #SBATCH -J JOB_NAME | Sets the job's name to JOB_NAME, useful for identification and tracking. | #SBATCH -J my_job |
| #SBATCH --partition JOB_PARTITION | Specifies the partition JOB_PARTITION where the job should run. | #SBATCH --partition gpuq |
| #SBATCH --gres=gpu:N | Requests N GPUs on each node. | #SBATCH --gres=gpu:1 |
| #SBATCH -o FILE_NAME | Redirects standard output to FILE_NAME. | #SBATCH -o ./log.out |
| #SBATCH -e FILE_NAME | Redirects standard error to FILE_NAME. | #SBATCH -e ./log.err |
| #SBATCH --nodes=N | Requests N compute nodes for the job. | #SBATCH --nodes=1 |
| #SBATCH --ntasks-per-node=N | Specifies N tasks (MPI processes) per node; set to one for serial jobs. | #SBATCH --ntasks-per-node=1 |
| #SBATCH --cpus-per-task=N | Requests N CPU cores per task. | #SBATCH --cpus-per-task=1 |
| #SBATCH --time=HH:MM:SS | Sets a time limit for the job. | #SBATCH --time=00:20:00 |
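
Putting these directives together, a minimal submit script might look like the sketch below. The partition name gpuq and the command being run are assumptions taken from the examples above; substitute values that match your cluster and workload.

```bash
#!/bin/bash
#SBATCH -J my_job                  # job name shown in squeue
#SBATCH --partition gpuq           # partition to submit to (cluster-specific)
#SBATCH --gres=gpu:1               # one GPU per node
#SBATCH --nodes=1                  # single compute node
#SBATCH --ntasks-per-node=1        # one task (serial job)
#SBATCH --cpus-per-task=1          # one CPU core for that task
#SBATCH --time=00:20:00            # wall-clock limit of 20 minutes
#SBATCH -o ./log.out               # standard output file
#SBATCH -e ./log.err               # standard error file

# Replace the line below with the actual work your job performs.
echo "Running on $(hostname)"
```

Submit the script with sbatch JOB_SCRIPT and monitor it with squeue -u USER_ID, as described above.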