SLURM Users

Basic Commands & Job Submission

This document will provide information on submitting jobs, removing jobs, checking job status, and interacting with the scheduler.

Key Commands

sbatch <script_name>: Submit jobs. Run man sbatch to display detailed explanations of each available option. These options can be added to the command line or to your submit script.

squeue: Display the status of your jobs. The man page for squeue provides detailed explanations of each available option. Useful options include -u [user_name] to show a single user's jobs and -j [job_id] for detailed information about a specific job.

sjobets: Display the status and estimated start time (ets) of all jobs in the queue. Takes the same options as squeue.

scancel <job_id>: Delete/stop jobs. Again, the man page provides further information.

sinfo: Display a status summary of the entire cluster (and all partitions). Output can be customized to display total, used, and available cores per partition/node.
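
For example, a typical session might look like the following (the script name, username, and job ID are placeholders):

[user@login0 ~]$ sbatch myjob.sh        # submit the job script
Submitted batch job 12345
[user@login0 ~]$ squeue -u user         # show all of your queued and running jobs
[user@login0 ~]$ squeue -j 12345        # detailed status for one job
[user@login0 ~]$ scancel 12345          # stop the job if needed
[user@login0 ~]$ sinfo                  # summary of partitions and nodes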

Complete command-line options can be found by appending --help to any of the above commands or by consulting the manual pages, e.g., man sbatch.

Submitting Jobs

You will need to create a submit script to run your job on the cluster. The submit script will specify what type of hardware you wish to run on, what priority your job has, what to do with console output, etc. We will look at SLURM submit scripts for Serial and Parallel jobs so that you may have a good understanding of how they work and what it is they do.

SBATCH Parameters

#!/bin/bash
#SBATCH --mail-type=ALL,TIME_LIMIT_50
#SBATCH --mail-user=user@usf.edu
#SBATCH --output=/home/u/user/Documents/output.%j
#SBATCH --partition=amd_2021
#SBATCH --qos=preempt_short

--mail-type=: Sends an email when the specified event occurs. Options include: NONE, BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, TIME_LIMIT_50, and ARRAY_TASKS. Multiple values may be specified as a comma-separated list.

--mail-user=user@domain.com: Sends email notifications to the specified address.

--output=/home/u/user/output.%j: Writes job output to the specified file; %j is replaced by the job ID.

--partition=name: Specifies the partition to use.

--qos=name: Specifies the QOS to use.
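
Any of these options can also be supplied on the sbatch command line instead of (or in addition to) the #SBATCH directives in the script; options given on the command line take precedence. A hypothetical example, assuming the partition and QOS names above exist on your cluster:

[user@login0 ~]$ sbatch --partition=amd_2021 --qos=preempt_short --mail-type=END myscript.sh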

Submission for Serial Jobs

SLURM uses pre-processed shell scripts to submit jobs. SLURM provides predefined variables to help integrate your process with the scheduler and the job dispatcher. It is likely that you will need to pass options to SLURM to retrieve statistical information, set job specifications, redirect your I/O, change your working directory, and possibly be notified of job failure or completion. To set these options, you can pass arguments to sbatch or embed them in your submit file.

A simple job script for SLURM would look like this:

#!/bin/bash
date

It is a simple script that calls date on whatever machine SLURM decides to run your job on. Let’s have a look at another submit file that does the same thing:

#!/bin/bash
#SBATCH --job-name=get_date
#SBATCH --time=00:30:00

date

An overview of the options (following the character sequence “#SBATCH”) is as follows:

job-name=get_date: Sets the job name shown in the queue (in this example, the job is named "get_date"). You can set this to a job name of your choice.

time=00:30:00: Tells the scheduler that this job should run for at most 30 minutes.

* Note: If a time is not specified, a default runtime of 1 hour (01:00:00) will be applied to the job.

These options should be sufficient for the most basic serial jobs.

With this file, we have given the job a name and a time limit. By default, SLURM executes the job from the directory in which it was submitted and, since no --output is specified, writes output to slurm-<job_id>.out in that directory. Let us call this script date.sh and submit the job to the queue:

[user@login0 ~]$ sbatch date.sh
Submitted batch job 40638

Let’s now check the status of our job:

[user@login0 ~]$ squeue -u user
             JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
             40638     circe get_date    user   PD       0:00      1 (None)

You can see job 40638 (as an example) listed in the output. It is in the state PD (pending), which means it is waiting to be dispatched at the next scheduler iteration or when sufficient resources become available.
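
Once the job has been dispatched, its state changes to R (running), and when it completes it disappears from the squeue output. Because we did not specify --output, the result of date is written to slurm-40638.out in the submission directory; a quick check might look like this:

[user@login0 ~]$ cat slurm-40638.out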

Submission for Parallel Jobs

Because many of the applications available for science and engineering are highly resource intensive, many of them have been "parallelized" in order to run on multiple pieces of hardware and divide the work load. Most of these applications have standardized on the MPI or MPI-2 specification. Since many of you will want to run your applications in parallel to take advantage of performance gains, you will need to know how to create a job script for submitting such an application. SLURM integrates with MPI libraries so that your job can be distributed across the cluster.

Rather than explain everything all at once, here is a submit script for a parallel job:

#!/bin/bash
#SBATCH --job-name=parallel-job
#SBATCH --time=08:00:00
#SBATCH --output=output.%j
#SBATCH --ntasks=4

mpirun parallel-executable


Most of the submit directives (remember, those lines starting with "#SBATCH"?) should already be familiar to you, but notice that we have added a few new directives:

ntasks=4: Specifies the number of tasks (MPI processes) to launch for your job; by default, each task is allocated one CPU.

output=output.%j: Specifies the file that the job writes all of its output to; %j is replaced by the SLURM job ID.

Following the same steps described above for serial jobs, we can submit our parallel job script to the cluster with sbatch and check its execution status with squeue.
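
If you need finer control over how those tasks are placed, SLURM also accepts --nodes and --ntasks-per-node. The following sketch requests the same four tasks as above but spreads them over two nodes with two tasks each (the job name, time limit, and executable name are placeholders):

#!/bin/bash
#SBATCH --job-name=parallel-job
#SBATCH --time=08:00:00
#SBATCH --output=output.%j
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2

mpirun parallel-executable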

Interactive jobs

Interactive jobs can be run via the srun command, which accepts many of the same options that are available in submit scripts. For more information, please see the SLURM Interactive documentation.
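
As a minimal sketch, an interactive shell on a compute node might be requested like this (the task count and time limit are only examples):

[user@login0 ~]$ srun --ntasks=1 --time=01:00:00 --pty /bin/bash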

Available Environment Variables

The following environment variables are defined by SLURM at run time. They may be referenced in your submit scripts to add functionality to your code:

  • $SLURM_JOB_ID: The job number assigned by the scheduler to your job
  • $SLURM_JOB_USER: The username of the person running the job
  • $SLURM_JOB_NAME: The job name specified by the "--job-name" option
  • $SLURM_JOB_NODELIST: The list of nodes allocated to the job
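
As a sketch of how these variables might be used, the following submit script simply echoes them into the job's output file (the job name and time limit are arbitrary):

#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --time=00:05:00

# These variables are set by SLURM when the job starts.
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) submitted by $SLURM_JOB_USER"
echo "Running on node(s): $SLURM_JOB_NODELIST"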

For more information, please reference the man page for sbatch.