
SLURM Introduction

This page gives an overview of the user commands provided by SLURM and explains how to use them to run jobs on Mistral. A concise cheat sheet for SLURM can be downloaded here. A comparison to LoadLeveler commands can be found here.

SLURM Commands

SLURM offers a variety of user commands covering all the actions needed for working with jobs. These commands provide a rich interface to allocate resources, query job status, control jobs, manage accounting information and simplify work with some utility commands. For examples of how to use these commands, see the section SLURM Command Examples.

  • sinfo: show information about all partitions and nodes managed by SLURM as well as the general system state. It has a wide variety of filtering, sorting, and formatting options.
  • squeue: query the list of pending and running jobs. By default it reports the list of pending jobs sorted by priority and, separately, the list of running jobs, also sorted by priority. The most relevant job states are running (R), pending (PD), completing (CG), completed (CD) and cancelled (CA). The TIME field shows the actual job execution time. The NODELIST (REASON) field indicates on which nodes the job is running or the reason why the job is pending. Typical reasons for pending jobs are waiting for resources to become available (Resources) and queuing behind a job with higher priority (Priority).
  • sbatch: submit a batch script. The script will be executed on the first node of the allocation. The working directory of the job coincides with the directory from which sbatch was invoked. Within the script one or multiple srun commands can be used to create job steps and execute parallel applications.
  • scancel: cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
  • salloc: request an interactive job/allocation. When the job starts, a shell (or another program specified on the command line) is started on the submission host (login node). From this shell you should use srun to interactively start parallel applications. The allocation is released when the user exits the shell.
  • srun: initiate parallel job steps within a job or start an interactive job.
  • scontrol: (primarily used by administrators) provides some functionality for users to manage jobs or query information about the system configuration, such as nodes, partitions, jobs and configuration settings.
  • sstat: query near-realtime status information related to CPU, task, node, RSS and virtual memory for a running job.
  • sacct: retrieve accounting information about jobs and job steps. For completed jobs, sacct queries the accounting database.
  • sacctmgr: (primarily used by administrators) query information about accounts and other accounting data.
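Taken together, a typical interaction with these commands might look like the following minimal sketch (the script name myjob.sh and the job ID 4711 are placeholders):

$ sbatch myjob.sh        # submit the batch script; sbatch prints the assigned job ID
Submitted batch job 4711
$ squeue -u $USER        # check the state of your pending and running jobs
$ scancel 4711           # cancel the job if it is no longer needed
$ sacct -j 4711          # inspect accounting information after the job has finished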

Allocating Resources with SLURM

A job allocation, which is a set of computing resources (nodes or cores) assigned to a user's request for a specified amount of time, can be created using the SLURM salloc, sbatch or srun commands. The salloc and sbatch commands make resource allocations only. The srun command launches parallel tasks and implicitly creates a resource allocation if not started within an existing one.
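For example, a single srun call outside of an existing allocation will implicitly allocate the requested resources, run the given command as a job step and release the resources again; a minimal sketch (the account xz0123 is a placeholder, as in the examples below):

$ srun --partition=compute --account=xz0123 --nodes=1 --ntasks=1 --time=00:05:00 hostname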

The usual way to allocate resources and execute a job on Mistral is to write a batch script and submit it to SLURM with the sbatch command. The batch script is a shell script consisting of two parts: resource requests and job steps. Resource requests are specifications such as the number of nodes needed to execute the job, the time duration of the job etc. Job steps are the user's tasks that must be executed. The resource requests and other SLURM submission options are prefixed by '#SBATCH' directives and must precede any executable commands in the batch script. For example:

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:30:00

# Begin of section with executable commands
set -e
ls -l
srun ./my_program

The script itself is regarded by SLURM as the first job step and is (serially) executed on the first compute node in the job allocation. To execute parallel (MPI) tasks, users call the SLURM srun command within the script; thereby a new job step is initiated. It is possible to execute parallel programs in the form of job steps in any configuration within the job allocation. This means that a job step can use all allocated resources, or several job steps (created via multiple srun calls) can each use a subset of the allocated resources.
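The following sketch illustrates this: within one allocation of 2 nodes, a first job step uses all allocated tasks and a second job step uses only one task per node (program names are placeholders):

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:30:00

# First job step: uses all 48 allocated tasks
srun ./my_program_a

# Second job step: uses only 2 tasks, one per node
srun --ntasks=2 --ntasks-per-node=1 ./my_program_b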

The following list describes the most common or required allocation and submission options that can be defined in a batch script (long and short forms are given together with their default values):

  • --nodes=<number>, -N <number> (default: 1)
    Number of nodes for the allocation.
  • --ntasks=<number>, -n <number> (default: 1)
    Number of tasks (MPI processes). Can be omitted if --nodes and --ntasks-per-node are given.
  • --ntasks-per-node=<number> (default: 1)
    Number of tasks per node. If this option is omitted, the default value is used, but all 48 CPUs per node are still available to the current allocation (if the node is not shared).
  • --cpus-per-task=<number>, -c <number> (default: 1)
    Number of threads (logical cores) per task. Used for OpenMP or hybrid jobs.
  • --output=<path>/<file pattern>, -o <path>/<file pattern> (default: slurm-%j.out, where %j is the job ID)
    Standard output file.
  • --error=<path>/<file pattern>, -e <path>/<file pattern> (default: slurm-%j.out, where %j is the job ID)
    Standard error file.
  • --time=<walltime>, -t <walltime> (default: partition dependent)
    Requested walltime limit for the job; possible time formats are:
      [hours:]minutes[:seconds], e.g. 20, 01:20, 01:20:30
      days-hours[:minutes][:seconds], e.g. 2-0, 1-5:20, 1-5:20:30
  • --partition=<name>, -p <name> (default: none)
    Partition to run the job in.
  • --constraint=<list>, -C <list> (default: none)
    Node features requested for the job. See the configuration for available features.
  • --mail-user=<email> (default: the email address given when the DKRZ account was requested)
    Email address for notifications.
  • --mail-type=<mode> (default: none)
    Event types for email notifications. Possible values are BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_[90,80,50].
  • --job-name=<jobname>, -J <jobname> (default: job script's name)
    Job name.
  • --account=<project account>, -A <project_account> (default: none)
    Project that should be charged.
  • --exclude=<nodelist>, -x <nodelist> (default: none)
    Exclude the specified nodes from the job allocation.
  • --nodelist=<nodelist>, -w <nodelist> (default: none)
    Request the specified nodes for the job allocation (if necessary, additional nodes will be added to fulfill the requested number of nodes).
  • --requeue or --no-requeue (default: no-requeue)
    Specifies whether the batch job should be requeued after a node failure. Caution: if a job is requeued, the whole batch script is restarted from its beginning!
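As an illustration, a batch script header combining several of these options might look as follows (job name, log file names and email address are placeholders):

#!/bin/bash
#SBATCH --job-name=my_experiment
#SBATCH --partition=compute
#SBATCH --account=xz0123
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=24
#SBATCH --time=08:00:00
#SBATCH --output=my_experiment_%j.log
#SBATCH --error=my_experiment_%j.log
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=user@example.com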

The complete list of parameters can be inquired from the sbatch man page:

$ man sbatch

As already mentioned above, the batch script is submitted using the SLURM sbatch command:

$ sbatch [OPTIONS] <job script>

On success, sbatch writes the job ID to standard output. Options provided on the command line supersede the same options defined in the batch script.
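For example, if the batch script contains '#SBATCH --time=00:30:00', the following call raises the walltime limit to one hour for this particular submission only:

$ sbatch --time=01:00:00 myjob.sh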

NOTE: On Mistral the specification of -A or --account is required to submit a job, otherwise the submission will be rejected. You can query the accounts for which job submission is allowed using the SLURM sacctmgr command:

$ sacctmgr -s show user name=$USER

Furthermore, you have to specify the partition on which the job will run using the -p or --partition option to sbatch. Otherwise the submission will be rejected (note: we will enforce this starting September 1st, 2016 - please be prepared and modify your batch scripts accordingly).

CAUTION: setting SLURM options via environment variables will override any matching options set in a batch script, and command line options will override any matching environment variable.
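For illustration, sbatch honours input environment variables such as SBATCH_PARTITION or SBATCH_ACCOUNT; a small sketch of the resulting precedence (assuming myjob.sh contains a #SBATCH --partition line):

$ export SBATCH_PARTITION=compute2     # overrides the partition requested in the batch script
$ sbatch myjob.sh                      # job is submitted to compute2
$ sbatch --partition=compute myjob.sh  # the command line option in turn overrides the environment variable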

Remember the difference between options for selection, allocation and distribution in SLURM. Selection and allocation work with sbatch, but task distribution and binding should be specified directly with srun (within the batch script). The following four steps give an overview.

1. Resource selection, e.g.

 #SBATCH --nodes=2
 #SBATCH --sockets-per-node=2
 #SBATCH --cores-per-socket=12

which could be satisfied by both partitions compute (Haswell nodes) and compute2 (Broadwell nodes). That is why an explicit declaration of the chosen partition is needed.

2. Resource allocation, e.g.

 #SBATCH --ntasks=12
 #SBATCH --ntasks-per-node=6
 #SBATCH --ntasks-per-socket=3
 #SBATCH --cpus-per-task=8

3. Start the application relying on the sbatch options only; task binding and distribution are specified with srun, e.g.

srun --cpu_bind=cores --distribution=block:cyclic ./my_program

4. Start the application using only a subset of the allocated resources. In this case you need to pass all relevant allocation options to srun again (like --ntasks or --ntasks-per-node), e.g.

srun --ntasks=2 --ntasks-per-node=1 --cpu_bind=cores --distribution=block:cyclic ./my_program

All environment variables set at the time of submission are propagated to the SLURM jobs. With some options of the allocation commands (like --export for sbatch or srun), users can change this default behaviour. Users can load modules and prepare the desired environment before job submission, and this environment will then be passed to the jobs they submit. Of course, it is good practice to include module commands in job scripts, in order to have full control over the environment of the jobs.
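A small sketch of this behaviour (the module name is just an example):

$ module load intelmpi                    # prepare the environment on the login node
$ sbatch myjob.sh                         # by default the current environment is propagated to the job
$ sbatch --export=ALL,MY_VAR=42 myjob.sh  # propagate the environment and additionally set MY_VAR
$ sbatch --export=NONE myjob.sh           # start the job with a clean environment instead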

 

SLURM Command Examples

Query Commands

Normally, jobs will pass through several states during their life cycle. Typical job states from submission until completion are: PENDING (PD), RUNNING (R), COMPLETING (CG) and COMPLETED (CD). Further job state codes are described in the squeue man page. Some examples of SLURM query commands are provided below.

  • List all jobs submitted to SLURM:
    $ squeue
  • List all jobs submitted by you:
    $ squeue -u $USER
  • Check available partitions and nodes:
    $ sinfo

    The sinfo command reports the states of the partitions and the nodes. The partitions may be in state UP, DOWN or INACTIVE. The UP state means that a partition will accept new submissions and the jobs will be scheduled. The DOWN state allows submissions to a partition but the jobs will not be scheduled. The INACTIVE state means that submissions are not allowed. The nodes can also be in various states, such as alloc (allocated), comp (completing), down, idle, maint, resv (reserved) etc. A description of all node states can be obtained from the sinfo man page.

  • List partition state summary
    $ sinfo -s

    The column NODES(A/I/O/T) shows number of nodes in the states "allocated/idle/other/total" for each SLURM partition.

  • Query configuration and limits for one specific partition (here compute):
    $ scontrol show partition compute
  • Check one node (here m10010):
    $ scontrol show node m10010

 

Job control

  • Cancel job with SLURM JobId 4711:
    $ scancel 4711
  • Cancel all your jobs:
    $ scancel -u $USER

    With the additional option -i (interactive mode) SLURM asks for confirmation before canceling the job.

  • Display status information of a running job 4242:
    $ sstat -j 4242
    sstat provides various status information (e.g. CPU time, Virtual Memory (VM) usage, Resident Set Size (RSS), Disk I/O etc.) for running jobs. The metrics of interest can be specified using the option --format or -o (see the next example).
  • Display selected status information of the running job 4242:
    $ sstat -o jobid,avecpu,avepages,maxrss,maxvmsize -j 4242
    
    # For a list of all available metrics use the option --helpformat or look into sstat man page
    $ sstat --helpformat
    $ man sstat
  • Hold pending job with SLURM JobId 5711:
    $ scontrol hold 5711
  • Release job with SLURM JobId 5711:
    $ scontrol release 5711

 

Accounting Commands

  • Check user association (here for user b123456):
    $ sacctmgr show assoc where user=b123456
    or
    $ sacctmgr -s show user name=b123456
  • Check job history for user b123456:
    $ sacct -X -u b123456
  • Check job history (jobid, number of nodes, list of nodes, job state and exit code) for user b123456 in specified time period (January 2015):
    $ sacct -X -u b123456 -o "jobid,nnodes,nodelist,state,exit" -S 2015-01-01 -E 2015-01-31T23:59:59
  • Check memory usage for the completed job with the jobid 12345:
    sacct --duplicates -j 12345 --format=JobID,JobName,MaxRSS,MaxRSSNode,MaxRSSTask,MaxVMSize,MaxVMSizeNode,MaxVMSizeTask

 

Interactive Usage

Interactive sessions can be allocated using the SLURM salloc command. The following command for example will allocate 2 Haswell nodes (in the compute partition) for 30 minutes:

$ salloc --partition=compute --nodes=2 --time=00:30:00 --account xz0123

Once an allocation has been made, the salloc command starts a bash shell on the login node where the submission was done. After a successful allocation, users can execute srun from that shell to interactively spawn their applications. For example:

$ srun --ntasks=4 --ntasks-per-node=2 --cpus-per-task=4 ./my_program

The interactive session is terminated by exiting the shell. In order to run commands directly on the allocated compute nodes, the user has to use ssh to connect to the desired nodes. For example:

$ salloc --partition=compute --nodes=2 --time=00:30:00 -A xz0123
salloc : Granted job allocation 13258

$ squeue -j 13258
JOBID PARTITION NAME USER    ST TIME NODES NODELIST(REASON)
13258 compute   bash x123456 R  0:11 2     m[10001-10002]

$ hostname # we are still on the login node
mlogin100

$ ssh m10001
user@m10001's password:

user@m10001~$ hostname
m10001

user@m10001~$ exit
logout
Connection to m10001 closed.

$ exit
salloc : Relinquishing job allocation 13258
salloc : Job allocation 13258 has been revoked.

 

Requesting specific features for nodes

When asking SLURM to provide a dedicated set of nodes to the user, one has to use the --constraint option. Different constraints can be combined using the AND or OR operator, e.g.

--constraint="m40&512G"

This option is especially useful if your job needs a specific kind of GPU or amount of memory. Please refer to the detailed hardware list to identify which GPUs are available. In general we provide nodes with the memory features 64G, 128G, 256G, 512G and 1024G (the latter two memory options are only available in the gpu partition) and the GPU features k80, m40, m6000 and v100.
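For example, to request nodes providing the 256G memory feature, or alternatively either of two GPU types, a batch script could contain one of the following lines (the feature names are taken from the list above):

#SBATCH --constraint=256G
# or, accepting either of two GPU types:
#SBATCH --constraint="k80|m40"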
