
Using srun

Overview

With the SLURM srun command users can spawn any kind of application, process or task inside a job allocation, or directly start executing a parallel job (and thereby implicitly ask SLURM to create the appropriate allocation). This can be a shell command, any single- or multi-threaded executable in binary or script format, an MPI application, or a hybrid application combining MPI and OpenMP. When no allocation options are given on the srun command line, the options from sbatch or salloc are inherited. Srun should preferably be used either

  • inside a job script submitted by sbatch
  • or after calling salloc

The allocation options of srun for the individual job steps are (almost) the same as those of sbatch and salloc described in the SLURM introduction. The example command below spawns 48 tasks on 2 nodes (24 tasks per node) for 30 minutes:

$ srun -N 2 -n 48 -t 30 -A xz0123 ./my_small_test_job

You need to specify the project account to be charged for this job in the same manner as for salloc and sbatch.
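
For illustration, the same kind of call can be placed inside a batch script or issued within an interactive allocation. The following sketch reuses the project account xz0123 and the placeholder executable ./my_small_test_job from the example above:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:30:00
#SBATCH --account=xz0123

# allocation options are inherited from sbatch, so no further
# options are needed on the srun command line
srun ./my_small_test_job

Used interactively, the allocation is created first with salloc and job steps are then launched inside it:

$ salloc -N 2 -n 48 -t 30 -A xz0123
$ srun ./my_small_test_job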

 

Process and Thread Binding

MPI jobs

Mapping of processes (i.e. placement of ranks on nodes and sockets) can be specified via the srun option --distribution (or -m). The syntax is as follows:

srun --distribution=<block|cyclic|arbitrary|plane=<options>[:block|cyclic]>

The first argument (before the ":") controls the distribution of ranks across nodes, i.e. whether successive ranks are placed in blocks on one node after the other or cycled over the nodes. The second (optional) distribution specification (after the ":") controls the distribution of ranks across sockets within a node, i.e. whether successive ranks fill one socket after the other or are cycled over the sockets.
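
For example, the following call (a sketch reusing the placeholder executable ./myapp from below) places the 48 ranks round-robin across the two nodes instead of filling one node after the other:

srun -N 2 -n 48 --distribution=cyclic ./myapp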

Process/task binding to cores and CPUs can be done via the srun option --cpu_bind. The syntax is:

srun --cpu_bind=[{quiet,verbose},]type
  • To bind tasks to physical cores replace type by cores
  • To bind tasks to logical CPUs / Hyper-Threads replace type by threads
  • For custom bindings use map_cpu:<list>, where <list> is a comma-separated list of CPU IDs (0,1,2,...,23); see the sketch after this list
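
As an illustration only, the following sketch binds four tasks to explicitly chosen CPU IDs (the IDs and the placeholder executable ./myapp are arbitrary):

srun -n 4 --cpu_bind=verbose,map_cpu:0,6,12,18 ./myapp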

For details please take a look at the man page of the srun command or contact DKRZ user’s consultancy.
In most cases use

srun --cpu_bind=verbose,cores --distribution=block:cyclic ./myapp

if you do not want to use Hyper-Threads and

srun --cpu_bind=verbose,threads --distribution=block:cyclic ./myapp

if you intend to use Hyper-Threads. You might also benefit from task distributions other than block:cyclic.
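
For example, a purely block-wise placement (again a sketch with the placeholder ./myapp) keeps consecutive ranks together on the same socket until it is filled:

srun --cpu_bind=verbose,cores --distribution=block:block ./myapp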

 

OpenMP jobs

Thread binding is accomplished via the Intel runtime library using the KMP_AFFINITY environment variable. The syntax is

export KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]

with

  • modifier
    – verbose: giving detailed output on how binding was done
    – granularity=core: reserve full physical cores (i.e. two logical CPUs) to run threads on
    – granularity=thread/fine: reserve logical CPUs / Hyper-Threads to run threads
  • type
    – compact: places the threads as close to each other as possible
    – scatter: distributes the threads as evenly as possible across the entire allocation
  • permute controls which levels are most significant when sorting the machine topology map, i.e. 0=CPUs (default), 1=cores, 2=sockets/LLC
  • offset indicates the starting position for thread assignment

For details please take a look at the Intel manuals or contact DKRZ user’s consultancy. In most cases use

export KMP_AFFINITY=granularity=core,compact,1 

if you do not want to use Hyper-Threads and

export KMP_AFFINITY=granularity=thread,compact,1

if you intend to use Hyper-Threads. You might also try scatter instead of compact placement to benefit from a larger aggregate L3 cache.
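
Putting this together, a purely OpenMP job step on a single node could look like the following sketch. The node size of 24 physical cores (48 Hyper-Threads) and the executable name ./my_openmp_app are assumptions chosen for illustration; note that --cpus-per-task counts logical CPUs (Hyper-Threads):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --account=xz0123

# one task spanning the full node: 24 physical cores = 48 logical CPUs
export OMP_NUM_THREADS=24
export KMP_AFFINITY=granularity=core,compact,1

srun --ntasks=1 --cpus-per-task=48 --cpu_bind=verbose,cores ./my_openmp_app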

Hybrid MPI/OpenMP jobs

In this case you need to combine the two binding methods mentioned above. Keep in mind that we are using --threads-per-core=2 throughout the cluster. Hence you need to specify the number of CPUs per process/task on the basis of Hyper-Threads, even if you do not intend to use Hyper-Threads! Examples on how to achieve correct binding using a full node can be found in the HLRE-3 MISTRAL User's Manual.
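
As an illustration only (not taken from the manual), a hybrid job step filling one node with 4 MPI tasks and 6 OpenMP threads per task might be sketched as follows; the node size of 24 physical cores (48 Hyper-Threads) and the executable name ./my_hybrid_app are assumptions:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00
#SBATCH --account=xz0123

# 6 physical cores per task, but --cpus-per-task has to be given
# in Hyper-Threads because of --threads-per-core=2
export OMP_NUM_THREADS=6
export KMP_AFFINITY=granularity=core,compact,1

srun -l --cpus-per-task=12 --cpu_bind=verbose,cores --distribution=block:cyclic ./my_hybrid_app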

Multiple Program Multiple Data

SLURM supports the MPMD (Multiple Program Multiple Data) execution model that can be used for MPI applications, where multiple executables share one common MPI_COMM_WORLD communicator. In order to use MPMD the user has to set the srun option --multi-prog <filename>. This option expects a configuration text file as an argument, in contrast to the SPMD (Single Program Multiple Data) case where srun has to be given the executable.

Each line of the configuration file contains two or three fields separated by spaces; the format is

<list of task ranks> <executable> [<input arguments>]

The first field defines a comma-separated list of ranks for the MPI tasks that will be spawned; possible values are single integers or ranges of numbers. The second field is the path/name of the executable. The third field is optional and specifies the arguments of the program.
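
A minimal configuration file could look like the following sketch; the rank ranges and executable names are placeholders chosen for illustration:

# mpmd.conf: ranks 0-95 run the ocean model, ranks 96-191 the atmosphere model
0-95    ./ocean_model
96-191  ./atmos_model --verbose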

The following example provides a job script frame for the execution of a coupled atmosphere-ocean model using 8 nodes:

#!/bin/bash

#SBATCH --nodes=8
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:30:00
#SBATCH --exclusive
#SBATCH --account=xz0123

# Domain decomposition for atmosphere model
ECHAM_NPROCA=6
ECHAM_NPROCB=16

# Domain decomposition for ocean model
MPIOM_NPROCX=12
MPIOM_NPROCY=8

# Paths to executables
ECHAM_EXECUTABLE=../bin/echam6
MPIOM_EXECUTABLE=../bin/mpiom.x

# Derived values useful for creation of MPMD configuration file
(( ECHAM_NCPU = ECHAM_NPROCA * ECHAM_NPROCB ))
(( MPIOM_NCPU = MPIOM_NPROCX * MPIOM_NPROCY ))
(( NCPU = ECHAM_NCPU + MPIOM_NCPU ))
(( MPIOM_LAST_CPU = MPIOM_NCPU - 1 ))
(( ECHAM_LAST_CPU = NCPU - 1 ))


# Create MPMD configuration file
cat > mpmd.conf <<EOF
0-${MPIOM_LAST_CPU} $MPIOM_EXECUTABLE
${MPIOM_NCPU}-${ECHAM_LAST_CPU} $ECHAM_EXECUTABLE
EOF

# Run MPMD parallel program using Intel MPI
module load intelmpi
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun -l --cpu_bind=verbose,cores --multi-prog mpmd.conf

 

Frequency Scaling

The Intel Haswell and Broadwell processors allow for CPU frequency scaling, which in general enables the operating system to scale the CPU frequency up or down in order to save power. CPU frequencies can be scaled automatically depending on the system load or manually by userspace programs. This is done via power schemes for the CPU, so-called governors, of which only one may be active at a time. The default governor is "ondemand", which allows the operating system to scale down the CPU frequency on the compute nodes to 1.2 GHz if they are in idle state. The user can set the governor to "userspace" and specify a fixed CPU frequency instead. For this purpose, the batch job needs to define the desired behaviour via the environment variable SLURM_CPU_FREQ_REQ or via the srun option --cpu-freq.

To set the CPU frequency to the nominal value (2.5 GHz for Haswell in the compute partition and 2.1 GHz for Broadwell in the compute2 partition) use:

export SLURM_CPU_FREQ_REQ=HighM1

You might also request a different frequency (that must be specified in kHz, e.g. 2100000) or enable automatic frequency scaling depending on the workload by setting

export SLURM_CPU_FREQ_REQ=ondemand
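
Equivalently, the frequency request can be passed directly to srun via the --cpu-freq option, for example (sketches using the placeholder ./myapp):

srun --cpu-freq=2100000 ./myapp
srun --cpu-freq=ondemand ./myapp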

On the DKRZ HPC system mistral we are using SLURM plugins to configure all cores to run at the fixed nominal frequency of the chosen CPU type (Haswell or Broadwell) if you are using srun to execute a parallel job. Therefore, you normally do not need to set the frequency explicitly.
