
Adapting job-scripts for mistral phase2

A summary of the major differences between the compute and compute2 partitions of mistral, showing how to adapt job scripts accordingly.

Since the phase 1 and phase 2 nodes of mistral are equipped with different Intel CPUs, you will have to slightly adapt your existing job scripts in order to use both partitions. The following table gives an overview of the differences and which partitions are affected.

phase | partitions                            | CPU                                         | cores per node | processor frequency
1     | compute, prepost, shared, gpu, miklip | Xeon E5-2680 v3 processor (Haswell - HSW)   | 24             | 2.5 GHz
2     | compute2                              | Xeon E5-2695 v4 processor (Broadwell - BDW) | 36             | 2.1 GHz

As the table indicates, only two issues arise if batch scripts are to be usable for both phases:

  • different number of cores per node
  • different processor frequency

Setting the right CPU frequency for each partition

SLURM allows you to request that the job step initiated by the srun command is run at the requested frequency (if possible) on the CPUs selected for that step on the compute node(s). This can be done via one of the following, as sketched below:

  • srun option --cpu-freq
  • environment variable SLURM_CPU_FREQ_REQ
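
For illustration, both ways of requesting a frequency are sketched below for the nominal Broadwell frequency of 2.1 GHz (values are given in kHz, as in the warning messages further down); the program name ./myprog is a placeholder.

# Variant 1: request the frequency per job step via the srun option
srun --cpu-freq=2100000 ./myprog

# Variant 2: request the frequency for subsequent job steps via the environment variable
export SLURM_CPU_FREQ_REQ=2100000
srun ./myprog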

If neither of these options is set, DKRZ SLURM automatically chooses the appropriate frequency for the underlying processor. We therefore recommend not setting the frequency explicitly.

If a wrong frequency is defined via the environment variable (e.g. setting SLURM_CPU_FREQ_REQ=2500000 for the BDW nodes in the compute2 partition), a warning message like the following is written to stdout:

[DKRZ-slurm WARNING] CPU-frequency chosen (2500000) not supported on partition compute2 - frequency will be set to nominal instead!

If you specify a wrong frequency via the srun option --cpu-freq, a warning message is also written to stdout, but this time the automatic frequency adaptation falls back to the minimum frequency:

[DKRZ-slurm WARNING] CPU-frequency chosen (2501000) not supported on partition compute2 - frequency will fall back to minimum instead!

Setting the right number of cores

When allocating nodes using the sbatch or salloc command, one has to specify the targeted partition and therefore the type of CPU directly. Nevertheless, job scripts that were originally written to run on the 24-core Intel Haswell nodes (i.e. in the compute partition) will in general also run in the compute2 partition, but they will not make use of the full node!

The most critical sbatch/srun option in this context is --ntasks-per-node. Setting a value of 24, for example, is appropriate for Haswell nodes but uses only two thirds of the cores on Broadwell nodes. Hence, you should pay particular attention to this option when adapting your batch scripts for the compute2 partition.
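
For illustration, the following fragments show the per-node task counts that fully populate a node on each partition when running one task per physical core; they are not complete job scripts.

# Haswell nodes (compute partition): 24 physical cores per node
#SBATCH --partition=compute
#SBATCH --ntasks-per-node=24

# Broadwell nodes (compute2 partition): 36 physical cores per node
#SBATCH --partition=compute2
#SBATCH --ntasks-per-node=36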

Writing more flexible batch scripts that are able to run on both kinds of CPU requires avoiding sbatch options that prescribe the number of tasks per entity, i.e. you should not use a prescribed number of nodes or one of

  • --ntasks-per-core=<ntasks>
  • --ntasks-per-socket=<ntasks>
  • --ntasks-per-node=<ntasks>
  • --tasks-per-node=<n>

Instead, define the total number of tasks that your MPI parallel program will use with srun by specifying

-n <number> or --ntasks=<number>

in combination with the number of logical CPUs needed per task, e.g. --cpus-per-task=2 for a pure MPI parallel program not using HyperThreading.

The following batch script will run an MPI job without HyperThreading in either the compute or the compute2 partition, depending only on the #SBATCH --partition choice.

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=compute # or compute2 for BDW nodes
#SBATCH --ntasks=72
#SBATCH --cpus-per-task=2
#SBATCH --time=00:30:00
#SBATCH --mail-type=FAIL
#SBATCH --account=xz0123
#SBATCH --output=my_job.o%j
#SBATCH --error=my_job.e%j
srun -l --propagate=STACK --cpu_bind=cores ./myprog
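
Assuming the script is saved as my_job.sh (a placeholder name), it can be submitted unchanged; the partition can also be overridden on the command line, since sbatch options given there take precedence over #SBATCH directives:

sbatch my_job.sh                       # uses the partition set in the script
sbatch --partition=compute2 my_job.sh  # overrides the #SBATCH --partition line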

When submitted to the compute partition, the job will run on 3 nodes with 24 tasks per node, while in the compute2 partition the same job takes only 2 nodes with 36 tasks per node.
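
The node counts follow from the number of logical CPUs per node (physical cores times two with HyperThreading); a minimal sketch of the arithmetic, using the core counts from the table above:

# 72 tasks x 2 logical CPUs per task = 144 logical CPUs in total
ntasks=72
cpus_per_task=2
for cores in 24 36; do                      # physical cores per HSW / BDW node
  threads=$(( cores * 2 ))                  # logical CPUs with HyperThreading
  nodes=$(( (ntasks * cpus_per_task + threads - 1) / threads ))
  echo "${cores}-core nodes needed: ${nodes}"
done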

Writing job scripts eligible to run on several partitions

The --partition option also accepts a comma-separated list of names. In this case the job will run entirely on the partition offering the earliest start, with no regard given to the partition name ordering, i.e. nodes will not be mixed between the partitions! Be aware that the total number of tasks should be a multiple of both 24 and 36 (i.e. a multiple of 72) in order to fully populate all nodes; otherwise, some nodes might be underpopulated. The example above might therefore be modified to use

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=compute,compute2
#SBATCH --ntasks=72
#SBATCH --cpus-per-task=2
#SBATCH --mem=0
...

which in general will decrease the waiting time of the job in the submit queue, since more nodes are suitable for scheduling the job.

Attention: there are a few facts one needs to be aware of when using job scripts that are eligible for more than one partition:

  • the compute2 partition (Broadwell nodes) delivers slightly lower performance than the compute partition (Haswell nodes) due to the lower CPU frequency. Runtime limits for the job should therefore be adapted accordingly.
  • the number of --cpus-per-task needs to be specified explicitly to enable or disable the use of Intel HyperThreading. By default, --cpus-per-task=1 is chosen, which allows for HyperThreading. Since your code might not benefit from this setting, you should switch to --cpus-per-task=2 instead. For details, please refer to the Intel® Hyper-Threading documentation.
  • the memory resources should be specified with care! The compute and compute2 partitions are equipped with different nodes (24 vs. 36 cores and memory ranging from 64 to 256 GB), which results in very different values for the --mem-per-cpu option, which is computed automatically by SLURM if not specified explicitly. This calculation might be wrong if two partitions are specified! Therefore, one should always use the option --mem=0 so that the job has access to the full memory on each node, independent of the partition that is finally used for the job. A combined sketch of such a job header follows this list.
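
Putting these points together, a multi-partition job header might look like the following sketch; the time limit is a placeholder that should be generous enough for the slower Broadwell nodes, and the account, job name and program name need to be adapted to your own setup.

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=compute,compute2
#SBATCH --ntasks=72              # a multiple of both 24 and 36
#SBATCH --cpus-per-task=2        # no HyperThreading for a pure MPI program
#SBATCH --mem=0                  # use the full memory of each node
#SBATCH --time=00:40:00          # placeholder runtime limit
#SBATCH --account=xz0123
#SBATCH --output=my_job.o%j
#SBATCH --error=my_job.e%j
srun -l --propagate=STACK --cpu_bind=cores ./myprog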
