MPI Runtime Settings

The computational performance and scalability of MPI applications on Mistral can be considerably improved by an optimal choice of the runtime parameters provided by MPI libraries. Here, we present recommendations for MPI environment settings that proved to be beneficial for different model codes commonly used at DKRZ.

Modern MPI library implementations provide a large number of user-configurable parameters and algorithms for performance tuning. Although the vendor initially configures the MPI libraries to match the characteristics of the cluster, the performance of a specific application can often be improved further, by up to 15%, through an optimal choice of tunable parameters.

Since tuning options are specific to an MPI library, the recommendations for MPI runtime settings below are just a starting point for each version.

OpenMPI based MPI libraries

  • OpenMPI 2.0.0 and later

As a minimal environment setting, we recommend the following to make use of the Mellanox HPC-X toolkit. This is just a starting point; users will have to tune the environment depending on the application used.

export OMPI_MCA_pml=cm         # sets the point-to-point management layer
export OMPI_MCA_mtl=mxm        # sets the matching transport layer (MPI-2 one-sided comm.)
export MXM_RDMA_PORTS=mlx5_0:1

# enable HCOLL based collectives
export OMPI_MCA_coll=^fca # disable FCA for collective MPI routines
export OMPI_MCA_coll_hcoll_enable=1 # enable HCOLL for collective MPI routines
export OMPI_MCA_coll_hcoll_priority=95
export OMPI_MCA_coll_hcoll_np=8 # use HCOLL for all communicators with more than 8 tasks
export HCOLL_MAIN_IB=mlx5_0:1

# disable specific HCOLL functions (strongly depends on the application)

The ompi_info tool can be used to get detailed information about the Open MPI installation and its local configuration:

ompi_info --all

  • BullxMPI

In general, it is advisable to use the Bullx MPI installation with MXM (Mellanox Messaging) support to accelerate the underlying send/receive (or put/get) messages. The following variables have to be set:

export OMPI_MCA_pml=cm         # sets the point-to-point management layer
export OMPI_MCA_mtl=mxm        # sets the matching transport layer (MPI-2 one-sided comm.)
export MXM_RDMA_PORTS=mlx5_0:1

Alternatively, the default OpenMPI behavior can be specified using:

export OMPI_MCA_pml=ob1
export OMPI_MCA_mtl=^mxm
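
Because both variants are selected purely through environment variables, it can help to record in the job log which settings were actually in effect. The following is a minimal sketch, assuming a POSIX shell such as bash; the helper name print_mpi_env is hypothetical and not part of Bullx MPI or Open MPI:

```shell
# Hypothetical helper: list all MPI-related tuning variables that are
# currently exported, sorted by name, so that they end up in the job log.
print_mpi_env() {
    env | grep -E '^(OMPI_MCA_|MXM_|HCOLL_)' | sort
}

# Example: select the default ob1 path as described above, then log it.
export OMPI_MCA_pml=ob1
export OMPI_MCA_mtl=^mxm
print_mpi_env
```

Calling the helper right before the application is launched documents the settings of each run, which makes it easier to compare timings between tuning experiments.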

When Mistral was installed, it was recommended to accelerate MPI collective operations with the Mellanox FCA (Fabric Collective Accelerator) tools. This is no longer possible! If you are using old job scripts and find your jobs aborting due to FCA/FMM errors, please deactivate FCA usage with the following variable:

export OMPI_MCA_coll=^ghc,fca         # disable Bull's GHC and Mellanox FCA tools for collectives

2019-02-19: FCA is no longer supported and the centralized fcamanager is not available on Mistral. We recommend switching to OpenMPI instead.

You will find the Bullx MPI documentation by BULL/Atos in the section Manuals.

Intel MPI 2017 and later

A good starting point for MPI-based tuning is the following setting, which enforces shared memory for intranode communication and DAPL for internode communication:

export I_MPI_FABRICS=shm:dapl
export I_MPI_FALLBACK=disable
export I_MPI_SLURM_EXT=0
export I_MPI_LARGE_SCALE_THRESHOLD=8192 # must be larger than the number of MPI tasks used!
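
The comment on the last line can be enforced automatically in a batch script. The following is a sketch, assuming that Slurm provides SLURM_NTASKS in the job environment (it does so when the task count is specified for the job); the helper name large_scale_threshold and the fallback of 1 task outside a job are illustrative choices, not part of Intel MPI:

```shell
# Hypothetical helper: print a threshold that is larger than the task count.
# $1 = number of MPI tasks, $2 = preferred minimum (default 8192, as above).
large_scale_threshold() {
    ntasks=$1
    minimum=${2:-8192}
    if [ "$ntasks" -ge "$minimum" ]; then
        # The minimum is not large enough: use the task count plus one.
        echo $(( ntasks + 1 ))
    else
        echo "$minimum"
    fi
}

# SLURM_NTASKS is assumed to be set by Slurm; fall back to 1 outside a job.
export I_MPI_LARGE_SCALE_THRESHOLD=$(large_scale_threshold "${SLURM_NTASKS:-1}")
```

With this in place, the value stays at 8192 for small and medium jobs and only grows once a job uses 8192 tasks or more.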

To further tune the MPI library usage, one might enable collection of MPI call statistics, e.g.

export I_MPI_STATS=20

and analyse the results afterwards with respect to the MPI functions used (see lightweight MPI analysis). One might, for example, switch the algorithm used for MPI_Alltoallv via



All MPIs

An unlimited stack size might have a negative influence on performance; it is better to set only the amount actually needed, e.g.

ulimit -s 102400       # using bash
limit stacksize 102400 # using csh

It is also recommended to disable core file generation if it is not needed for debugging purposes.

ulimit -c 0    # using bash
limit core 0 # using csh

In batch jobs, you will also have to propagate the modified settings from the job head node to all other compute nodes when invoking srun, i.e.

srun --propagate=STACK,CORE [any other options]
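
Whether the limits actually reached the compute nodes can be checked from within the job. The following is a minimal sketch, assuming bash and a running Slurm allocation; the srun step is skipped when srun is not available, and the limit values are the ones used above:

```shell
# Set the limits in the batch shell (values taken from the text above).
ulimit -s 102400   # stack size, in units of 1024 bytes under bash
ulimit -c 0        # no core files

# Print the limits seen by each task; with --propagate they should match
# the values set above.
if command -v srun >/dev/null 2>&1; then
    srun --propagate=STACK,CORE bash -c \
        'echo "$(hostname): stack=$(ulimit -s) core=$(ulimit -c)"'
fi
```

If a node reports different values, the propagation did not take effect for that job step and the srun options should be checked.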