
Known Issues

Occasionally failing MPI jobs

Some versions of MPI on Mistral might cause jobs to fail with one of the following symptoms:

  • jobs seem to lock up abruptly, producing no further output
  • jobs run into the SLURM wallclock limit, although the output suggests that the program reached a normal end
  • error messages look like one of the following
    • UCM connect: REQ RETRIES EXHAUSTED
    • The InfiniBand retry count between two MPI processes has been exceeded

These failures were analysed in depth by DKRZ and a team of experts at Atos/BULL R&D. We finally installed updated versions of both IntelMPI and Bullx/OpenMPI to solve these issues.

You might need to recompile and/or modify your existing code. Please have a look at the details here: MPI lock up problem.
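
As a rough sketch, switching to one of the updated MPI installations might look like the following; the module names and version numbers are placeholders only, so check module avail for the versions actually provided on Mistral:

# list the MPI modules currently installed (names below are examples only)
module avail intelmpi openmpi

# load an updated MPI module and rebuild your application
module load intelmpi/2017.3.196    # placeholder version
mpiicc -O2 -o my_app my_app.c      # or re-run your usual build system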

 

srun error when a job is submitted from within another job

If you see MPI jobs failing to start with an error message complaining about wrong resource usage, e.g.

srun: error: Unable to create job step: More processors requested than permitted
srun: Warning: can't run 1 processes on 8 nodes, setting nnodes to 1

you most probably submitted the job from within another job.

This behaviour is a SLURM-specific feature: the current environment is interpreted by the sbatch command. Be aware that any environment variables already set in sbatch's environment will override any options set in a batch script, and command-line options will override any environment variables. Therefore, clear any SLURM-related environment variables before calling sbatch if you do not want SLURM to interpret them. You might use a short alias to do the job:

alias clearSLURMenv='for i in `env | grep -E "SLURM_|SBATCH_" | cut -d= -f1`; do unset $i; done'
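
For example, a job script that submits a follow-up job could run the equivalent loop directly before calling sbatch (aliases are usually not expanded in non-interactive scripts); the script name chain_job.sh below is just an illustration:

#!/bin/bash
#SBATCH --job-name=step1
#SBATCH --nodes=1

# ... actual work of this job step ...

# remove SLURM_/SBATCH_ variables inherited from the current job so that
# sbatch does not reuse this job's resource settings for the next one
for i in $(env | grep -E "SLURM_|SBATCH_" | cut -d= -f1); do unset $i; done

# submit the follow-up job with a clean environment (chain_job.sh is a placeholder name)
sbatch chain_job.sh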

Furthermore, DKRZ modified its SLURM installation such that the sbatch command accepts the option '--purge-slurm-env' to automatically submit a new job with a clean environment, i.e. you might use

sbatch --purge-slurm-env [further options]

 

Fatal error in MPI_Init: Other MPI error; using IntelMPI/2017 and srun --distribution=cyclic

When using IntelMPI 2017.* in combination with SLURM's srun option -m/--distribution to specify a cyclic internode distribution (i.e. tasks assigned to nodes in a round-robin fashion), the batch job fails in MPI_Init with the following error message:

Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(799).................: fail failed
MPID_Init(1769).......................: channel initialization failed
MPIDI_CH3_Init(126)...................: fail failed
MPID_nem_init_ckpt(836)...............: fail failed
MPIDI_CH3I_Seg_commit(433)............: fail failed
MPIU_SHMW_Hnd_deserialize(323)........: fail failed
MPIU_SHMW_Seg_open(921)...............: fail failed
MPIU_SHMW_Seg_create_attach_templ(641): open failed - No such file or directory

The problem has been reported to Intel. It is caused by the fact that IntelMPI 2017 versions by default use an optimized startup algorithm when SLURM is used; unfortunately, this algorithm is not well defined for a cyclic internode layout of tasks.

As a workaround, we recommend not specifying the distribution in srun explicitly (i.e. using the default block internode distribution) or setting the environment variable I_MPI_SLURM_EXT=0 to disable the optimized startup algorithm.
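
A minimal sketch of the second workaround in a batch script; the partition name, task counts and executable name are placeholders:

#!/bin/bash
#SBATCH --partition=compute        # placeholder partition name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24       # placeholder task count

# disable the optimized IntelMPI startup algorithm that conflicts with
# srun --distribution=cyclic
export I_MPI_SLURM_EXT=0

srun -l --distribution=cyclic ./my_mpi_app    # ./my_mpi_app is a placeholder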

 

MATLAB fails to create the default preferences folder $HOME/.matlab/R2015a

Under certain circumstances MATLAB cannot be started on Mistral due to the following error:

Warning: failed to create preference directory ${HOME}/.matlab/R2015a.
Check directory permissions.MATLAB Shutdown : Failed To Create Preference Directory.

Creating the preferences folder ${HOME}/.matlab/R2015a manually and checking the access permissions does not resolve the problem. In this case, program startup is terminated with the following error:

MATLAB:settings:core:NoWritePermissionOnDirectory Internal Error: No write permission on directory $HOME/.matlab/R2015a/temp0x2590x3dc9. Details: fl:filesystem:AccessDenied..
Caught "std::exception" Exception message is:
FatalException(unknown)
Caught MathWorks::System::FatalException
>> Exception in thread "AWT-EventQueue-0" java.lang.RuntimeException: com.mathworks.services.settings.SettingValidationException: IMsgIDException for "matlab.workingfolder.LastFolderPath": Internal Error: No write permission on directory /home/dkrz/k202072/.matlab/R2015a/temp0x27e0x3dc9. Details: fl:filesystem:AccessDenied.
    at com.mathworks.mlwidgets.prefs.InitialWorkingFolder.setStringSettingValue(InitialWorkingFolder.java:52)
    at com.mathworks.mlwidgets.prefs.InitialWorkingFolder.access$000(InitialWorkingFolder.java:15)
    at com.mathworks.mlwidgets.prefs.InitialWorkingFolder$1.actionPerformed(InitialWorkingFolder.java:34)
    at com.mathworks.jmi.MatlabPath$DeferredActionEvent.dispatch(MatlabPath.java:151)
    at com.mathworks.util.QueueEvent$QueueTarget.processEvent(QueueEvent.java:89)
    at java.awt.Component.dispatchEventImpl(Unknown Source)
    at java.awt.Component.dispatchEvent(Unknown Source)
    at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
    at java.awt.EventQueue.access$200(Unknown Source)
    at java.awt.EventQueue$3.run(Unknown Source)
        ...
    ... 24 more

The described situation occurs when you try to start MATLAB for the first time not in your HOME directory but, for example, in WORK or SCRATCH. This malfunction could be traced back to an unreliable implementation of the preferences-folder creation in the Linux version of MATLAB. To resolve the issue, we recommend starting MATLAB in your HOME directory and (if necessary) changing the working directory afterwards.
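
A minimal sketch of this workaround; the module name and the work path are examples, not necessarily the ones valid for your project:

# change to HOME before the first MATLAB start so that the preferences folder
# $HOME/.matlab/<release> can be created reliably
cd $HOME
module load matlab    # module name/version is an example
matlab -nodisplay -r "cd('/work/<project>/<user>'); myscript; exit"    # myscript is a placeholder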

Another annoying problem is the relatively slow startup and long response times of MATLAB on Mistral. This can essentially be explained by the following reasons:

  • MATLAB and its numerous toolboxes are installed on the Lustre parallel file system. This results in a slower start of the program compared to an installation on a local file system
  • Using ssh with X11 forwarding usually slows down startup and response times because many small packets are sent over the network
  • Overloaded mistralpp nodes (i.e. high CPU, I/O and memory utilization caused by user processes) increase waiting times for resources. The machine load can be checked with the top command.


A recipe describing how to improve the interactive performance of MATLAB using a VNC connection to the Mistral GPU nodes can be found in our FAQ section.
