Lustre Crashes - Status Quo [resolved]

Jul 03, 2019

Dear users,

UPDATE: The Lustre issues described below were resolved during the maintenance on 7/8/2019.

As you have all noticed, both Lustre filesystems on Mistral have not been running stably since the upgrade. Together with the vendor, Cray, we are working hard to fix this issue. We would like to answer some of the most frequently asked questions:

Q: What is causing the file system crashes?

A: Although we still do not know the exact root cause of the filesystem crashes, there are indications that it is related to massive numbers of attribute-change requests, e.g. repeatedly running a chgrp command on a single file.
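
For users whose scripts set group ownership in a loop, one way to avoid issuing repeated attribute changes on the same file is to check the current group first and skip the call when nothing would change. The following is only a minimal sketch and not a workaround confirmed by DKRZ or Cray; the helper name chgrp_if_needed is hypothetical, and it assumes that a read-only stat is preferable to an unconditional chgrp.

```python
import os


def chgrp_if_needed(path: str, gid: int) -> bool:
    """Set the group of `path` to `gid` only if it currently differs.

    Hypothetical helper for illustration. Returns True if a change
    was issued and False if the call was skipped.
    """
    current_gid = os.stat(path).st_gid  # read-only metadata lookup
    if current_gid == gid:
        return False  # group already correct, no attribute change is requested
    os.chown(path, -1, gid)  # -1 leaves the file owner unchanged
    return True
```

Calling chgrp_if_needed(path, gid) a second time with the same arguments returns False and does not modify the file.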


Q: When will the system be stable again?

A: We will try to reproduce the error on our test system with a routine provided by Cray. We also expect a patch to arrive today, which we will test on the test system. Depending on the outcome of these tests, we will define the next steps.


Q: Why did you upgrade a stable filesystem?

A: The upgrade was necessary to retain support for the filesystem until the end of Mistral's lifetime, presumably in 2021.


Q: Did you test the software?

A: Before the production system was upgraded, the new software was tested thoroughly on our test systems for weeks without showing any instabilities. Unfortunately, no available tests cover all use cases, and the test system is much smaller than the production system.


We are very sorry for the inconvenience this has caused and are grateful for your patience.

Best regards,
DKRZ
