
monitoring.dkrz.de

The SLURM workload manager and the batch jobs running on the DKRZ HPC system are automatically monitored for the resources they use. The web interface monitoring.dkrz.de allows users to view this data.
DKRZ operates the monitoring server in an extended testing phase, which means that we do not guarantee full-time availability of the system and that some further limitations apply, as documented here.

What is monitoring.dkrz.de?

The web server essentially runs a Grafana (https://grafana.com/) instance that serves different dashboards showing statistics about the usage of the DKRZ HPC system.

All DKRZ users can log in to the system using their DKRZ user account:

The main page shows the current load of each MISTRAL login node, where the background color indicates whether the load (number of running processes) is high (red) or not (green). Furthermore, the blue graph shows by default the history of the last 2 days.
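
For a quick cross-check you can also print the load directly on a login node, for example with the standard uptime command:

uptime   # prints the 1-, 5- and 15-minute load averages of the node you are logged in to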

In general, the range of the shown data can be set in the upper right corner to an arbitrary interval or a predefined time horizon like 'Last 24 hours'.

At the bottom of the page, the current number of nodes used per SLURM partition is displayed by a gauge. The more nodes are in use, the redder the coloring. Again, historical data is shown in the background.

 

If you click on one of the current metrics (either login node load or number of used compute nodes), you will get a dashboard with detailed data about this component. In particular, the SLURM queue statistics can be used to identify at which times the HPC system is busy or idle.
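
A rough impression of the current partition usage can also be obtained on the command line; a minimal sketch using standard SLURM tools (the partition name compute is only an example):

sinfo -p compute -o "%P %a %D %T"            # node counts per state in the partition
squeue -p compute --state=PENDING | wc -l    # rough number of pending jobs (includes the header line)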

What data is covered by DKRZ monitoring regarding batch jobs?

In addition to the generic data on how login nodes and SLURM partitions of the HPC system are used, node-based monitoring data is captured for all batch jobs handled by SLURM. This data is tied to your user account and accessible from the main page via the 'jobstats' button (direct link: https://monitoring.dkrz.de/dashboard/script/scripted_jobstats.js).

For the chosen time frame (by default the last 24 hours), you will see a list of all batch jobs that ran under your account. At the top left of the page you can use the drop-down menu to select the job ID for which monitoring data shall be presented. This automatically restricts the second drop-down menu to the nodes allocated for this job. You can select any number of nodes to be shown in the graphs below. For better readability you might only choose the head node of the job allocation and one or two other representative nodes.
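
If you need to look up which job IDs ran in a given period before selecting them in the dashboard, the SLURM accounting database can be queried on the command line; a minimal sketch (the start date is only an example):

sacct -u $USER -S 2020-01-01 -X -o JobID,JobName,Start,End,NodeList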

In all graphs shown, the time range in which the job ran is marked by vertical bars (green: job start, red: job end). Use the zoom function to narrow the time range accordingly.

The following metrics are available (not all are captured by default):

  • CPU usage and frequency
  • memory data
  • Lustre filesystem data
  • InfiniBand network data

Please refer to the next section for detailed information.

How to enable more detailed monitoring for my jobs?

Capturing node-based performance metrics is controlled via a SLURM plugin. To modify the amount and frequency of the collected data, use the sbatch --monitoring option, e.g.

#SBATCH --monitoring=0 # disable collecting data

or

#SBATCH --monitoring=meminfo=10,lustre=5 # capture memory data every 10 sec and lustre data every 5 sec

The syntax is --monitoring=[0 | 1 | <list of metrics>], where <list of metrics> is a comma-separated list of tuples metricname=interval. The interval is given in seconds and determines how often data is captured; for metricname the following metrics are available (a complete job script example is shown after the list):

  • cpu : CPU usage (usr, system, idle, iowait) gathered per socket to e.g. identify non-optimal CPU binding, and CPU frequency per core/HT to identify throttling
  • meminfo : memory usage (available, cached, active), allowing memory leaks in the application to be recognized immediately
  • lustre : read/write bytes per second
  • power : energy consumption (currently only available with high overhead, should not be used in general)
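
A minimal job script sketch combining these settings might look as follows (partition, account and program names are placeholders):

#!/bin/bash
#SBATCH --partition=compute                        # example partition
#SBATCH --account=xz0123                           # placeholder project account
#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --monitoring=cpu=30,meminfo=10,lustre=5    # CPU every 30 sec, memory every 10 sec, Lustre every 5 sec

srun ./my_model                                    # placeholder executable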

In addition, InfiniBand network statistics are always captured (if possible), showing the transmitted and received data as well as transmit waits, which cause the communication to stall.

How to read the graphs?

Overview (aggregated base metrics)

For all nodes selected from the drop-down menu at the top of the monitoring web page, three basic metrics are shown in the first panel row:

Memory Usage shows the used memory per node, Load Avg shows the number of active processes per node, and Network Congestion shows a derived metric that illustrates the percentage of bandwidth lost due to InfiniBand packets that were not sent or received immediately.
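
The exact derivation of the congestion metric is not documented here, but the raw counters such a metric can be based on are exposed by Linux; a minimal sketch assuming the usual sysfs layout and an HCA named mlx5_0 (the device name may differ on MISTRAL nodes):

cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_wait   # ticks with data waiting that could not be sent
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data   # transmitted data (in 32-bit words)
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data    # received data (in 32-bit words)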

CPU usage

For each node selected, this panel shows how the cores of that node are used. Two major states are presented: time in user space (line) and I/O wait (dots). Incorrect binding of processes can therefore be spotted if some cores show a load in user space of 200% while others show 0%. This might happen due to wrong usage of HyperThreads. In the picture below, the application first ran with correct binding and afterwards with incorrect binding, resulting in wasted resources.
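
Such a pattern can often be avoided by requesting an explicit binding in the job script; a minimal sketch using standard SLURM options (the recommended settings depend on the application and partition, and the executable name is a placeholder):

#SBATCH --hint=nomultithread          # use physical cores only, ignore HyperThreads
srun --cpu-bind=cores ./my_model      # pin each task to its allocated cores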

CPU frequency

This metric is sampled at a lower frequency than specified via the sbatch --monitoring option. Its main purpose is to see at a glance whether the used CPU frequency is appropriate for the chosen partition. Since the compute and compute2 partitions are equipped with different CPUs (Intel HSW vs. BDW) running at different clock speeds, wrong settings in SLURM scripts can easily lead to a lower CPU frequency than possible.
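
The requested clock speed can be set explicitly in the job script; a minimal sketch using the standard SLURM option (the chosen value must match the CPUs of the selected partition):

#SBATCH --cpu-freq=HighM1    # request the highest non-turbo frequency

Alternatively, a fixed frequency in kHz can be requested on the srun line, e.g. srun --cpu-freq=2500000 ./my_model.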

Memory data

In contrast to the aggregated memory statistic, these panels show the detailed memory consumption of each node. The used memory is separated into Active, Cached and Inactive (File) memory, which explains the difference between the Free and the Available memory (https://www.linuxatemyram.com).
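
The same distinction can be inspected directly on a node with standard Linux tools; a minimal sketch:

free -h                                                        # compare the 'free' and 'available' columns
grep -E 'MemFree|MemAvailable|^Cached|^Active' /proc/meminfo   # raw counters behind such panels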

Lustre data

Data that is read from or written to the Lustre filesystem can be processed at different speeds, depending on e.g. the number of threads used when parallel I/O is employed. This panel illustrates the rate at which data is transferred per node.
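
Besides the number of I/O threads, the achievable rate also depends on how files are striped across the Lustre storage targets; a minimal sketch using the standard lfs tool (file and directory names as well as the stripe count are examples only):

lfs getstripe output.nc           # show the current striping of a file
lfs setstripe -c 4 results_dir/   # stripe new files in this directory over 4 OSTs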

IB data

Messages sent over the InfiniBand network are monitored. Here the accumulated bandwidth for each node is given - both for sent and received messages. Currently no distinction is made between network traffic to other compute nodes (like MPI messages) and traffic to the Lustre filesystem (i.e. traffic for I/O).
