The project aims to reduce the power consumption of HPC systems in operation while maximizing throughput. To this end, system parameters that influence energy consumption are optimally tuned to the jobs currently running. The savings potential will be demonstrated at each participating data center using two selected applications. The project combines a comprehensive job-specific measurement and control infrastructure with machine learning (ML) techniques and software-hardware co-design, together with the ability to control energy parameters via the runtime environments. Policies specify the boundary conditions; the actual optimization of system parameters is then automatic and adaptive. To exploit the potential for energy savings as fully as possible, automatic phase detection is being developed, along with extensions to the MPI and OpenMP runtime environments that allow information about the application state to be communicated to the GEOPM framework. Interfaces and extensions in LIKWID will be developed to capture the required time-resolved metrics on energy consumption and application performance. For visualization and control of the GEOPM functionality, the job-specific performance monitoring framework ClusterCockpit is extended and coupled with GEOPM. The novelty of the approach is the development and provision of a production-ready software environment for fully user-transparent energy optimization of HPC applications. The project builds on existing open source software components and integrates, extends, and adapts them to the new requirements.





Funded by: