Presentation

Predicting Compute Node Unavailability in HPC: A Graph-Based Machine Learning Approach
Description
As high-performance computing (HPC) systems advance towards exascale computing, their size and complexity increase, introducing new maintenance challenges. Modern HPC systems feature data monitoring infrastructures that provide insights into the system's state. This data can be leveraged to train machine learning models to anticipate anomalies that require compute nodes to undergo maintenance procedures. This paper presents a novel approach to predicting such anomalies by creating a graph per measurement that encodes current and past sensor readings and information related to the compute node sensors. The experiments were performed with data collected from Marconi 100, a tier-0 production supercomputer at CINECA in Bologna, Italy. Our results show that the machine learning model can accurately predict anomalies and surpass current state-of-the-art (SOTA) models in both the quality of predictions and the time horizon over which they are forecast.
Event Type
Workshop
Time
Sunday, 17 November 2024, 4:40pm - 4:50pm EST
Location
B310
Tags
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
Registration Categories
W