Close

Presentation

AI-Based Scalable Analytics for Improving Performance and Resilience of HPC Systems
DescriptionAs high-performance computing (HPC) advances to exascale levels, its role in scientific fields such as medicine, climate research, finance, and scientific computing becomes increasingly critical. However, these large-scale systems are susceptible to performance variations caused by anomalies, including network contention, hardware malfunctions, and shared resource conflicts. These anomalies can lead to increased energy consumption, scheduling inefficiencies, and reduced application performance. Therefore, accurately and promptly diagnosing these performance anomalies is essential for maintaining the efficiency and reliability of HPC systems. Machine learning offers a powerful approach to automating the detection of such anomalies by learning patterns from the vast amounts of complex telemetry data generated by these systems. Our research focuses on increasing the efficiency and resilience of HPC systems through automated telemetry analytics, and this poster presentation will summarize our efforts and findings in this domain.