Presentation
Improvement of Bridges-2 Resource Utilization Through User Optimization
Description
This poster presents our two-phase solution for improving GPU utilization in NSF-funded ACCESS high-performance computing (HPC) clusters, with a pilot implementation on Pittsburgh Supercomputing Center’s Bridges-2. Our approach addresses the limitations of Open XDMoD, which lacks per-job GPU usage monitoring and experiences delays in data availability. In phase one, we develop a data ingestion layer to collect GPU indices and resource usage data, utilizing existing software tools for efficient data aggregation and analysis. Analyzing 5,717 completed GPU jobs revealed issues such as workflow configuration errors, framework misconfigurations, and low GPU utilization. In phase two, we create a user-facing platform with modern web tools. This platform will automatically detect inefficiencies, notify users via email, and provide actionable insights to optimize resource management. By addressing these issues and integrating real-time data presentation, we aim to enhance overall system utilization, reduce GPU job wait times, and enable more efficient use of existing resources.
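
The detection step described above could look roughly like the following. This is a minimal sketch and not the authors' implementation: the `GpuJob` record, the 20% threshold, and the notice wording are all hypothetical, and it assumes per-job GPU utilization samples (in percent) have already been collected by the ingestion layer.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GpuJob:
    """Hypothetical per-job record produced by a data ingestion layer."""
    job_id: str
    user: str
    gpu_util_samples: list[float]  # GPU utilization samples, in percent (0-100)


# Assumed cutoff for "low utilization"; the poster does not state one.
LOW_UTIL_THRESHOLD = 20.0


def mean_utilization(job: GpuJob) -> float:
    """Average GPU utilization; a job with no samples counts as 0% (idle GPU)."""
    return mean(job.gpu_util_samples) if job.gpu_util_samples else 0.0


def find_inefficient_jobs(jobs: list[GpuJob],
                          threshold: float = LOW_UTIL_THRESHOLD) -> list[GpuJob]:
    """Return jobs whose mean GPU utilization is below the threshold."""
    return [job for job in jobs if mean_utilization(job) < threshold]


def draft_notice(job: GpuJob) -> str:
    """Build the text of a notification for the job's owner (sending is out of scope)."""
    return (
        f"Hello {job.user}, job {job.job_id} averaged "
        f"{mean_utilization(job):.1f}% GPU utilization. "
        "Consider checking your workflow or framework configuration."
    )
```

Flagged jobs would then be passed to whatever email mechanism the platform uses; that part is left out here because the description does not specify it.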

Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
Time
Tuesday, 19 November 2024, 12pm - 5pm EST
Location
B302-B305
TP
XO/EX