Close

Presentation

A Hierarchical Deep Learning Approach for Predicting Job Queue Times in HPC Systems
DescriptionAccurate wait time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the accuracy needed for confident predictions, and many were developed before the rise of deep learning.

In this work, we investigate and develop a neural network-based model to accurately predict wait times for jobs submitted to the Anvil HPC cluster. Data was taken from the Slurm Workload Manager on the cluster and transformed before performing additional feature engineering from jobs’ priorities, partitions, and states. We developed a hierarchical model that classifies job queue times into bins before applying regression, outperforming traditional methods. The model was then integrated into a CLI tool for queue time prediction. This study explores which queue time prediction methods are most applicable for modern HPC systems and shows that deep learning-based prediction models are viable solutions.