Integrating Evolutionary Algorithms with Distributed Deep Learning for Optimizing Hyperparameters on HPC Systems
High performance computing (HPC) systems have become essential for solving complex scientific problems, particularly in the context of deep learning (DL). This extended abstract presents a novel system that uses a multi-objective evolutionary algorithm (EA) to optimize hyperparameters for a deep learning model, AtomAI, minimizing both validation loss and energy use. We use the parallel and distributed computing capabilities of Dask and the scalable provenance features of FlowCept to measure CPU and GPU resource usage as proxies for energy consumption. Our approach focuses on integrating multiple software components to operate efficiently on large-scale HPC systems; we specifically target the OLCF's Frontier supercomputer, but the approach should generalize to other HPC environments.
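The core loop described above can be sketched in miniature. The following is a hedged illustration, not the system's actual implementation: the hyperparameter names, the toy `evaluate` function (standing in for a Dask-distributed training run whose loss and energy proxy would come from the model and FlowCept telemetry), and all constants are assumptions made for the example. It shows the multi-objective selection idea: keep the Pareto-non-dominated configurations (those no worse in both validation loss and energy, and strictly better in at least one) and refill the population by mutation.

```python
import random

random.seed(0)

# Hypothetical search space -- illustrative names, not AtomAI's actual
# hyperparameters.
SPACE = {
    "learning_rate": (1e-4, 1e-1),
    "batch_size": (8, 128),
}

def sample():
    """Draw a random configuration from the search space."""
    return {
        "learning_rate": random.uniform(*SPACE["learning_rate"]),
        "batch_size": random.randint(*SPACE["batch_size"]),
    }

def evaluate(cfg):
    """Stand-in for a distributed training run. Returns
    (validation_loss, energy_proxy); in the real system these would come
    from model training and CPU/GPU usage captured via FlowCept."""
    loss = abs(cfg["learning_rate"] - 0.01) + 1.0 / cfg["batch_size"]
    energy = cfg["batch_size"] * 0.5 + cfg["learning_rate"] * 10
    return loss, energy

def dominates(a, b):
    """True if objective vector `a` Pareto-dominates `b` (minimization):
    no worse in every objective, strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))

def mutate(cfg):
    """Perturb a configuration, clipping back into the search space."""
    lo_lr, hi_lr = SPACE["learning_rate"]
    lo_bs, hi_bs = SPACE["batch_size"]
    child = dict(cfg)
    child["learning_rate"] = min(max(
        child["learning_rate"] * random.uniform(0.5, 2.0), lo_lr), hi_lr)
    child["batch_size"] = min(max(
        child["batch_size"] + random.randint(-16, 16), lo_bs), hi_bs)
    return child

def evolve(pop_size=8, generations=5):
    pop = [sample() for _ in range(pop_size)]
    for _ in range(generations):
        # In the real system, these evaluations would be farmed out in
        # parallel (e.g. as Dask tasks across HPC nodes).
        scored = [(cfg, evaluate(cfg)) for cfg in pop]
        # Keep the non-dominated (Pareto) front as parents.
        front = [c for c, f in scored
                 if not any(dominates(g, f) for _, g in scored)]
        front = front[:pop_size]
        pop = front + [mutate(random.choice(front))
                       for _ in range(pop_size - len(front))]
    return pop

final_population = evolve()
```

In practice the serial `evaluate` calls would be replaced by concurrent task submissions (each training one model instance), with FlowCept recording the resource-usage provenance per task; the Pareto-selection logic itself is unchanged by that distribution.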