Presentation
Containerized Checkpoint-Restart Mechanisms for HPC
Description
High-performance computing (HPC) systems are crucial for solving complex scientific problems, but they pose challenges in resource management, fault tolerance, and maintaining consistent performance across diverse environments. Container technologies, such as NERSC's Shifter and Podman-HPC, offer some solutions to these problems. This study uses Distributed MultiThreaded CheckPointing (DMTCP) to implement robust checkpoint/restart (C/R) mechanisms that address fault tolerance and resource management within containerized environments. It highlights successful C/R implementations on Perlmutter at NERSC using Shifter, Podman-HPC, and Apptainer; because Apptainer has broader adoption in the HPC container space, these results show where C/R within containers could be used at more HPC centers. Work on MPI-Agnostic Network-Agnostic (MANA) and containerized C/R for GPUs is also being pursued as part of future developments. These insights emphasize the growing importance of resilience in containerized deployments in scientific computing, which can ultimately accelerate the pace of discovery and innovation.