Presentation

Containerized Checkpoint-Restart Mechanisms for HPC
Description
High-performance computing (HPC) systems are crucial for solving complex scientific problems, but resource management, fault tolerance, and consistent performance across diverse environments remain difficult challenges. Container technologies such as NERSC's Shifter and Podman-HPC offer partial solutions. This study uses Distributed MultiThreaded Checkpointing (DMTCP) to implement robust checkpoint/restart (C/R) mechanisms within containerized environments, addressing fault tolerance and resource management. It highlights successful C/R implementations on Perlmutter at NERSC using Shifter, Podman-HPC, and Apptainer; because Apptainer is more broadly adopted in the HPC container space, these results show where C/R within containers could be used at more HPC centers. Work on MPI-Agnostic Network-Agnostic (MANA) and containerized C/R for GPUs is also being pursued as future development. These insights emphasize the growing importance of resilience in containerized deployments for scientific computing, ultimately accelerating the pace of discovery and innovation.
Event Type
Workshop
Time
Sunday, 17 November 2024, 10:30am - 10:40am EST
Location
B313
Tags
Cloud Computing
Middleware and System Software
State of the Practice
Registration Categories
W