BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250626T234543Z
LOCATION:B313
DTSTART;TZID=America/New_York:20241117T103000
DTEND;TZID=America/New_York:20241117T104000
UID:submissions.supercomputing.org_SC24_sess732_ws_canolt105@linklings.com
SUMMARY:Containerized Checkpoint-Restart Mechanisms for HPC
DESCRIPTION:Madan Timalsina and Nicholas Tyler (NERSC at LBNL (Lawrence Be
 rkeley National Laboratory))\n\nHigh-performance computing (HPC) systems a
 re crucial for solving complex scientific problems\, but resource managem
 ent\, fault tolerance\, and consistent performance across diverse enviro
 nments remain difficult challenges. Container technologies\, like NERS
 C's Shifter and Podman-HPC\, offer some solutions to these problems. Thi
 s study uses Distributed MultiThreaded CheckPointing (DMTCP) to implemen
 t robust checkpoint/restart (C/R) mechanisms that address fault toleranc
 e and resource management within containerized environments. It highligh
 ts successful C/R implementations on Perlmutter at NERSC using Shift
 er\, Podman-HPC\, and Apptainer\, which has broader adoption in the HPC c
 ontainer space\, to show where C/R within containers could be used at mo
 re HPC centers. Work on MPI-Agnostic Network-Agnostic (MANA) and contain
 erized C/R for GPUs is also being pursued as part of future development
 s. These insights emphasize the growing importance of resilience in cont
 ainerized deployments in scientific computing\, ultimately acceleratin
 g the pace of discovery and innovation.\n\nTag: Cloud Computing\, Middlew
 are and System Software\, State of the Practice\n\nRegistration Categor
 y: Workshop Reg Pass\n\nSession Chairs: Richard Shane Canon (Lawrence Ber
 keley National Laboratory (LBNL))\, Alberto Madonna (Swiss National Supe
 rcomputing Centre (CSCS))\, Claudia Misale (IBM)\, and Andrew Younge (San
 dia National Laboratories)\n\n
END:VEVENT
END:VCALENDAR
