BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250626T233533Z
LOCATION:B302-B305
DTSTART;TZID=America/New_York:20241121T100000
DTEND;TZID=America/New_York:20241121T170000
UID:submissions.supercomputing.org_SC24_sess534_post115@linklings.com
SUMMARY:Fault-Tolerant Numerical Iterative Algorithms at Scale
DESCRIPTION:Alix Tremodeux (ENS Lyon)\n\nNumerical iterative algorithms ar
 e struck by multiple error types when deployed on large-scale HPC platform
 s: fail-stop errors (failures) and silent errors\,
  striking both as computa
 tion errors and memory bit-flips. Our novel approach provides efficient fa
 ult-tolerant algorithms that are capable of detecting and correcting them 
 simultaneously. Previous works never addressed all the error types simulta
 neously.\n\nWe introduce a hierarchical periodic pattern combining various
  general-purpose and application-specific techniques and optimize its shap
 e in order to minimize the expected time per iteration. The derivation is 
 intricate because optimizing a resilience period for one error type depend
 s upon other errors possibly striking and slowing down execution progress.
 \n\nA case study with the preconditioned conjugate gradient algorithm (PCG
 ) demonstrates the good performance and flexibility of our
  approach\, which
  easily adapts to different application and fault-tolerance parameter cost
 s (e.g. iteration\, verification\, checkpoint).\n\nFuture work: extens
 ion to include more case studies.\n\nRegistration Category: Tech Program R
 eg Pass\, Exhibits Reg Pass\n\nSession Chairs: Ayesha Afzal
  (Friedrich-Alex
 ander University\, Erlangen-Nuremberg\; Erlangen National High
  Performance C
 omputing Center)\; Sally Ellingson (University of Kentucky)\;
  and Alan Sussm
 an (University of Maryland)\n\n
END:VEVENT
END:VCALENDAR
