BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T143138Z
LOCATION:B306
DTSTART;TZID=America/New_York:20241122T083200
DTEND;TZID=America/New_York:20241122T093000
UID:submissions.supercomputing.org_SC24_sess767_misc198@linklings.com
SUMMARY:Error-Resilient Machine Learning for HPC: Challenges and Opportuni
 ties
DESCRIPTION:Karthik Pattabiraman (University of British Columbia, Canada)\
 n\nMachine learning (ML) has increasingly been adopted in safety-critical 
 systems such as autonomous vehicles (AVs) and industrial robotics. In thes
 e domains, reliability and safety are important considerations, hence it i
 s critical to ensure the resilience of ML systems to faults and errors. Th
 is also applies to ML systems deployed in the HPC context. Meanwhile,
  soft errors are becoming more frequent in commodity computer systems d
 ue to the effects of technology scaling and reduced supply voltages. Furth
 er, traditional solutions for masking hardware faults such as triple-modul
 ar redundancy (TMR) are prohibitively expensive in terms of their energy a
 nd performance overheads. Therefore, there is a compelling need to provide
  low-cost error resilience to ML applications on commodity HPC platforms. 
 I will present three directions we have explored in my research group towa
 rds this goal.\n\nFirst, we experimentally assessed the resilienc
 e of ML applications to soft errors via fault injection. We found that eve
 n a single bit flip due to a soft error can lead to misclassification in d
 eep neural network (DNN) applications. Such misclassifications can result 
 in safety violations. However, not all errors result in safety violations,
  and so it is sufficient to protect the DNN from the ones that do. Unfortu
 nately, finding all possible errors that result in safety violations is a 
 very compute-intensive task.\n\nSecond, we proposed BinFI, a faul
 t injection approach that efficiently injects critical faults that are hig
 hly likely to result in safety violations, by leveraging the DNN’s propert
 ies.\n\nFinally, we proposed Ranger, an approach to protect DNNs 
 from critical faults without causing any loss in their accuracies, and wit
 h minimal performance overheads. I will conclude by presenting some of our
  ongoing work as well as the future challenges in this area. This is joint
  work with my students and colleagues at the University of British Columbi
 a, as well as industry collaborators.\n\nTag: Distributed Computing, Fault
 -Tolerance, Reliability, Maintainability, and Adaptability\n\nRegistration
  Category: Workshop Reg Pass\n\nSession Chairs: John Daly (US Department o
 f Defense); Bo Fang (University of Texas, Arlington); Scott Levy (Sandia N
 ational Laboratories); and Keita Teranishi (Oak Ridge National Laboratory 
 (ORNL))\n\n
END:VEVENT
END:VCALENDAR
