Error-Resilient Machine Learning for HPC: Challenges and Opportunities
Machine learning (ML) is increasingly being adopted in safety-critical systems such as autonomous vehicles (AVs) and industrial robotics. In these domains, reliability and safety are paramount, so it is critical to ensure that ML systems are resilient to faults and errors; the same applies to ML systems deployed in the HPC context. At the same time, soft errors are becoming more frequent in commodity computer systems due to the effects of technology scaling and reduced supply voltages, and traditional solutions for masking hardware faults, such as triple modular redundancy (TMR), are prohibitively expensive in terms of their energy and performance overheads. There is therefore a compelling need to provide low-cost error resilience to ML applications on commodity HPC platforms. I will present three directions we have explored in my research group towards this goal.
First, we experimentally assessed the resilience of ML applications to soft errors via fault injection. We found that even a single bit flip due to a soft error can lead to misclassification in deep neural network (DNN) applications. Such misclassifications can result in safety violations. However, not all errors result in safety violations, and so it is sufficient to protect the DNN from the ones that do. Unfortunately, finding all possible errors that result in safety violations is a very compute-intensive task.
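To make the failure mode concrete, here is a minimal, self-contained sketch (not our actual fault-injection tooling) of a single bit flip in a float32 value; the three-class scores are made up, but they illustrate how one flip in an exponent bit can change the predicted class.

```python
import struct

import numpy as np

def flip_bit(value, bit):
    """Flip one bit of a float32 value (bit 31 = sign, bits 30-23 = exponent)."""
    as_int = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))[0]

# Made-up softmax scores for a 3-class model; class 1 is the correct output.
scores = np.array([0.10, 0.85, 0.05], dtype=np.float32)
faulty = scores.copy()
faulty[0] = flip_bit(faulty[0], 30)  # inject one flip into a high exponent bit

print("fault-free prediction:", scores.argmax())  # -> 1
print("faulty prediction:    ", faulty.argmax())  # -> 0 (misclassification)
```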
Second, we proposed BinFI, a fault-injection approach that leverages the DNN's properties to efficiently pinpoint the critical faults, i.e., those highly likely to result in safety violations.
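The sketch below illustrates the kind of binary search BinFI builds on, under the assumption (which BinFI establishes for many DNN operators) that a fault's impact grows monotonically with its magnitude; the `inject_and_check` callback is hypothetical and stands in for flipping a bit, rerunning inference, and checking the output.

```python
def find_critical_bit(inject_and_check, n_bits=31):
    """Binary-search for the lowest bit position whose flip corrupts the output.

    Assumes the fault's impact grows monotonically with its magnitude, so
    bit positions split into a benign low-order region and a critical
    high-order region. `inject_and_check(bit)` is a hypothetical callback
    that flips the given bit, reruns inference, and returns True if the
    predicted label changed.
    """
    lo, hi = 0, n_bits
    while lo < hi:
        mid = (lo + hi) // 2
        if inject_and_check(mid):  # critical: the boundary is at mid or below
            hi = mid
        else:                      # benign: the boundary is above mid
            lo = mid + 1
    return lo  # first critical bit, or n_bits if no single flip is critical

# Toy demo: pretend flips at bit 24 and above are the ones that corrupt.
print(find_critical_bit(lambda bit: bit >= 24))  # -> 24, in ~5 injections
```

Because the outcome is monotone in the bit position, every bit at or above the returned boundary is known to be critical without injecting it, which is where the savings over exhaustive injection come from.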
Finally, we proposed Ranger, an approach that protects DNNs from critical faults without any loss in their accuracy, and with minimal performance overheads. Its core idea, range restriction, is sketched below.
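As a rough illustration of range restriction (using made-up bounds; Ranger derives the real ones by profiling fault-free executions):

```python
import numpy as np

# Made-up bounds for one layer; Ranger would profile these from fault-free runs.
LOW, HIGH = 0.0, 6.0

def ranger_clamp(activations):
    """Truncate activations to the profiled fault-free range.

    Fault-free values already lie inside [LOW, HIGH], so clipping never
    alters a correct execution, while values blown up by a bit flip are
    pulled back into a benign range.
    """
    return np.clip(activations, LOW, HIGH)

faulty = np.array([0.3, 3.4e37, 1.2], dtype=np.float32)  # exponent-flip blow-up
print(ranger_clamp(faulty))  # -> [0.3 6.  1.2]
```

Because fault-free activations already fall within the profiled range, the clamp is a no-op on correct executions, which is why accuracy is unaffected, and it costs only a single elementwise operation.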
I will conclude by presenting some of our ongoing work, as well as the future challenges in this area. This is joint work with my students and colleagues at the University of British Columbia, as well as industry collaborators.