Presentation

Fault-Tolerant Numerical Iterative Algorithms at Scale
Description
Numerical iterative algorithms deployed on large-scale HPC platforms are exposed to multiple error types: fail-stop errors (failures) and silent errors, the latter striking both as computation errors and as memory bit-flips. Our novel approach provides efficient fault-tolerant algorithms that detect and correct all of these error types simultaneously. Previous work never addressed all of them at once.

We introduce a hierarchical periodic pattern combining various general-purpose and application-specific techniques and optimize its shape in order to minimize the expected time per iteration. The derivation is intricate because optimizing a resilience period for one error type depends upon other errors possibly striking and slowing down execution progress.
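To illustrate what optimizing a resilience period means in the simplest setting, the sketch below uses the classic Young/Daly first-order approximation for a single error type (fail-stop errors with periodic checkpoints). It is not the hierarchical pattern of this poster, which handles several error types at once. The function names and the numeric values are assumptions chosen for illustration only.

```python
import math


def first_order_overhead(period, checkpoint_cost, error_rate):
    """First-order expected overhead per unit of work for a given period.

    checkpoint_cost / period : checkpoint cost amortized over the period
    error_rate * period / 2  : expected re-executed work, assuming an error
                               strikes on average in the middle of a period
    """
    return checkpoint_cost / period + error_rate * period / 2.0


def young_daly_period(checkpoint_cost, error_rate):
    """Period that minimizes first_order_overhead (Young/Daly formula)."""
    return math.sqrt(2.0 * checkpoint_cost / error_rate)


# Hypothetical values: 60 s per checkpoint, one failure per day on average.
checkpoint_cost = 60.0
error_rate = 1.0 / 86400.0

best = young_daly_period(checkpoint_cost, error_rate)
overhead_at_best = first_order_overhead(best, checkpoint_cost, error_rate)
```

With a second error type in play, the two terms above are no longer independent, which is why the combined derivation is harder than applying this formula once per error type.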

A case study with the preconditioned conjugate gradient (PCG) algorithm demonstrates the good performance and flexibility of our approach, which easily adapts to different application and fault-tolerance costs (e.g., iteration, verification, and checkpoint costs).
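For readers unfamiliar with PCG, a minimal sketch of the textbook algorithm follows, assuming a small dense symmetric positive-definite matrix and a Jacobi (diagonal) preconditioner. It shows only the plain solver; the verification and checkpoint mechanisms discussed in the poster are not included, and the helper names are illustrative.

```python
def matvec(A, x):
    """Dense matrix-vector product for A given as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A.

    Uses conjugate gradient with a Jacobi preconditioner (an assumption
    made for this sketch). Returns the solution and the iteration count.
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner
    x = [0.0] * n
    r = list(b)                                   # r = b - A x with x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)                                   # initial search direction
    rz = dot(r, z)
    for it in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:               # converged on residual norm
            return x, it
        q = matvec(A, p)
        alpha = rz / dot(p, q)                    # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz                        # keeps directions conjugate
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small hypothetical system; the exact solution is [1/11, 7/11].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iterations = pcg(A, b)
```

Each pass through the loop is one "iteration" in the sense of the cost parameters above; a fault-tolerant variant would interleave verifications and checkpoints between such iterations.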

Future work: extension to include more case studies.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
Time
Tuesday, 19 November 2024
12pm - 5pm EST
Location
B302-B305
Registration Categories
TP
XO/EX