Presentation

Tools to Diagnose and Repair Floating-Point Errors in Heterogeneous Computing Hardware and Software
Description
Floating-point arithmetic is central to HPC and ML, and the variety of number formats, hardware platforms, and compilers is exploding in this era of heterogeneity. This unfortunately increases the incidence of numerical issues: exceptions such as Infinity and NaN can render computed results unreliable or change control flow, excessive rounding can break the assumptions made by the numerical algorithm in use, and results can become non-reproducible when code is optimized or ported across platforms. In this tutorial, we present three novel tools: (1) GPU-FPX, which exposes silent exceptions in NVIDIA GPU computations, (2) Ciel, which pinpoints where compilers silently over-optimize and cause non-reproducibility, and (3) Herbie, which improves the accuracy of a programmer-written expression, significantly reducing rounding error or eliminating exceptions. This half-day tutorial will consist of (1) presentations of floating-point basics, (2) demos of all our numerical debugging tools, presenting their principles of operation and ideal usage contexts, and (3) plenty of time for Q&A, especially on using these tools within attendees' own organizations. New and emerging technologies such as Tensor Cores will be introduced by showing how to test for non-portability of codes across them.
Event Type
Tutorial
Time
Sunday, 17 November 2024, 1:30pm - 5pm EST
Location
B214
Tags
Debugging and Correctness Tools
Emerging Technologies
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Numerical Methods
Registration Categories
TUT