Presentation
Impacts of floating-point non-associativity on reproducibility for HPC and deep learning applications
Description

Run-by-run variability in parallel programs caused by floating-point non-associativity can affect the reproducibility of iterative algorithms, where rounding errors accumulate across iterations, and the correctness testing of stochastic programs. Deep learning (DL) training and inference can be extremely sensitive to such non-determinism, which can prevent certification, accurate assessment of robustness and sensitivity, and bug detection. New approaches in scientific computing that couple DL with simulation further aggravate these debugging and testing challenges. Here we investigate floating-point non-associativity within HPC programming models, analyze the performance and productivity impacts of replacing atomic operations with deterministic alternatives on GPUs, and examine the recently added deterministic options in the PyTorch framework in the context of GPU deployment. We evaluate the strategy of exploiting the automatic determinism provided by deterministic hardware, using the Groq LPU accelerator for inference portions, and demonstrate the benefits this strategy can provide for reproducibility and correctness efforts.
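
As a concrete illustration of the underlying issue, floating-point addition is not associative, so the grouping of operands alone can change a result. The following minimal Python sketch (with illustrative values chosen here, not taken from the study) shows the effect:

    # Floating-point addition is not associative: with values of very different
    # magnitude, the grouping of operands changes the rounded result.
    a, b, c = 1e20, -1e20, 1.0

    left = (a + b) + c    # (1e20 - 1e20) + 1.0 -> 1.0
    right = a + (b + c)   # 1.0 is absorbed into -1e20, so the sum is 0.0

    print(left, right)    # 1.0 0.0
    print(left == right)  # False

The same mechanism is what makes parallel reductions order-dependent: atomic additions on a GPU commit in a run-to-run varying order, so the same values can sum to different results across runs.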
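The PyTorch deterministic options mentioned above are exposed as a handful of documented flags; a minimal sketch of enabling them (assuming a CUDA build of PyTorch, and not necessarily the exact configuration evaluated in this work) might look like:

    import os
    import torch

    # cuBLAS needs a fixed workspace configuration for deterministic GEMM on
    # CUDA >= 10.2 (per the PyTorch documentation); set it before CUDA work begins.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    torch.manual_seed(0)                        # fix the RNG seed
    torch.use_deterministic_algorithms(True)    # error out on kernels with no deterministic implementation
    torch.backends.cudnn.deterministic = True   # restrict cuDNN to deterministic algorithms
    torch.backends.cudnn.benchmark = False      # disable autotuning, whose choices can vary between runs

Note that `torch.use_deterministic_algorithms(True)` raises an error rather than silently falling back when an operation has no deterministic GPU implementation, which is part of the performance and productivity trade-off examined here.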