Presentation
Scrutinizing Variables for Checkpoint Using Automatic Differentiation
Description
Checkpoint/Restart (C/R) periodically saves the running state of a program, which consumes considerable time and system resources. We observe that in typical HPC applications not every piece of data is involved in the computation; such unused data should be excluded from checkpoints to improve storage and compute efficiency. We propose a systematic approach that leverages automatic differentiation (AD) to scrutinize every element within the variables (e.g., arrays) that must be checkpointed. Specifically, we use an AD tool to inspect each element of such a variable and determine whether it has an impact on the application output. This lets us classify elements as critical or uncritical and eliminate the uncritical ones from checkpointing. We validate our approach with all benchmarks from the NPB suite, and we visualize the distribution of critical and uncritical elements within a variable according to their binary impact (yes or no) on the application output.
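The element-wise test described above can be illustrated with a small, self-contained sketch. It is not the authors' tool or workflow: the tiny reverse-mode AD class `Var`, the stand-in computation `toy_kernel`, and the helper `critical_mask` are all hypothetical names introduced here for illustration. The sketch also assumes that a zero derivative of the output with respect to an element means the element has no impact, which is a simplification (a derivative can vanish at a particular point even when the element is read).

```python
class Var:
    """Scalar node in a tiny reverse-mode AD graph (illustrative only)."""

    def __init__(self, value, parents=()):
        self.value = float(value)
        self.parents = parents  # tuple of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__


def _lift(x):
    """Wrap plain numbers so they can take part in the graph."""
    return x if isinstance(x, Var) else Var(x)


def backward(out):
    """Propagate d(out)/d(node) to every node reachable from out."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(out)
    out.grad = 1.0
    # Reverse post-order guarantees a node is processed before its parents.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


def toy_kernel(u):
    """Hypothetical stand-in for an application: only interior elements
    of u are read; the two boundary elements never reach the output."""
    total = Var(0.0)
    for i in range(1, len(u) - 1):
        total = total + u[i] * u[i]
    return total


def critical_mask(kernel, values):
    """Mark element i as critical if d(output)/d(values[i]) is nonzero.
    Assumption of this sketch: zero derivative == no impact."""
    inputs = [Var(v) for v in values]
    backward(kernel(inputs))
    return [x.grad != 0.0 for x in inputs]


state = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mask = critical_mask(toy_kernel, state)

# Keep only critical elements (index -> value) in the checkpoint.
checkpoint = {i: v for i, (v, keep) in enumerate(zip(state, mask)) if keep}
print(mask)
print(checkpoint)
```

In this toy case the two boundary elements are flagged as uncritical, so the checkpoint stores four values instead of six. A real application would rely on an existing AD tool rather than a hand-written graph, and the mask would be computed per checkpointed variable.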