Presentation
GVARP: Detecting Performance Variance on Large-Scale Heterogeneous System
Session: Performance Analysis
Description: Performance variance is one of the nasty pitfalls of large-scale heterogeneous systems, which can lead to unexpected and unpredictable performance degradation for parallel programs. Such performance issues typically arise from various random hardware and software faults, making it exceedingly difficult to pinpoint the exact causes of performance variance in specific instances. In this paper, we propose GVARP, a performance variance detection tool for large-scale heterogeneous systems. GVARP employs static analysis to identify the performance-critical parameters of kernel functions. Additionally, GVARP segments the program execution with external library calls and asynchronous kernel operations. Then GVARP constructs a state transfer graph and estimates the workload of each program segment to identify and cluster instances of similar workloads, facilitating the detection of performance variance. Our evaluation results demonstrate that GVARP effectively detects performance variance at a large scale with acceptable overhead and provides intuitive insights to locate the sources of performance variance.
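The idea of clustering segment instances by similar workload and then comparing their timings can be illustrated with a minimal sketch. This is not GVARP's implementation: the function name `detect_variance`, the fixed-width workload buckets, the median baseline, and the `slowdown` threshold are all assumptions chosen for illustration, and the abstract does not specify how GVARP estimates workloads or clusters instances.

```python
from collections import defaultdict
from statistics import median


def detect_variance(instances, bucket_width, slowdown=1.2):
    """Flag segment instances that run slower than peers with similar workload.

    instances: list of (segment_id, estimated_workload, elapsed_seconds).
    bucket_width: hypothetical width of a workload bucket, in workload units.
    slowdown: hypothetical threshold; an instance is flagged when its elapsed
              time exceeds slowdown * (median elapsed time of its cluster).
    Returns the sorted indices of flagged instances.
    """
    # Cluster instances of the same segment whose workloads fall in the same bucket.
    clusters = defaultdict(list)
    for index, (segment, workload, elapsed) in enumerate(instances):
        key = (segment, int(workload // bucket_width))
        clusters[key].append((index, elapsed))

    flagged = []
    for members in clusters.values():
        # The cluster median serves as the expected time for this workload.
        baseline = median(elapsed for _, elapsed in members)
        for index, elapsed in members:
            if elapsed > slowdown * baseline:
                flagged.append(index)
    return sorted(flagged)
```

With similar workloads grouped together, a slow instance stands out against its own cluster rather than against unrelated segments, which is why workload estimation comes before the timing comparison.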


