Presentation
Designing Quality MPI Correctness Benchmarks: Insights and Metrics
Description
Several MPI correctness benchmarks have been proposed to evaluate the quality of MPI correctness tools.
Designing such a benchmark poses several challenges, which we address in this paper.
First, an imbalance in the proportion of correct and erroneous codes in a benchmark requires careful interpretation of metrics such as recall, accuracy, and F1 score (a worked example appears at the end of this description).
Second, tools that detect errors but report no additional information, such as the affected source line or the class of error, are less helpful.
We extend the typical notion of a true positive with stricter variants that consider a tool's helpfulness.
We introduce a new noise metric that quantifies the amount of distracting error reports.
We evaluate these new metrics on the MPI correctness tools ITAC, MUST, and PARCOACH using MPI-BugBench.
Third, we discuss the complexities of hand-crafted and automatically generated benchmark codes, as well as the additional challenges posed by non-deterministic errors; a minimal sketch of such an error follows.
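The following is a minimal, assumed sketch (not a code taken from MPI-BugBench) of one well-known non-deterministic MPI error: both ranks send before receiving, which works while the messages fit the eager protocol but deadlocks once the library switches to the rendezvous protocol, so whether the error manifests depends on message size and MPI implementation.

/* Illustrative sketch of a non-deterministic MPI error: a send-send
 * exchange that deadlocks only when the message size exceeds the MPI
 * library's eager threshold. Run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define N (1 << 20) /* large enough to exceed typical eager thresholds */

static double sendbuf[N], recvbuf[N];

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int peer = (rank + 1) % size;

    /* Deadlock risk: under the rendezvous protocol, both ranks block in
     * MPI_Send waiting for a matching receive that neither rank has
     * posted yet. */
    MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d finished the exchange\n", rank);
    MPI_Finalize();
    return 0;
}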
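To make the first point concrete, consider a hypothetical benchmark (the numbers are assumed for illustration only) containing 90 erroneous and 10 correct codes. A trivial tool that flags every code as erroneous achieves recall = 90/90 = 1.0, precision = 90/100 = 0.9, accuracy = 90/100 = 0.9, and F1 = 2 * (0.9 * 1.0) / (0.9 + 1.0) ≈ 0.95, even though it has no detection capability at all. With so few correct codes, accuracy and F1 reward indiscriminate over-reporting, which is why such imbalanced benchmarks demand careful metric interpretation.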