In Bugs We Trust? On Measuring the Randomness of a Fuzzer Benchmarking Outcome

DOI: 10.1145/3797112 ISSN: 2994-970X

Ardi Madadi, Seongmin Lee, Cornelius Aschermann, Marcel Böhme

In Google’s FuzzBench platform, we find that the outcome of a coverage-based evaluation agrees more strongly with the outcome of a bug-based evaluation than an independent bug-based evaluation does. Recently, Böhme et al. found that despite a very strong correlation between coverage achieved and bugs found, there is no strong agreement between the outcome of a coverage- and a bug-based evaluation: The fuzzer best at achieving coverage may be the worst at finding bugs. However, in trying to explain this moderate agreement, we wondered whether the outcome of bug-based benchmarking itself is perhaps much more “noisy” and turned to applied statistics to develop the tools necessary to investigate our hypothesis.

In this paper, we call this degree of “noisiness” of a benchmarking outcome the concordance of the benchmarking procedure and quantify it using a measure of statistical reliability widely used in psychology, called mean split-half reliability, i.e., the expected agreement on the benchmark outcome between two random halves of the benchmarking suite. In our experiments with FuzzBench and Magma, we find that the concordance of coverage-based benchmarking is consistently strong while that of bug-based benchmarking is weak on FuzzBench and moderate on Magma. In contrast to FuzzBench, for the Magma benchmark suite (which was designed for bug-based evaluation) a coverage-based evaluation does not predict the outcome of a bug-based evaluation better than an independent bug-based evaluation.
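To make the definition of mean split-half reliability concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a benchmark outcome is a ranking of fuzzers by their mean score over the selected benchmarks, and that agreement between two outcomes is measured by Spearman's rank correlation; the abstract does not state which agreement measure or ranking rule the paper uses. The function names, the `scores` layout, and the `n_splits` and `seed` parameters are all hypothetical.

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        # Assumption: treat a fully tied ranking as "no agreement".
        return 0.0
    return cov / (vx * vy) ** 0.5


def mean_split_half_reliability(scores, n_splits=1000, seed=0):
    """Estimate the expected agreement between two random halves of a suite.

    scores maps each fuzzer name to a list of per-benchmark scores
    (e.g. coverage or bugs found); all lists must have the same length >= 2.
    """
    rng = random.Random(seed)
    fuzzers = sorted(scores)
    n = len(scores[fuzzers[0]])
    idx = list(range(n))
    total = 0.0
    for _ in range(n_splits):
        rng.shuffle(idx)
        half_a, half_b = idx[: n // 2], idx[n // 2:]
        # Outcome of each half: every fuzzer's mean score on that half.
        out_a = [mean(scores[f][i] for i in half_a) for f in fuzzers]
        out_b = [mean(scores[f][i] for i in half_b) for f in fuzzers]
        total += spearman(out_a, out_b)
    return total / n_splits
```

Under these assumptions, a suite on which every benchmark orders the fuzzers identically yields a value of 1, while a suite whose per-benchmark orderings are unrelated yields a value near 0, which matches the intuition of a "noisy" benchmarking outcome.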

Moreover, to demonstrate the utility of concordance also for developers of benchmarking suites, we investigate concordance as a measure of benchmarking efficiency, as in green fuzzer benchmarking. We empirically confirm that the resources of a procedure with higher concordance can be reduced more substantially (in terms of campaign length or benchmark sampling size) while maintaining a similar benchmark outcome as a procedure with lower concordance. We report the corresponding savings in terms of carbon emissions.
