DOI: 10.1145/3711713 ISSN: 2836-8924

Applications of Certainty Scoring for Machine Learning Classification and Out-of-Distribution Detection

Alexander M. Berenbeim, Adam D. Cobb, Anirban Roy, Susmit Jha, Nathaniel D. Bastian

Quantitative characterizations and estimations of uncertainty are of fundamental importance for machine learning classification, particularly in safety-critical settings where continuous real-time monitoring requires explainable and reliable scoring. Reliance on the maximum a posteriori principle to determine label classification can obscure the certainty of a label assignment. We develop a theoretical framework for quantitative scores of certainty and competence based on predicted probability estimates, formally prove their properties, and empirically confirm the inferential power of these properties across different data modalities, tasks and model architectures. Our theoretical results establish that competent models have distinct distributions of certainty for true and false positives conditioned on inputs similar to training and testing data, and prove that this framework provides a reliable means to infer the quality of model predictions and detect false positives. Our empirical results bear out that there are distinct distributions of certainty scores on training and holdout data, as well as data that is a priori out-of-distribution. For expert models, at least 62.1% of false positives could be identified when using a cut-off at the bottom 5% TP threshold. Further, we found a strong negative correlation between empirical competence and the FPR95TPR rate for EnergyBased out-of-distribution (OOD) detectors. Finally, we developed two forms of an OOD detector that were able to reliably distinguish in-distribution data from OOD data for both frequentist and Bayesian models, performing better on average than previous state-of-the-art EnergyBased OOD detection methods, and improving upon the baseline Monte Carlo Dropout AUPR-OUT performance on average by 14.4% and 16.5%, and reducing the FPR95TPR by 54.2% and 37.6%.
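The cut-off at the bottom 5% TP threshold mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define the certainty score itself, so the code below assumes only that each prediction already carries a scalar score where higher means more certain. The function names `percentile` and `fp_detection_rate` are hypothetical, and the scores are synthetic draws chosen for illustration, not data or results from the paper.

```python
import random


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def fp_detection_rate(tp_scores, fp_scores, q=5.0):
    """Return (threshold, fraction of false positives scoring below it).

    The threshold is the q-th percentile of the true-positive scores, so by
    construction roughly q% of true positives fall below it.
    """
    threshold = percentile(tp_scores, q)
    flagged = sum(1 for s in fp_scores if s < threshold)
    return threshold, flagged / len(fp_scores)


# Synthetic scores (an assumption for illustration only): true positives
# skew toward high certainty, false positives toward low certainty.
rng = random.Random(0)
tp_scores = [rng.betavariate(8, 2) for _ in range(1000)]
fp_scores = [rng.betavariate(2, 5) for _ in range(200)]

threshold, rate = fp_detection_rate(tp_scores, fp_scores)
print(f"threshold={threshold:.3f}, flagged FP fraction={rate:.3f}")
```

Under this reading, the fraction returned for the false positives plays the role of the "at least 62.1%" figure quoted for expert models: the share of false positives that fall below a threshold that only about 5% of true positives fall below.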
