DOI: 10.3390/make8070179 ISSN: 2504-4990

Calibration, Architecture, and Distribution Shift in Predictive Uncertainty Estimation

Dina Šabanović, Tea Krčmar, Zdravko Krpić, Ivica Lukić

Reliable uncertainty quantification matters for high-stakes tabular classification, yet the comparative influence of architecture and post-hoc calibration on uncertainty quality remains underexplored, particularly outside in-distribution conditions. We present a matched-protocol benchmark on 36 OpenML-CC18 datasets comparing three GBDTs, a single MLP, MC-Dropout, and deep ensembles of three, five, and ten members under five post-hoc calibration methods, with additional evaluations under Gaussian feature shift and symmetric label noise, an assessment of split-conformal coverage, and a sub-comparison against TabPFN v2. All paired comparisons use Wilcoxon signed-rank tests with Holm corrections within pre-specified research questions at α=0.05. In this benchmark, calibrator choice had a larger practical influence on uncertainty metrics than the difference between calibrated GBDTs and five-member neural ensembles. We observe a clear architecture-by-calibrator interaction: Dirichlet calibration was generally strongest on the evaluated GBDTs under low-to-moderate class imbalance, whereas temperature scaling was generally strongest on the evaluated neural models. Under controlled covariate shift, the in-distribution ordering reversed: multinomial logistic recalibration was the strongest of the tested calibrators on the evaluated neural models under heavy perturbation. Under high class imbalance, the preference for Dirichlet calibration on the evaluated GBDTs weakened. GBDTs were run at recommended defaults rather than tuned per dataset, and the covariate-shift protocol uses synthetic Gaussian noise rather than naturalistic out-of-distribution data; the shift-related findings should be read as directional indicators within this protocol. Calibration and architecture should therefore be selected jointly: the preferred calibrator depends on the model family in-distribution and changes again under perturbation.
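
For readers unfamiliar with the multiple-comparison step named above, the following is a minimal sketch of the Holm step-down procedure applied to one family of p-values at α=0.05. It is not the authors' code: the function name `holm_reject` and the example p-values are hypothetical, and the p-values stand in for whatever the paired Wilcoxon signed-rank tests within a single research question would produce.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure for one family of hypotheses.

    Returns a list of booleans, in the input order, marking which
    null hypotheses are rejected at family-wise error rate `alpha`.
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # against alpha / (m - k + 1) = alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Step-down: once one test fails, all larger p-values fail too.
            break
    return reject


# Hypothetical p-values for four paired comparisons in one research question.
example_p = [0.01, 0.04, 0.03, 0.005]
decisions = holm_reject(example_p)
print(decisions)
```

Treating each research question as its own family, as the abstract describes, means `holm_reject` would be called once per question rather than once over all comparisons in the benchmark.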
