DOI: 10.54287/gujsa.1870409 ISSN: 2147-9542

Uncertainty-Gated Dual-Branch Additive–Attention Network for Robust and Calibrated Tabular Classification Under Missingness

Mücahit Cihan
Tabular deep learning remains challenging in real-world settings. Many datasets mix numerical and categorical variables, contain substantial missingness, and call for interpretability and reliable probability estimates in addition to strong classification performance. DA2-Net is proposed to address this problem through a dual-branch architecture that combines an interpretable additive pathway for feature-wise main effects with a selective self-attention pathway for higher-order interactions. Features are ranked using additive contribution magnitude, uncertainty, and missingness-aware scaling, and only a Top-K subset is passed to a single multi-head self-attention block. The final prediction is obtained through uncertainty-aware gated fusion, and the model is further supported by sparsity, stability, and Brier-based calibration regularization. This design balances expressive interaction modeling with transparency and robustness under incomplete data. DA2-Net is evaluated on four public binary tabular benchmarks (AdultIncome, DefaultCredit, HeartDisease, and BankMarketing) under controlled Missing Completely At Random (MCAR) missingness levels of 0.0, 0.1, 0.2, and 0.3. The evaluation uses 5-fold stratified cross-validation repeated across three random seeds, which produces 15 runs for each dataset and missingness condition and 128 evaluation blocks in total (4 datasets × 4 missingness levels × 8 metrics) across AUC, AUPRC, ACC, F1, sensitivity, specificity, Brier score, and Expected Calibration Error (ECE). Across this benchmark, DA2-Net achieves the best overall mean rank (3.078 ± 2.044), ahead of SAINT-Lite (3.980 ± 2.624). It achieves or shares the best result in all 16 AUC blocks, 13 of 16 AUPRC blocks, 10 of 16 ACC blocks, 11 of 16 Brier blocks, and 7 of 16 ECE blocks. These results show that its main strength lies in robust ranking-based discrimination and strong overall probability quality under missingness.
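The evaluation protocol described above can be illustrated with a minimal NumPy sketch. It is not the paper's code: the function names `inject_mcar`, `brier_score`, and `expected_calibration_error` are hypothetical, the ECE uses 10 equal-width bins on the positive-class probability (one common binary variant, assumed here because the abstract does not state the binning), and missing entries are encoded as NaN by assumption.

```python
import numpy as np

# Counts stated in the abstract, reproduced as arithmetic.
N_RUNS = 5 * 3          # 5 folds x 3 seeds = 15 runs per dataset/missingness cell
N_BLOCKS = 4 * 4 * 8    # 4 datasets x 4 missingness levels x 8 metrics = 128


def inject_mcar(X, rate, rng):
    """Mask each entry independently with probability `rate` (MCAR).

    Returns a NaN-masked copy of X and the boolean mask (assumed encoding).
    """
    X = np.asarray(X, dtype=float).copy()
    mask = rng.random(X.shape) < rate
    X[mask] = np.nan
    return X, mask


def brier_score(y_true, p):
    """Mean squared difference between predicted probability and binary label."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((p - y_true) ** 2))


def expected_calibration_error(y_true, p, n_bins=10):
    """Binary ECE: bin-weighted gap between observed positive rate and mean probability."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    # Equal-width bins on [0, 1]; p == 1.0 is placed in the last bin.
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - p[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)
```

Lower values are better for both metrics. In practice, `inject_mcar` would be called once per missingness level with a seeded `np.random.default_rng(seed)` so that each of the 15 runs is reproducible.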
DA2-Net also shows a favorable practical-efficiency profile in the current benchmark, remaining more compact and inference-efficient than the main transformer-like baselines. Epoch-wise loss analysis shows stable convergence across all four datasets: the binary cross-entropy (BCE) term drives the optimization, while the auxiliary regularizers act as controlled refinements. The ablation study further confirms that the interaction branch is essential. Removing it in the AdditiveOnly variant causes the clearest degradation in both predictive and calibration metrics, whereas removing the gate or the auxiliary regularization terms leads only to minor changes. A sensitivity analysis supports the selected interaction subset size k=10 and spline knot count K=8 as balanced settings, while additive shape-function visualizations provide direct qualitative evidence for feature-wise interpretability.
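The Top-K selection and gated fusion steps named in the abstract can be sketched as follows. The abstract does not give the scoring or gating formulas, so everything specific here is an illustrative assumption rather than the authors' method: the product-form score, the 0.5 down-weighting of missing features, the scalar gate parameters `w` and `b`, and the function names `select_top_k` and `gated_fusion` are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def select_top_k(contrib, uncertainty, missing, k):
    """Return indices of the k highest-scoring features for one sample.

    Hypothetical score (NOT the paper's formula): additive contribution
    magnitude, scaled up by uncertainty and down for missing features.
    """
    contrib = np.asarray(contrib, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)
    missing = np.asarray(missing, dtype=float)  # 1.0 where the feature is missing
    score = np.abs(contrib) * (1.0 + uncertainty) * (1.0 - 0.5 * missing)
    order = np.argsort(-score, kind="stable")   # descending by score
    return order[:k]


def gated_fusion(z_add, z_int, gate_input, w=1.0, b=0.0):
    """Blend additive and interaction logits with a scalar gate.

    Assumed form: g = sigmoid(w * gate_input + b), where gate_input is an
    uncertainty summary; the fused logit is g * z_add + (1 - g) * z_int.
    Returns the fused probability and the gate value.
    """
    g = sigmoid(w * gate_input + b)
    z = g * z_add + (1.0 - g) * z_int
    return sigmoid(z), g
```

With k=10, as selected in the sensitivity analysis, only ten feature tokens per sample would reach the attention block, which is consistent with the compactness claim. In this sketch, setting the gate to a constant would correspond loosely to the no-gate ablation.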
