Confidence-Guided Fusion for Self-Supervised Monocular Depth Estimation in Endoscopy
Shuang Li, Hongbo Wang, Zhaoxu Hu, Tian Chu, Yingping Li, Liang Zhao

Accurate monocular depth estimation (MDE) is a foundational task in endoscopic surgery, critical for augmenting depth perception and aiding surgical navigation. While diffusion-based and discriminative depth estimators demonstrate complementary strengths, they also exhibit asymmetric errors: discriminative models yield precise geometric boundaries but struggle in homogeneous or saturated areas, whereas diffusion models recover fine textures at the cost of occasional structural incoherence. To systematically exploit this complementarity, we present CoDepth, a novel framework that leverages confidence-guided fusion to harmonize the outputs of these heterogeneous estimators. Its core components include a complementary map extractor that identifies structured disparity disagreements, a cross-attention module for context-aware feature integration, and a probabilistic confidence network that generates spatially adaptive fusion weights. Extensive evaluations on the SCARED dataset show that CoDepth achieves improved overall performance relative to strong single-model baselines, with the most consistent gains observed in Abs Rel and δ-based accuracy, while changes in some other error metrics are more modest. Furthermore, CoDepth exhibits encouraging cross-domain generalization. When a model trained on SCARED is directly evaluated on SERV-CT, Hamlyn, and C3VD without fine-tuning, it achieves competitive performance and improves several key metrics across datasets. The framework also demonstrates enhanced robustness against common synthetic corruptions like low-light conditions, Gaussian noise, and impulse noise, underscoring its practical utility in complex clinical settings. These results suggest that confidence-guided complementary fusion provides a practical integration-level paradigm for combining heterogeneous endoscopic depth estimators.
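
As a rough illustration of what spatially adaptive fusion weights can mean in practice, the sketch below blends two depth maps with a per-pixel weight and scores the result with Abs Rel and a δ-threshold accuracy. It is not the CoDepth implementation. The function names, the sigmoid mapping from confidence logits to weights, the convex-combination form of the fusion, the δ < 1.25 threshold (a common convention), and the toy data are all assumptions made for illustration. The abstract does not specify how the confidence network's output is applied.

```python
import numpy as np

def fuse_depth(depth_a, depth_b, conf_logits):
    """Per-pixel convex combination of two depth maps.

    conf_logits stands in for the output of a confidence network
    (hypothetical here); a sigmoid turns it into a weight in (0, 1)
    that is applied to depth_a, with the remainder going to depth_b.
    """
    w = 1.0 / (1.0 + np.exp(-conf_logits))
    return w * depth_a + (1.0 - w) * depth_b

def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid ground truth."""
    mask = gt > 0
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below threshold."""
    mask = gt > 0
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return float(np.mean(ratio < threshold))

# Toy data (assumed): each estimator is wrong in a different half of the image.
rng = np.random.default_rng(0)
gt = rng.uniform(20.0, 100.0, size=(4, 4))

depth_a = gt.copy()
depth_a[:, :2] *= 1.5   # estimator A overestimates the left half

depth_b = gt.copy()
depth_b[:, 2:] *= 1.5   # estimator B overestimates the right half

# Idealised confidence: trust B on the left half, A on the right half.
conf_logits = np.full((4, 4), 10.0)
conf_logits[:, :2] = -10.0

fused = fuse_depth(depth_a, depth_b, conf_logits)

print("Abs Rel  A / fused:", abs_rel(depth_a, gt), abs_rel(fused, gt))
print("delta    A / fused:", delta_accuracy(depth_a, gt), delta_accuracy(fused, gt))
```

In this toy setting the fused map recovers almost all of the error that either input makes alone, because the confidence weights are chosen by hand to match where each estimator fails. In a learned system the weights would be predicted, and the gain would depend on how well the predicted confidence tracks the actual error pattern.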