DOI: 10.3390/electronics15132770 ISSN: 2079-9292

A Contrastive and Uncertainty–Aware Framework for Multimodal Named Entity Recognition

Xiao Yang, Ruixue Zhao, Honglei Li

Multimodal named entity recognition (MNER) aims to improve entity detection in social media texts by leveraging accompanying images, but its performance is often affected by weak text–image alignment, noisy or irrelevant visual content, and limited separation among entity representations. To address these issues, this study proposes CUA-MNER, a contrastive uncertainty–aware framework that combines hierarchical vision–text alignment, variational uncertainty–aware fusion, and token-level contrastive learning. The alignment module models correspondences at token, phrase, and sentence levels, allowing local visual regions and global image context to support textual entity recognition. The fusion module estimates epistemic and aleatoric uncertainty through variational inference and adaptively adjusts the contribution of each modality for different samples. The contrastive objective further encourages entity representations of the same type to be closer while separating different entity types. Experiments on the Twitter2015 and Twitter2017 benchmarks demonstrate that CUA-MNER achieves F1 scores of 76.97% and 89.66%, respectively, outperforming competitive baselines by 0.66 and 1.95 F1 points. Ablation and diagnostic analyses show that the three components provide complementary benefits. These results suggest that modeling modality reliability is useful for robust MNER, while the additional modules also introduce computational overhead and leave cross-domain generalization as an open issue.
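The token-level contrastive objective described above can be illustrated with a minimal sketch. It assumes a supervised contrastive loss over cosine similarities with a temperature parameter; the abstract does not specify the exact formulation used in CUA-MNER, so the function names, the `temperature` value, and the toy embeddings below are hypothetical and serve only to show the general idea of pulling same-type entity tokens together and pushing different types apart.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def token_contrastive_loss(embeddings, labels, temperature=0.1):
    """Illustrative supervised contrastive loss over token embeddings.

    Assumption: tokens sharing an entity label are positives and all other
    tokens are negatives. This is a generic formulation, not necessarily
    the one used in the paper.
    """
    n = len(embeddings)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-type token contributes nothing
        # Temperature-scaled similarities from anchor i to every other token.
        sims = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        # Average negative log-probability of selecting each positive.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy 2-D token embeddings (hypothetical values).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Labels that agree with the geometric clusters versus labels that cut across them.
clustered = token_contrastive_loss(tokens, ["PER", "PER", "LOC", "LOC"])
mixed = token_contrastive_loss(tokens, ["PER", "LOC", "PER", "LOC"])
```

In this toy setup `clustered` comes out lower than `mixed`, because the loss is small when tokens of the same entity type already lie close together. Minimizing such a loss would therefore encourage the type-wise separation the abstract describes.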
