DOI: 10.1049/cvi2.70073 ISSN: 1751-9632

OSS–CAEA: Bridging Vision and Language for Open‐Vocabulary Semantic Segmentation via Collaborative Attention and Embedding Alignment

Yue Sun, Zihui Zhao, Rui Zhao

ABSTRACT

Open‐vocabulary semantic segmentation aims to recognise arbitrary categories beyond a fixed label set, yet existing vision–language methods often struggle to achieve stable pixel‐level alignment and suffer from feature drift under domain shifts. We propose OSS–CAEA, a framework that freezes the CLIP image and text encoders, Depth Anything V2 and a visual foundation model (VFM), and introduces a few learnable adaptation modules. The image passes through three parallel branches: CLIP extracts semantic representations, Depth Anything V2 provides geometric cues and the VFM captures spatial structures. The text is encoded by CLIP and projected to obtain shared textual priors. The collaborative attention module (CAM) fuses multiscale features from Depth Anything V2 and the VFM under GeoText Prompt (GTP) guidance, enhancing geometry–semantic consistency and alleviating the unstable responses caused by single‐layer proxies and appearance variations. CLIP features and CAM outputs are fused, reshaped and fed into a coarse segmentation head, whose supervised predictions provide spatial and category priors for refinement. Guided by these priors, the pixel–semantic alignment head (PSAH) narrows the gap between pixel semantics and language descriptions, reducing visual–language discrepancy and improving robustness for unseen categories. Experiments on multiple datasets show that OSS–CAEA consistently outperforms existing methods, and ablations validate each component.
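As a rough illustration of what pixel‐level alignment with language means in open‐vocabulary segmentation, the sketch below labels each pixel by the cosine similarity between its embedding and a set of class text embeddings. It shows the generic CLIP‐style matching step only and is not the PSAH described above. The function names, the toy two‐dimensional embeddings and the temperature of 0.07 are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_logits(pixel_emb, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarity of one pixel against every class text."""
    p = l2_normalize(pixel_emb)
    return [
        sum(a * b for a, b in zip(p, l2_normalize(t))) / temperature
        for t in text_embs
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def segment(pixel_grid, text_embs, class_names, temperature=0.07):
    """Assign each pixel the class whose text embedding it matches best."""
    labels = []
    for row in pixel_grid:
        out_row = []
        for pixel_emb in row:
            probs = softmax(cosine_logits(pixel_emb, text_embs, temperature))
            best = max(range(len(probs)), key=probs.__getitem__)
            out_row.append(class_names[best])
        labels.append(out_row)
    return labels


# Toy data (hypothetical): a 1x2 "image" with 2-D pixel embeddings.
toy_pixels = [[[1.0, 0.1], [0.1, 1.0]]]
toy_texts = [[1.0, 0.0], [0.0, 1.0]]
toy_names = ["cat", "grass"]
toy_labels = segment(toy_pixels, toy_texts, toy_names)
```

Because the class list is just a set of text embeddings, new categories can be added at inference time by encoding new prompts, which is what makes this style of matching open‐vocabulary.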
