DOI: 10.1177/1088467x261459964 ISSN: 1088-467X

Contrastive learning with hybrid data augmentation and pseudo-label supervision for short text clustering

Sen Xu, Tao Yan, Shanliang Yao, Naixuan Guo, Xuesheng Bian, Xiufang Xu, Xianye Ben, Tian Zhou

Contrastive learning has become a powerful paradigm for unsupervised representation learning. However, its effectiveness largely depends on carefully designed data augmentation strategies to generate meaningful positive and negative pairs. Additionally, unsupervised clustering algorithms are typically sensitive to initialization and prone to converging to suboptimal local minima, resulting in unstable performance. To overcome these challenges, we propose HAPL, a unified end-to-end framework for short text clustering that integrates Hybrid data Augmentation with Pseudo-Label supervision. HAPL combines explicit and implicit data augmentation techniques in a synergistic strategy. It also incorporates an adaptive optimal transport mechanism for pseudo-label generation. This design provides principled supervision that stabilizes the optimization process and adapts to varied cluster distributions, thereby enhancing the model’s discriminative power. Furthermore, prototype learning is employed to reinforce the coherence of representations in the embedding space. Extensive experiments on eight benchmark datasets show that HAPL achieves state-of-the-art performance across various evaluation metrics. Comprehensive ablation experiments validate the contribution of each component to the overall effectiveness of the framework.
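To make the pseudo-label step more concrete, the following is a minimal sketch of optimal-transport pseudo-labeling using standard entropic (Sinkhorn) scaling in the log domain. It is not HAPL's adaptive mechanism, which the abstract does not specify. The function name `sinkhorn_pseudo_labels`, the parameters `epsilon` and `n_iters`, and the `cluster_weights` argument (a fixed target cluster-size distribution, uniform by default) are all assumptions introduced for illustration.

```python
import numpy as np


def _logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def sinkhorn_pseudo_labels(logits, cluster_weights=None, epsilon=0.1, n_iters=200):
    """Hypothetical helper: turn cluster logits (n_samples x n_clusters)
    into pseudo-labels via entropic optimal transport.

    Assumption: target cluster proportions are given (uniform if None).
    An adaptive scheme would update them during training instead.
    """
    n, k = logits.shape
    if cluster_weights is None:
        cluster_weights = np.full(k, 1.0 / k)
    log_r = np.full(n, -np.log(n))      # each sample carries mass 1/n
    log_c = np.log(cluster_weights)     # target mass per cluster

    # Log-probabilities of the model's current cluster predictions.
    log_p = logits - _logsumexp(logits, axis=1)[:, None]
    log_k = log_p / epsilon             # log of the Gibbs kernel

    f = np.zeros(n)
    g = np.zeros(k)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        f = log_r - _logsumexp(log_k + g[None, :], axis=1)
        g = log_c - _logsumexp(log_k + f[:, None], axis=0)

    plan = np.exp(f[:, None] + log_k + g[None, :])
    soft = plan / plan.sum(axis=1, keepdims=True)   # per-sample soft labels
    hard = soft.argmax(axis=1)                      # hard pseudo-labels
    return soft, hard


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_logits = rng.normal(size=(12, 3))          # synthetic, for illustration only
    soft_labels, hard_labels = sinkhorn_pseudo_labels(demo_logits)
    print(soft_labels.shape, hard_labels)
```

The column constraint is what discourages the degenerate solution in which every sample is assigned to one cluster. Replacing the uniform `cluster_weights` with estimated proportions is one way such a scheme could accommodate imbalanced clusters.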
