DOI: 10.3390/electronics15132855 ISSN: 2079-9292

A Speech Emotion Recognition Model Combining WavLM Pre-Trained Features and Attention Mechanism

Xuhong Huang, Yinan Wang, Huiling Huang, Chaohui Zhou, Xuanyu Zhou, Yongbin Chen

Speech emotion recognition is a key technology for intelligent human–computer interaction. Traditional methods rely on hand-crafted features and struggle to capture the complex emotional information in speech. To address this issue, this paper proposes a speech emotion recognition model that integrates the WavLM pre-trained model with a multi-head attention mechanism. The model first uses the self-supervised WavLM pre-trained model to extract deep acoustic features. A bidirectional long short-term memory network (BiLSTM) then captures the temporal dependencies in the feature sequence. Finally, a multi-head attention mechanism adaptively focuses on emotionally salient time segments, producing a more discriminative emotional feature representation. Experimental results on two public emotion datasets, RAVDESS and EmoDB, show that the proposed model achieves an unweighted accuracy of 97.22% and a weighted accuracy of 96.88% on RAVDESS, and an unweighted accuracy of 96.06% and a weighted accuracy of 95.33% on EmoDB. The model significantly outperforms various baseline models on both datasets, which supports its effectiveness and generalization ability.
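The abstract reports both unweighted and weighted accuracy without defining them. The sketch below assumes the convention common in speech emotion recognition work: weighted accuracy is the overall fraction of correctly classified utterances, and unweighted accuracy is the mean of the per-class recalls. The paper may define them differently. The function names and the toy labels are hypothetical and are not taken from the paper or its datasets.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by total utterances.

    Assumed convention; larger classes contribute more to the score.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: every emotion class counts equally.

    Assumed convention; this is robust to class imbalance.
    """
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correct predictions per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    recalls = [hit[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: four "angry" utterances and one "sad" utterance.
y_true = ["angry", "angry", "angry", "angry", "sad"]
y_pred = ["angry", "angry", "angry", "sad", "sad"]

wa = weighted_accuracy(y_true, y_pred)    # 4 of 5 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 3/4 and 1/1
```

On imbalanced data the two metrics can diverge in either direction. In this toy case the minority class is recognized perfectly, so the unweighted figure is higher than the weighted one.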
