DOI: 10.46460/ijiea.1887612 ISSN: 2587-1943

TURK-SER: A Speech Emotion Recognition Dataset and Benchmark for Turkish

Mustafa Eriş, Fatma Güneş Eriş, Erhan Akbal
Speech emotion recognition (SER) is a growing research area that aims to improve human-computer interaction by recognizing emotions from speech signals. In recent years, advances in deep learning have produced highly successful SER studies. In particular, audio embeddings obtained from self-supervised models have significantly improved emotion recognition performance by capturing rich and meaningful representations of speech. Although self-supervised learning models have achieved notable results for languages such as English and German, high-quality datasets are still missing for other languages, including Turkish. This study introduces a new Turkish SER dataset, TURK-SER, which contains 2150 recordings of phonetically varied sentences produced by 90 speakers across five emotional categories. We also explore how to adapt the Wav2Vec2 model to Turkish SER using two fine-tuning methods: half fine-tuning, which updates only the transformer encoder, and full fine-tuning, which trains both the convolutional and transformer encoders. Experimental results indicate that full fine-tuning improves classification performance, reaching an accuracy of 85.44%. These findings underscore the promise of Wav2Vec2 for SER in low-resource languages, offer insights into optimizing self-supervised models for emotion detection, and pave the way for future studies on its applicability to other low-resource languages.
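The difference between the two fine-tuning methods can be illustrated with a minimal, framework-agnostic sketch of which parameter groups receive gradient updates. It is not the authors' implementation: the parameter-name prefixes (`feature_extractor.`, `encoder.`, `classifier.`) and the function `trainable_mask` are hypothetical, and the sketch assumes that a classification head is trained in both modes, which the abstract does not state.

```python
from typing import Dict, Iterable

# Hypothetical name prefixes for the parameter groups; a real checkpoint
# may name its convolutional and transformer stacks differently.
CONV_PREFIX = "feature_extractor."   # convolutional encoder
TRANSFORMER_PREFIX = "encoder."      # transformer encoder
HEAD_PREFIX = "classifier."          # emotion classification head (assumed)


def trainable_mask(param_names: Iterable[str], mode: str) -> Dict[str, bool]:
    """Map each parameter name to True if it is updated in the given mode.

    mode="half": the convolutional encoder is frozen.
    mode="full": the convolutional and transformer encoders are both trained.
    """
    if mode not in ("half", "full"):
        raise ValueError(f"unknown fine-tuning mode: {mode!r}")

    mask: Dict[str, bool] = {}
    for name in param_names:
        if name.startswith(CONV_PREFIX):
            # Only full fine-tuning updates the convolutional encoder.
            mask[name] = mode == "full"
        elif name.startswith((TRANSFORMER_PREFIX, HEAD_PREFIX)):
            # Updated in both modes (the head is an assumption of this sketch).
            mask[name] = True
        else:
            # Unrecognized groups stay frozen to keep the sketch conservative.
            mask[name] = False
    return mask
```

In a deep learning framework, such a mask would typically be applied by switching gradient tracking on or off for each named parameter before the optimizer is built.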
