DOI: 10.7717/peerj-cs.4021 ISSN: 2376-5992

Arabic speech emotion recognition (2015–2024): a systematic review of datasets, dialects, and classification methods

Mohannad Akram Alkhalili, Norisma Binti Idris, Aznul Qalid Md Sabri, Ayham Alomari, Noor M. Alkudah, Manjeevan Singh Seera

Background

Speech Emotion Recognition (SER) is an important component of Human–Computer Interaction (HCI), with applications such as mental health monitoring, adaptive learning systems, and smart environments. Research in Arabic SER, however, remains constrained by dataset limitations, dialectal variation, and inconsistent evaluation practices.

Methodology

This review systematically examines Arabic SER research published between 2015 and 2024. A PRISMA-guided process was used to identify 83 eligible studies across major academic databases. We analyze 24 emotional speech datasets in terms of dialectal coverage, emotional categories, speaker demographics, and recording methodologies (acted, semi-natural, and natural). We also review the classification approaches used in the field, including Classical Machine Learning (CML), Deep Learning (DL), and, more recently, transformer-based self-supervised learning (SSL) models, which represent a specialized class of deep learning approaches.

Results

The review reveals substantial variability in dataset design, annotation practices, and evaluation protocols. Most datasets are acted and dominated by a small set of emotions, with limited representation of spontaneous speech, nuanced affective states, and dialects such as Levantine varieties. Speaker metadata is inconsistently reported, and many datasets are not publicly accessible, which restricts reproducibility. Recent modelling trends show a transition from handcrafted-feature approaches to DL and SSL models, yet the lack of standardized benchmarks prevents meaningful comparison across studies.

Conclusions

Arabic SER research has advanced in methodological diversity and modelling capabilities, but structural limitations in dataset availability, dialect representation, and evaluation standards continue to impede progress. Developing dialect-inclusive, openly available emotional speech corpora with transparent metadata, balanced emotion coverage, and unified benchmarking protocols is essential for supporting robust, generalizable SER systems.
