DOI: 10.1142/s0219649226500528 ISSN: 0219-6492

Arabic Dialect Identification in Romanised Text (Arabizi)

Seifeddine Mechti, Eya Hammami, Shayma Boukari, Rim Faiz

The need for Arabic Dialect Identification (ADI) is growing because of the large volume of dialectal language used in Arabic-speaking countries. While most studies have focused on dialects written in Arabic script, very little work has addressed dialect identification in Romanised Arabic (Arabizi), which is characterised by irregular spellings, varied vocabulary, and mixed language (code-switching). In this study, we address this problem by developing a hybrid approach that combines character-level Convolutional Neural Networks (CNNs) with transformer-based embeddings fine-tuned on Arabizi data. Additionally, we developed a normalisation process to handle differences in orthography. In experiments on several social media datasets, the proposed system achieved an F1-score of 0.87, which is 6% higher than the other approaches compared. We provide additional resources, including annotations for these datasets, and offer methodological guidelines for working with Romanised dialectal Arabic. Finally, we describe specific applications of this work, such as sentiment analysis and machine translation for content created by younger speakers and migrant communities.
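The abstract does not specify how the normalisation process works, so the following is only a minimal sketch of what orthographic normalisation for Arabizi might look like, not the authors' method. The function name `normalise` and the `VARIANTS` table are hypothetical, and the two mappings shown (the digraph "kh" and the digit "5" for the same sound, and "ch" versus "sh") are illustrative assumptions chosen because such spelling alternations are common in Romanised Arabic.

```python
import re

# Hypothetical table of spelling variants mapped to one canonical form.
# These two entries are illustrative assumptions, not taken from the paper.
VARIANTS = {
    "kh": "5",   # "kh" and "5" are both commonly used for the same consonant
    "ch": "sh",  # "ch" and "sh" are both commonly used for the same consonant
}


def normalise(text: str) -> str:
    """Reduce some spelling variation in a Romanised Arabic string."""
    # Case carries no dialect information here, so fold it away.
    text = text.lower()
    # Collapse expressive elongation: any run of 3+ identical characters
    # (including spaces) becomes a single character.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Rewrite each known variant to its canonical form.
    for variant, canonical in VARIANTS.items():
        text = text.replace(variant, canonical)
    # Squeeze remaining whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalise("Barchaaaa   Khouya"))
```

A real system would presumably need a much larger, dialect-aware variant table and would have to avoid rewriting code-switched French or English words; this sketch ignores both concerns.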
