Abusive Language Detection in Khasi Social Media Comments
Arup Baruah, Lakhamti Wahlang, Firstbornson Jyrwa, Floriginia Shadap, Ferdous Barbhuiya, Kuntal DeyThis paper describes the work performed for automated abusive language detection in the Khasi language, a low-resource language spoken primarily in the state of Meghalaya, India. A dataset named Khasi Abusive Language Dataset (KALD) was created which consists of 4,573 human-annotated Khasi YouTube and Facebook comments. A corpus of Khasi text was built and it was used to create Khasi word2vec and fastText word embeddings. Deep learning, traditional machine learning, and ensemble models were used in the study. Experiments were performed using word2vec, fastText, and topic vectors obtained using LDA. Experiments were also performed to check if zero-shot cross-lingual nature of language models such as LaBSE and LASER can be utilized for abusive language detection in the Khasi language. The best F1 score of 0.90725 was obtained by an XGBoost classifier. After feature selection and rebalancing of the dataset, F1 score of 0.91828 and 0.91945 were obtained by an SVM based classifiers.