Classification of Spam Content in Turkish Short Messages Using Transformer-Based Models

DOI: 10.62520/fujece.1892277 ISSN: 2822-2881

Buğra Kıvrak, Mehmet Arzu, Mahmut Kaya, Murat Aydoğan, Yunus Santur

Short Message Service (SMS) is one of the most widely used communication tools today, and its widespread use has led to a growing volume of unwanted messages. Sent for fraud, advertising, or promotion, these are messages that users do not wish to receive. This study addresses the classification of SMS messages into spam and non-spam categories using a specialized SMS dataset. Preprocessing steps were applied to remove uninformative elements from the text, and four classification models were then developed: two classical machine learning algorithms (Naive Bayes and Random Forest) and two transformer-based deep learning models (BERTurk and Turkish ELECTRA). The dataset was split into an 80% training set and a 20% test set, and 5-fold cross-validation was additionally applied to verify the stability of the results. The trained models were analyzed and compared using Precision, Recall, F1-Score, and Accuracy. The best result was obtained by the BERTurk model on unprocessed text, evaluated using 5-fold cross-validation, with an accuracy of 0.990. The results demonstrate that different algorithms offer distinct advantages depending on the data structure.
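The evaluation protocol described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the function names `binary_metrics` and `k_fold_indices`, the toy labels, and the choice of contiguous, unshuffled, unstratified folds are all assumptions made for illustration. Spam is treated as the positive class (label 1).

```python
from typing import Dict, List, Sequence, Tuple


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy, with spam (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against empty denominators (e.g. no predicted spam at all).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, validation) index pairs over range(n_samples).

    Folds are contiguous and unshuffled (an assumption of this sketch);
    each sample appears in exactly one validation fold.
    """
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, val))
        start += size
    return folds


# Hypothetical toy labels: 1 = spam, 0 = non-spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

scores = binary_metrics(y_true, y_pred)
folds = k_fold_indices(len(y_true), k=5)
```

In a real experiment each fold's training indices would be used to fit a model and the validation indices to score it, with the per-fold metrics averaged; shuffling and stratifying by class are common refinements that this sketch leaves out.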
