MHS-STMA: Multimodal Hate Speech Detection via Scalable Transformer-Based Multilevel Attention Framework
Anusha Chhabra, Dinesh Vishwakarma
Social media has a substantial influence on individuals' lives, and hate speech on these platforms has become one of society's most significant challenges in recent years. Content shared on current social media platforms is typically multimodal, combining text and images. Earlier approaches have mostly focused on single-modality analysis, and existing multimodal approaches often fail to preserve the distinct characteristics inherent to each modality. To address these shortcomings, this study proposes a scalable framework for multimodal hate content detection, termed scalable transformer-based multilevel attention (STMA). The architecture has three primary components: a vision attention-mechanism encoder, a caption attention-mechanism encoder, and a combined attention-based deep learning mechanism. Each component interprets the multimodal input differently and uses a different attention mechanism to detect hate content. The framework is evaluated on three datasets: MultiOff, Hateful Memes, and MMHS150K. These datasets are used to establish evaluation criteria for hate speech classification and to validate the efficacy of the proposed architecture, which achieves accuracy scores of 0.6509, 0.8790, and 0.8076, respectively, and area under the curve (AUC) scores of 0.6475, 0.8376, and 0.8069. Both sets of scores indicate a significant improvement over other state-of-the-art methods and over results reported in previous studies. Across all three datasets, the proposed method outperforms the baseline approaches.
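For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how accuracy and AUC can be computed for a binary hateful / not-hateful classifier. It is not the authors' evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and the AUC is computed with the standard pairwise-ranking definition rather than any particular library routine.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label.

    The 0.5 threshold is an assumption for this sketch.
    """
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive example is scored
    higher than a random negative example; ties count as half a win."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = hateful, 0 = not hateful.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.7, 0.1]

print("accuracy:", accuracy(labels, scores))
print("AUC:", roc_auc(labels, scores))
```

Accuracy depends on the chosen decision threshold, whereas AUC summarises ranking quality across all thresholds, which is one reason the two numbers reported for each dataset can differ.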