DOI: 10.17678/beuscitech.1920012 ISSN: 2146-7706

Transformer-Based Otitis Media Classification: A Comparative Study of ViT, DeiT, and PVT Architectures in Otoscopic Image Analysis

Sedat Örenç
Otitis media (OM) is one of the most prevalent middle ear diseases globally and imposes a major clinical, public health, and socio-economic burden, particularly among pediatric populations. Conventional diagnostic tools, such as otoscopy and tympanometry, are subjective, operator-dependent, and low in specificity, which underscores the need for reliable, scalable, and automated diagnostic solutions. Recent advances in deep learning have improved image-based diagnosis; however, traditional Convolutional Neural Networks (CNNs) remain limited in capturing the long-range dependencies needed to distinguish subtle tympanic membrane pathologies. To address these limitations, this study systematically benchmarks three transformer-based architectures, Vision Transformer (ViT), Data-Efficient Image Transformer (DeiT), and Pyramid Vision Transformer (PVT), for the automated classification of otoscopic images into Acute Otitis Media (AOM), Chronic Otitis Media (COM), and normal cases. A balanced dataset of 1,800 images was curated and augmented to support a fair evaluation under standardized training conditions. The experimental results show that ViT performs well for AOM, achieving an accuracy and F1-score of 0.98, but its performance declines for chronic and normal cases. In contrast, DeiT produces the most consistent results across all categories, achieving an accuracy of 1.00 for AOM and 0.96 for chronic and normal cases. PVT also performs strongly, achieving an accuracy of 1.00 for OM and 0.99 for normal cases. These findings indicate that DeiT and PVT are more robust than ViT and suggest their suitability for real-world clinical applications.
Beyond delivering a reproducible benchmarking framework, this work helps bridge the gap between algorithmic innovation and clinical translation, offering a path toward reliable, interpretable, and scalable AI-assisted diagnosis of otitis media.
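
The abstract reports per-class scores (accuracy and F1-score) for the three classes. As a minimal sketch of how per-class precision, recall, and F1 can be computed from predictions in a three-class setting, the following may help readers interpret those figures. The function name `per_class_metrics` and the toy labels are hypothetical and invented for illustration; they are not the study's evaluation code or data, and the abstract does not state how its per-class accuracy was defined.

```python
from collections import Counter

CLASSES = ["AOM", "COM", "normal"]  # class names taken from the abstract


def per_class_metrics(y_true, y_pred, classes=CLASSES):
    """Return {class: (precision, recall, f1)} using one-vs-rest counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[t] += 1   # true class t was missed
    metrics = {}
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = (precision, recall, f1)
    return metrics


# Toy labels, invented purely for illustration (not data from the study).
y_true = ["AOM", "AOM", "COM", "COM", "normal", "normal"]
y_pred = ["AOM", "AOM", "COM", "normal", "normal", "normal"]
metrics = per_class_metrics(y_true, y_pred)

for name, (precision, recall, f1) in metrics.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy example one COM image is misclassified as normal, which lowers recall for COM and precision for normal while leaving AOM untouched. This is why per-class reporting, as used in the abstract, is more informative than a single overall accuracy.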
