DOI: 10.46810/tdfd.1828359 ISSN: 2149-6366

Scene Classification via Attention-Guided Integration of Visual and Auditory Data Streams

Yusuf Çelik
This study proposes a novel multi-source deep learning architecture, the Gated Cross-Modal Fusion Transformer (GCM-FT), designed to integrate the complementary visual and auditory information sources more effectively in scene classification. The framework extracts deep representations from the visual stream using an EfficientNetV2 backbone, and for the auditory stream it processes the MFCC-based time–frequency features provided with the dataset. The representation vectors from the two streams are combined dynamically through a gated attention mechanism. With its multi-headed loss function, auxiliary stream outputs, and attention-based fusion block, the model learns the contributions of visual and auditory information in a stable and balanced manner. Extensive cross-validation experiments show that GCM-FT achieves higher accuracy, lower variance, and more consistent class-wise performance than single-stream models and existing fusion approaches. These findings indicate that attention-guided fusion offers a powerful and generalizable information integration strategy for visual–auditory scene classification tasks.
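The gated combination of the two stream representations can be illustrated with a minimal sketch. This is not the GCM-FT implementation: the abstract does not specify the gate's form, so the sketch assumes a common variant in which a sigmoid gate, computed from the concatenated embeddings, blends the visual and auditory vectors element by element. The function name `gated_fusion`, the weight names, and all dimensions are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(visual, audio, w_gate, b_gate):
    """Blend two same-sized embeddings with an element-wise sigmoid gate.

    Assumed form (not taken from the paper):
        gate  = sigmoid([visual ; audio] @ w_gate + b_gate)
        fused = gate * visual + (1 - gate) * audio
    """
    concat = np.concatenate([visual, audio], axis=-1)  # (batch, 2 * dim)
    gate = sigmoid(concat @ w_gate + b_gate)           # (batch, dim)
    fused = gate * visual + (1.0 - gate) * audio       # convex combination
    return fused, gate


# Hypothetical sizes and random stand-ins for the stream embeddings.
rng = np.random.default_rng(0)
batch, dim = 4, 8
visual = rng.normal(size=(batch, dim))  # e.g. projected image features
audio = rng.normal(size=(batch, dim))   # e.g. projected MFCC features
w_gate = rng.normal(scale=0.1, size=(2 * dim, dim))
b_gate = np.zeros(dim)

fused, gate = gated_fusion(visual, audio, w_gate, b_gate)
```

Because the gate lies strictly between 0 and 1, each fused feature stays between the corresponding visual and auditory values, and the gate itself shows how much each stream contributes for a given sample. In a trained model the gate weights would be learned jointly with the classifier rather than drawn at random.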
