A novel deep learning joint visual–acoustic representation framework for bird species classification using AVFusionNet
Bhubneshwar Sharma, Ajay Kumar, Randhir Singh, Parveen Kumar Lehana

Abstract
Automated bird species classification is a critical component of biodiversity assessment and ecological monitoring; however, conventional unimodal systems often suffer significant performance degradation in noisy and visually challenging environments. This paper presents AVFusionNet, a novel noise-aware multimodal deep learning framework that integrates visual and acoustic cues for robust bird species recognition. The proposed system employs dual convolutional neural network (CNN)-based feature extractors to learn spatial representations from bird images and temporal–spectral representations from log-Mel spectrograms of vocalizations. A key contribution is an adaptive feature fusion strategy that dynamically prioritizes the more reliable modality under adverse environmental conditions, thereby enhancing robustness and decision stability. A custom multimodal dataset comprising four bird species (Chicken, Crow, Duck, and Parrot) was constructed and evaluated under 10 controlled scenarios involving varying levels of image degradation and acoustic noise. Experimental results show that the proposed model achieves 98.73% classification accuracy under ideal conditions and maintains an accuracy of at least 80% across all degraded scenarios: 82.50–93.75% under visual degradation, 81.25–98.73% under acoustic noise, and 80–95% under simultaneous multimodal degradation. These results demonstrate that AVFusionNet provides robust and consistent performance, significantly outperforming conventional CNN, support vector machine (SVM), and random forest approaches, and is well suited for real-time ecological monitoring and intelligent wildlife surveillance applications.
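The adaptive fusion idea described above can be illustrated with a minimal sketch. The abstract does not state how the modality weights are computed, so everything here is an assumption made for illustration: the function names `softmax` and `adaptive_fusion`, the scalar per-modality reliability scores, and the use of a softmax to turn those scores into fusion weights. It is not the AVFusionNet implementation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def adaptive_fusion(visual_feat, audio_feat, visual_score, audio_score):
    """Fuse two feature vectors with reliability-dependent weights.

    ASSUMPTION: each modality comes with a scalar reliability score
    (for example, one derived from an estimate of image quality or
    audio signal-to-noise ratio). A higher score means the modality
    is trusted more. The source abstract does not specify this mechanism.
    """
    weights = softmax(np.array([visual_score, audio_score], dtype=float))
    fused = weights[0] * visual_feat + weights[1] * audio_feat
    return fused, weights


# Toy feature vectors standing in for the outputs of the two CNN branches.
visual = np.array([1.0, 0.0, 0.0, 0.0])
audio = np.array([0.0, 1.0, 0.0, 0.0])

# Degraded image, clean audio: the audio branch should dominate.
fused, w = adaptive_fusion(visual, audio, visual_score=0.0, audio_score=2.0)

# Equally reliable modalities: the weights should be equal.
fused_eq, w_eq = adaptive_fusion(visual, audio, visual_score=1.0, audio_score=1.0)
```

In this sketch the fused vector is a convex combination of the two inputs, so a modality with a low reliability score still contributes, only with a smaller weight. A learned gating network could replace the hand-supplied scores; that too would be a design choice beyond what the abstract specifies.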