DOI: 10.35377/saucis...1830726 ISSN: 2636-8129

Modern Deep Learning Architectures for Urban Sound Classification on UrbanSound8K

Ulaş Yurtsever
Environmental sound classification (ESC) is critically important for monitoring noise pollution and ensuring urban safety in smart city applications. Although deep learning-based approaches have achieved high performance in this domain, many studies in the literature either rely on random dataset partitioning, which causes data leakage, or require pretraining on massive corpora (e.g., AudioSet) at high computational cost. In this study, we propose a methodologically robust and computationally efficient approach for environmental sound classification on the UrbanSound8K dataset. To ensure the reliability of the results, we adopt the official 10-fold cross-validation protocol, which is considered the most challenging evaluation scheme in the literature. In our experiments, the Vision Transformer (ViT) architecture is compared with modern CNN architectures, and the impact of the data augmentation techniques MixUp and SpecAugment on these architectures is analyzed. The results show that, under the official 10-fold protocol, ConvNeXt-Tiny achieves the best mean accuracy, reaching 83.94% with MixUp and 82.81% with the combined SpecAugment+MixUp setting, while ViT attains 81.94% under SpecAugment+MixUp. In contrast, random splitting artificially inflates accuracy to 98.06% due to leakage, underscoring the need for the official, leakage-free protocol.
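
The difference between the two evaluation schemes can be illustrated with a minimal sketch. It assumes metadata rows shaped like the UrbanSound8K metadata file, using only its `slice_file_name`, `fsID` (source recording), `fold`, and `classID` columns. The toy rows, the helper names, and the `train_and_eval` callback are hypothetical and stand in for the real data and models; this is not the author's code. The sketch holds out one predefined fold at a time and checks whether any source recording contributes slices to both sides of a split, which is one plausible way leakage arises when slices are partitioned without regard to the predefined folds.

```python
from statistics import mean


def make_toy_metadata():
    """Build 60 made-up rows: 10 folds x 2 recordings x 3 slices (hypothetical data)."""
    rows = []
    for fold in range(1, 11):
        for rec in range(2):
            fs_id = fold * 100 + rec
            for s in range(3):
                rows.append({
                    "slice_file_name": f"{fs_id}-{s}.wav",
                    "fsID": str(fs_id),
                    "fold": str(fold),
                    "classID": str(rec),
                })
    return rows


def official_splits(rows, n_folds=10):
    """Yield (test_fold, train_rows, test_rows), holding out one predefined fold at a time."""
    for k in range(1, n_folds + 1):
        test = [r for r in rows if int(r["fold"]) == k]
        train = [r for r in rows if int(r["fold"]) != k]
        yield k, train, test


def shared_recordings(train, test):
    """Return the source-recording IDs that appear on both sides of a split."""
    return {r["fsID"] for r in train} & {r["fsID"] for r in test}


def cross_validate(rows, train_and_eval):
    """Average the score returned by a user-supplied callback over the 10 folds."""
    return mean(train_and_eval(train, test) for _, train, test in official_splits(rows))


rows = make_toy_metadata()

# Fold-wise protocol: no recording is split between train and test.
official_leaks = [shared_recordings(tr, te) for _, tr, te in official_splits(rows)]

# Split that ignores folds (a deterministic stand-in for random partitioning):
# every 10th slice goes to test, so sibling slices of the same recording stay in train.
naive_test = rows[::10]
naive_train = [r for i, r in enumerate(rows) if i % 10 != 0]
naive_leak = shared_recordings(naive_train, naive_test)

print("leaky folds under the fold-wise protocol:", sum(bool(s) for s in official_leaks))
print("recordings shared under the naive split:", len(naive_leak))
```

With real data, `rows` would come from reading the dataset's metadata CSV (for example with `csv.DictReader`), and `train_and_eval` would train a model on `train` and return its accuracy on `test`, so that `cross_validate` yields a mean accuracy over the ten folds.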
