DOI: 10.3390/life16071108 ISSN: 2075-1729

Explainable Multi-Modal Deep Learning for Recording-Level Classification of Respiratory Audio Signals Under Internal and Domain-Shift Evaluation

S M Asiful Islam Saky, Md Saiful Arefin, Md Rashidul Islam, Mohammad Saiful Islam, Rashadul Islam Sumon, Md Mostafizur Rahman Masud, Maria Lapina, Mikhail Babenko, Mohammed Muthanna

Respiratory diseases are a major global health challenge. However, their identification is often limited by subjectivity, environmental noise, and inter-clinician variability. This study presents an explainable multimodal deep learning framework for recording-level multiclass classification of respiratory audio signals. The proposed system integrates two complementary representations: a spectro-temporal encoder based on a CNN–BiLSTM-attention architecture, and a handcrafted acoustic-feature encoder capturing descriptors commonly used in respiratory-audio analysis, including MFCCs, zero-crossing rate, spectral centroid, spectral bandwidth, chroma, RMS energy, and spectral rolloff. The two branches are combined through late-stage fusion to leverage both data-driven representation learning and domain-informed acoustic cues. The model was trained and internally evaluated on the Asthma Detection Dataset Version 2, which comprises five respiratory categories: bronchial disease, asthma, COPD, healthy, and pneumonia. The pre-processing pipeline comprised mono conversion, resampling to 16 kHz, 100–2000 Hz band-pass filtering, amplitude normalisation, fixed 4 s trimming or zero-padding, training-only augmentation, handcrafted-feature extraction, mel-spectrogram generation, quality-control auditing, and stratified recording-level partitioning. Across five repeated experiments with different random seeds, the proposed hybrid model achieved a mean held-out recording-level test accuracy of 0.9099±0.0163, balanced accuracy of 0.8936±0.0152, macro F1-score of 0.8937±0.0177, macro ROC–AUC of 0.9867±0.0010, and macro PR–AUC of 0.9489±0.0044.
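The waveform-level steps named above (mono conversion, amplitude normalisation, and fixed 4 s trimming or zero-padding at 16 kHz) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the channel-averaging downmix, the peak-based normalisation, and the choice to trim from the end are all assumptions made for illustration. Resampling and the 100–2000 Hz band-pass filter are omitted, so the input is assumed to be already at 16 kHz.

```python
from typing import List, Sequence

TARGET_SR = 16_000                       # sampling rate stated in the abstract (Hz)
CLIP_SECONDS = 4                         # fixed clip length stated in the abstract (s)
CLIP_SAMPLES = TARGET_SR * CLIP_SECONDS  # 64,000 samples per clip


def to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length channels sample by sample (assumed downmix rule)."""
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def peak_normalise(signal: Sequence[float]) -> List[float]:
    """Scale so the largest absolute sample is 1.0 (assumed normalisation rule).

    An all-zero (silent) signal is returned unchanged to avoid division by zero.
    """
    peak = max((abs(x) for x in signal), default=0.0)
    if peak == 0.0:
        return list(signal)
    return [x / peak for x in signal]


def fix_length(signal: Sequence[float], n_samples: int = CLIP_SAMPLES) -> List[float]:
    """Keep the first n_samples, or right-pad with zeros up to n_samples."""
    trimmed = list(signal[:n_samples])
    return trimmed + [0.0] * (n_samples - len(trimmed))


def preprocess(channels: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical helper chaining the three steps in the order the abstract lists them."""
    return fix_length(peak_normalise(to_mono(channels)))
```

Normalising before padding keeps the scale factor dependent only on the recorded audio. Padding with zeros afterwards does not change the peak.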
Comparisons with conventional machine learning baselines showed that the proposed model achieved higher internal accuracy, balanced accuracy, macro recall, macro F1-score, and macro ROC–AUC than classical algorithms trained on handcrafted acoustic features, although Random Forest remained competitive in macro PR–AUC. Ablation analysis showed that the deep spectro-temporal branch was the primary contributor to predictive performance, while the handcrafted branch provided complementary, interpretable acoustic information rather than consistently improving all classification metrics. Explainability was incorporated using Grad-CAM and Integrated Gradients for spectrogram-based interpretation and SHAP for handcrafted-feature attribution. Domain-shift evaluation on the ICBHI Respiratory Sound Database and a COPD-focused cohort revealed substantial dataset-shift effects, including poor healthy-case recognition on ICBHI and seed-dependent COPD recognition in the COPD-focused cohort. Identifier-aware sensitivity analyses yielded lower performance than the main recording-level split, suggesting that subject-like or source-level overlap may inflate internal performance estimates. The findings should be interpreted as promising internal held-out recording-level algorithmic performance with limited external transfer, rather than as evidence of readiness for clinical use.
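The concern about subject-like or source-level overlap can be made concrete with a small sketch of a group-aware split, in which all recordings that share an identifier are placed on the same side of the partition. The abstract does not describe how the identifier-aware analyses were constructed, so this is an illustrative assumption, not the authors' procedure. The function name `group_aware_split` and the identifier strings in the usage note are hypothetical, and unlike the stratified split in the paper this sketch does not balance class labels.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def group_aware_split(
    group_ids: Sequence[Hashable],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) such that no group appears on both sides.

    group_ids[i] is the subject-like or source-level identifier of recording i.
    """
    # Collect the recording indices that belong to each identifier.
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for index, group in enumerate(group_ids):
        by_group[group].append(index)

    # Sort first so the shuffle is reproducible for a given seed.
    groups = sorted(by_group, key=str)
    random.Random(seed).shuffle(groups)

    target = test_fraction * len(group_ids)
    train: List[int] = []
    test: List[int] = []
    for group in groups:
        # Assign whole groups to the test side until it reaches the target size.
        side = test if len(test) < target else train
        side.extend(by_group[group])
    return sorted(train), sorted(test)
```

For example, with `ids = ["p1", "p1", "p2", "p3", "p3"]`, calling `group_aware_split(ids, test_fraction=0.4, seed=0)` returns two index lists whose identifier sets are disjoint. A plain recording-level split could instead place the two `"p1"` recordings on opposite sides.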
