Explainable Multi-Modal Deep Learning for Recording-Level Classification of Respiratory Audio Signals Under Internal and Domain-Shift Evaluation

doi:10.3390/life16071108

ISSN: 2075-1729

S M Asiful Islam Saky, Md Saiful Arefin, Md Rashidul Islam, Mohammad Saiful Islam, Rashadul Islam Sumon, Md Mostafizur Rahman Masud, Maria Lapina, Mikhail Babenko, Mohammed Muthanna

Respiratory diseases are a major global health challenge, yet their identification is often limited by subjectivity, environmental noise, and inter-clinician variability. This study presents an explainable multimodal deep learning framework for recording-level multiclass classification of respiratory audio signals. The proposed system integrates two complementary representations: a spectro-temporal encoder based on a CNN–BiLSTM-attention architecture and a handcrafted acoustic-feature encoder capturing descriptors commonly used in respiratory-audio analysis, including MFCCs, zero-crossing rate, spectral centroid, spectral bandwidth, chroma, RMS energy, and spectral rolloff. These branches are combined through late-stage fusion to leverage both data-driven representation learning and domain-informed acoustic cues. The model was trained and internally evaluated on the Asthma Detection Dataset Version 2, which comprises five respiratory categories: bronchial disease, asthma, COPD, healthy, and pneumonia. Pre-processing comprised mono conversion, resampling to 16 kHz, 100–2000 Hz band-pass filtering, amplitude normalisation, fixed 4 s trimming or zero-padding, training-only augmentation, handcrafted-feature extraction, mel-spectrogram generation, quality-control auditing, and stratified recording-level partitioning. Across five repeated experiments with different random seeds, the proposed hybrid model achieved a mean held-out recording-level test accuracy of 0.9099 ± 0.0163, balanced accuracy of 0.8936 ± 0.0152, macro F1-score of 0.8937 ± 0.0177, macro ROC–AUC of 0.9867 ± 0.0010, and macro PR–AUC of 0.9489 ± 0.0044.
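For readers unfamiliar with the macro-averaged metrics reported above, the following minimal sketch shows how balanced accuracy (the mean of per-class recall) and macro F1-score (the unweighted mean of per-class F1) are commonly computed from recording-level predictions. The function names and the toy labels are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    return tp, fp, fn


def balanced_accuracy(y_true, y_pred, labels):
    """Mean of per-class recall over the classes present in y_true."""
    tp, _, fn = per_class_counts(y_true, y_pred)
    recalls = [tp[c] / (tp[c] + fn[c]) for c in labels if tp[c] + fn[c] > 0]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three of the five categories (hypothetical data).
labels = ["asthma", "COPD", "healthy"]
y_true = ["asthma", "asthma", "COPD", "COPD", "healthy", "healthy"]
y_pred = ["asthma", "COPD", "COPD", "COPD", "healthy", "asthma"]
bal_acc = balanced_accuracy(y_true, y_pred, labels)
f1 = macro_f1(y_true, y_pred, labels)
```

Because both metrics weight every class equally, they are less flattering than plain accuracy when a minority class is poorly recognised, which is consistent with the reported balanced accuracy sitting below the reported accuracy.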
Comparisons with conventional machine learning baselines showed that the proposed model achieved higher internal accuracy, balanced accuracy, macro recall, macro F1-score, and macro ROC–AUC than classical algorithms trained on handcrafted acoustic features, although Random Forest remained competitive in macro PR–AUC. Ablation analysis showed that the deep spectro-temporal branch was the primary contributor to predictive performance, while the handcrafted branch provided complementary, interpretable acoustic information rather than consistently improving all classification metrics. Explainability was incorporated using Grad-CAM and Integrated Gradients for spectrogram-based interpretation and SHAP for handcrafted-feature attribution. Domain-shift evaluation on the ICBHI Respiratory Sound Database and a COPD-focused cohort revealed substantial dataset-shift effects, including poor healthy-case recognition on ICBHI and seed-dependent COPD recognition in the COPD-focused cohort. Identifier-aware sensitivity analyses yielded lower performance than the main recording-level split, suggesting that subject-like or source-level overlap may inflate internal performance estimates. The findings should be interpreted as promising internal held-out recording-level algorithmic performance with limited external transfer, rather than as evidence of readiness for clinical use.
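The contrast between a recording-level split and an identifier-aware split can be illustrated with a short sketch. It assumes each recording carries a hypothetical `group_id` (for example a subject-like or source-level identifier) and assigns whole groups to either the training or the test partition, so that no identifier appears on both sides. This is a generic illustration of the idea, not the authors' procedure, and unlike the main split it does not stratify by class.

```python
import random
from collections import defaultdict


def group_aware_split(recordings, test_fraction=0.2, seed=0):
    """Split (recording_id, group_id, label) tuples so groups never straddle partitions."""
    by_group = defaultdict(list)
    for rec in recordings:
        by_group[rec[1]].append(rec)

    # Shuffle group identifiers reproducibly, then hold out whole groups.
    groups = sorted(by_group)
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))

    test = [rec for g in groups[:n_test] for rec in by_group[g]]
    train = [rec for g in groups[n_test:] for rec in by_group[g]]
    return train, test


# Hypothetical recordings: two per group, five groups.
recordings = [
    (f"rec{i}", f"subj{i // 2}", "asthma" if i % 2 else "healthy")
    for i in range(10)
]
train, test = group_aware_split(recordings, test_fraction=0.2, seed=42)
train_groups = {rec[1] for rec in train}
test_groups = {rec[1] for rec in test}
```

With a plain recording-level split, two recordings from the same source could land in different partitions, which is one way the overlap described above can arise.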
