DOI: 10.3390/app16126187 ISSN: 2076-3417

Respiratory Rate Estimation from Audio Using Object Detection with Learnable Spectrograms

Bernhards Bertulis, Jevgenijs Telicko, Andris Jakovics

Sound event detection models commonly rely on spectrogram representations of audio signals, and recent approaches have adapted image-based object detection architectures to the acoustic domain. This paradigm is well suited to respiratory monitoring, where breathing events remain visually distinguishable in the spectrogram even under noisy conditions. In this study, we propose a Representation Enhancement for Neural Imaging (RENI) framework that combines a modified You Only Look Once (YOLO) object detection head with a trainable spectrogram front-end implemented using nnAudio. The front-end performs GPU-accelerated waveform-to-spectrogram conversion while allowing the Short-Time Fourier Transform (STFT) and Mel-scale (Mel) basis functions to be learned adaptively. The model was trained for breathing-phase localization and respiratory rate estimation on 44.1 kHz audio recordings acquired during exercise. The results show that the trainable Mel representation improves respiratory rate accuracy compared with the static and trainable STFT configurations, achieving a mean absolute error (MAE) of 1.15 breaths per minute. Bootstrap 95% confidence intervals and one-sided permutation tests show statistically significant gains for selected trainable STFT and Mel configurations under the min-MAE confidence thresholding protocol, while pooled effects remain directionally favorable for the trainable Mel front-end. These findings demonstrate improved exhale-based respiratory rate estimation under the studied conditions; broader external validation is still required.
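The evaluation chain summarized in the abstract (exhale detections, respiratory rate, MAE, bootstrap confidence interval, one-sided permutation test) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how the rate is derived from detections or how the resampling was set up, so the following are all assumptions made for illustration: computing the rate from the mean interval between exhale onsets, using a percentile bootstrap, and using a paired sign-flip permutation test. All function names and parameters are hypothetical.

```python
import random
from statistics import mean


def respiratory_rate_bpm(exhale_onsets_s):
    """Breaths per minute from the mean interval between exhale onsets (seconds).

    Assumption: one detected exhale per breath cycle.
    """
    if len(exhale_onsets_s) < 2:
        raise ValueError("need at least two exhale onsets")
    onsets = sorted(exhale_onsets_s)
    intervals = [b - a for a, b in zip(onsets, onsets[1:])]
    return 60.0 / mean(intervals)


def mean_absolute_error(predicted, reference):
    """MAE between predicted and reference rates (same units, same length)."""
    return mean(abs(p - r) for p, r in zip(predicted, reference))


def bootstrap_ci(errors, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-recording errors."""
    rng = random.Random(seed)
    n = len(errors)
    # Resample with replacement and collect the resampled means.
    means = sorted(mean(rng.choices(errors, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def paired_permutation_p(err_baseline, err_candidate, n_perm=5000, seed=0):
    """One-sided paired sign-flip test.

    Alternative hypothesis: the candidate has a lower mean absolute error
    than the baseline on the same recordings.
    """
    rng = random.Random(seed)
    diffs = [b - c for b, c in zip(err_baseline, err_candidate)]
    observed = mean(diffs)
    count = 0
    for _ in range(n_perm):
        # Under the null, the sign of each paired difference is exchangeable.
        permuted = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if permuted >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)
```

For example, exhale onsets at 0, 3, 6 and 9 s give a mean interval of 3 s and therefore a rate of 20 breaths per minute. Fixing the random seed keeps the resampling reproducible between runs.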
