Automated detection and annotation of toothed-whale whistles using transformer-based instance segmentation
Xixin Zhang, Xiaobai Liu, Michaela N. Alksne, Marie A. Roch

Accurate detection and fine-scale annotation of dolphin whistles are crucial for understanding marine mammal communication and population dynamics. Dolphin-whistle annotation is challenging due to highly variable signals, overlapping calls and echolocation clicks, attenuation, and noisy backgrounds. Most existing methods treat the task as a spectrogram peak-tracking problem, linking neighboring detected peaks using heuristic or statistical methods, and rely on manual feature engineering. While effective for long, clear whistles, they generalize poorly to short, weak whistles and across species. We reformulated dolphin-whistle detection as an instance-segmentation task, introducing an end-to-end transformer model that predicted complete whistle contours directly from spectrograms, eliminating the peak-detection and trajectory-reconstruction stages. To overcome the limitations of manual labeling, we integrated a human-in-the-loop training paradigm that iteratively refined annotations, improving both data quality and model performance. We demonstrated the effectiveness of this architecture on a subset of the Detection, Classification, Localization, and Density Estimation (DCLDE) 2011 corpus, partitioning the data so that the test set came from species, locations, and hydrophones that were excluded from the training set. Experiments showed that our end-to-end system generalized effectively under these conditions, achieving 89.99% precision and 80.65% recall for all whistles, and 85.81% precision with 88.44% recall for whistles longer than 150 ms.
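To make the instance-segmentation framing concrete, the following is a minimal sketch of how one predicted instance mask over a spectrogram might be reduced to a whistle contour and measured against the 150 ms duration threshold. It is illustrative only and not the authors' implementation: the boolean frequency-by-time mask layout, the function names `mask_to_contour` and `duration_ms`, the 2 ms frame hop, the 125 Hz bin spacing, and the choice of the mean active frequency per frame are all assumptions made here.

```python
import numpy as np


def mask_to_contour(mask, times, freqs):
    """Reduce one instance mask (freq bins x time frames) to a contour.

    Assumption: the contour frequency in each frame is the mean frequency
    of the bins the mask marks as active in that frame.
    """
    mask = np.asarray(mask, dtype=bool)
    cols = np.flatnonzero(mask.any(axis=0))  # frames touched by the mask
    contour_f = np.array([freqs[mask[:, c]].mean() for c in cols])
    return times[cols], contour_f


def duration_ms(contour_times):
    """Duration of a contour in milliseconds (0 for an empty contour)."""
    if len(contour_times) == 0:
        return 0.0
    return 1000.0 * (contour_times[-1] - contour_times[0])


# Synthetic example: 100 frames at a hypothetical 2 ms hop,
# 50 frequency bins spaced 125 Hz apart starting at 5 kHz.
times = np.arange(100) * 0.002
freqs = 5000.0 + 125.0 * np.arange(50)

# One slowly rising "whistle" that is two bins thick, frames 10..94.
mask = np.zeros((50, 100), dtype=bool)
for t in range(10, 95):
    f_idx = 10 + (t - 10) // 5
    mask[f_idx:f_idx + 2, t] = True

contour_t, contour_f = mask_to_contour(mask, times, freqs)
is_long = duration_ms(contour_t) > 150.0  # threshold taken from the abstract
```

In this synthetic case the contour spans 85 frames, so it falls into the "longer than 150 ms" group. A real pipeline would apply the same reduction to each mask the model predicts, and could use a different per-frame summary (for example, the bin with the highest spectrogram energy inside the mask).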