DOI: 10.3390/s26134192 ISSN: 1424-8220

Adapting a Foundation Monocular Depth Model for Soccer Video: From Synthetic Supervision to Match-Level Reliability

Ju-Seong Do, Ho-Young Jung

Soccer-video analysis centers on pitch-plane tracking, but camera-view depth cues such as occlusion and goal-area structure are not fully represented on the field plane. Synthetic benchmarks provide dense supervision that is unavailable for real broadcasts, but it remains unclear whether adaptation yields predictions that are reproducible across matches and operationally feasible. We evaluate a Depth Anything V2 (DAv2) model adapted to SoccerNet-Depth along four components: unaligned monocular depth estimation (MDE) accuracy, a scale-and-shift-aligned diagnostic, match-to-match reliability, and the accuracy–cost trade-off. The adapted model achieves an unaligned validation absolute relative error (AbsRel) of 0.00372. The aligned diagnostic shows that Base DAv2 retained substantial scene-depth structure, whereas SoccerNet adaptation enabled direct compatibility with the normalized target without per-frame ground-truth fitting. Relative to the VKITTI-fine-tuned reference, the adaptation improved all eight metrics in all 21 validation matches, with paired Wilcoxon tests significant after Bonferroni correction. On the challenge split, it reduced AbsRel by 34.1% versus the official baseline. The higher-resolution configuration improved the validation AbsRel by 5.9%, while the default configuration retained a better accuracy–cost balance. At 401.57 ms per frame, the default configuration is suited to post-match analysis, not live or near-real-time use. The study contributes a benchmark-scoped adaptation case study and an evaluation protocol for foundation MDE models on SoccerNet-Depth.
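The difference between the unaligned metric and the scale-and-shift aligned diagnostic can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes AbsRel is the mean of |prediction - target| / target over pixels with a positive target, and that alignment is an ordinary least-squares fit of one scale and one shift per frame against the ground truth. The function names `abs_rel`, `fit_scale_shift`, and `aligned_abs_rel` are hypothetical, and whether the paper fits in depth or disparity space is not stated in the abstract.

```python
from typing import List, Sequence, Tuple


def abs_rel(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute relative error over pixels with a positive target (assumed validity mask)."""
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    if not pairs:
        raise ValueError("no valid pixels")
    return sum(abs(p - t) / t for p, t in pairs) / len(pairs)


def fit_scale_shift(pred: Sequence[float], target: Sequence[float]) -> Tuple[float, float]:
    """Closed-form least squares for (s, b) minimizing sum((s * p + b - t) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        # Constant prediction: the best affine fit is the target mean.
        return 0.0, mean_t
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p
    shift = mean_t - scale * mean_p
    return scale, shift


def aligned_abs_rel(pred: Sequence[float], target: Sequence[float]) -> float:
    """AbsRel after fitting scale and shift to the ground truth (diagnostic only)."""
    scale, shift = fit_scale_shift(pred, target)
    aligned: List[float] = [scale * p + shift for p in pred]
    return abs_rel(aligned, target)


# Toy frame: the prediction has the right structure but the wrong scale and offset.
target = [0.2, 0.4, 0.6, 0.8]
pred = [3.0 * t + 1.0 for t in target]

unaligned = abs_rel(pred, target)        # large: the output range does not match the target
aligned = aligned_abs_rel(pred, target)  # near zero: the structure was correct all along
```

In this toy case the aligned error is close to zero while the unaligned error is large, which mirrors the abstract's distinction: a model can retain scene-depth structure yet still need per-frame ground-truth fitting to match the normalized target. Because the fit uses the ground truth, the aligned score serves as a diagnostic and not as a deployable accuracy figure.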
