DOI: 10.3390/jcm15135149 ISSN: 2077-0383

Automated YOLO-Based Cephalometric Landmark Detection for ANB-Based Skeletal Classification: A Retrospective Single-Centre Study

Jacek Kotula, Marcin Konarzewski, Jakub Polkowski, Krzysztof Kotula, Joanna Lis, Rafal Porowski, Anna Ewa Kuc, Beata Kawala, Michal Sarul

Background/Objectives: Automated cephalometric landmark detection using deep learning has the potential to streamline routine orthodontic diagnosis. However, the clinical relevance of artificial intelligence (AI) localisation accuracy depends on how detection errors propagate into derived angular measurements and skeletal classifications. We retrospectively evaluated 14 YOLO-based model configurations and quantified the agreement between AI-derived and expert-derived ANB-based skeletal classifications. Methods: Twelve working YOLO-based models (YOLOv5xu, YOLOv11 nano/small/medium/large variants) were trained on a single-centre dataset of 120 lateral cephalograms and evaluated on an independent test set of 11 cephalograms (stratified across skeletal Classes I, II, III). The four ANB-defining landmarks (Sella, Nasion, A-point, B-point) were the focus of the analysis. Each test cephalogram had been annotated by four orthodontists (44 measurements per image), yielding the expert reference. We assessed the effects of architecture, bounding-box size (40/100/150 px), training dataset scale (235–4255 images) and training epochs on localisation accuracy (mean radial error, MRE; successful detection rate, SDR) and on the downstream ANB-based skeletal classification. Diagnostic concordance was quantified by classification agreement, Cohen’s κ with bootstrap 95% confidence intervals (10,000 iterations), an exact one-sided binomial test for discordance, and Wilson exact CIs per class. Results: The best-performing model (Model 2; YOLOv11l, 40 × 40 px bounding box, 1175 training images) achieved an MRE of 3.10 ± 1.00 mm and an SDR@4 mm of 87.2% for S, N, A, and B. ANB-based skeletal classification demonstrated 96.9% concordance with expert assessments (95% bootstrap CI: 93.8–99.2%; Cohen’s κ = 0.946 [95% CI 0.89–0.99]; exact binomial test against a 90% concordance threshold, p = 0.003). Per-class concordance was Class I 95.8% (23/24), Class II 94.9% (56/59), and Class III 100% (47/47). Three of four discordant cases clustered near the Class I/II diagnostic threshold (expert ANB ≈ 4.5°). Bounding-box size dominated localisation accuracy, with a 3.5-fold increase in MRE from 40 × 40 to 150 × 150 px configurations and SDR@4 mm collapsing from 82.8% to 0%. Conclusions: Within the constraints of a retrospective single-centre design with a small (n = 11) independent test set, YOLO-based AI landmark detection demonstrated promising diagnostic concordance with expert consensus for ANB-based skeletal classification. These findings warrant prospective, multi-centre external validation before clinical deployment and support a confidence-aware workflow in which AI predictions for borderline ANB values undergo mandatory clinician verification. Bounding-box calibration emerged as the single most impactful preprocessing decision.
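
To make the link between landmark localisation and the downstream classification concrete, the following is a minimal Python sketch of how ANB, MRE and SDR might be computed from landmark coordinates. It is an illustration under stated assumptions, not the authors' pipeline: ANB is taken as SNA minus SNB (the conventional definition), the class cut-offs `class_iii_below` and `class_ii_above` are hypothetical defaults because the abstract does not state the thresholds used, and `mm_per_px` is an assumed pixel-spacing calibration.

```python
import math


def angle_at(vertex, p, q):
    """Unsigned angle in degrees at `vertex` between rays vertex->p and vertex->q."""
    v1 = (p[0] - vertex[0], p[1] - vertex[1])
    v2 = (q[0] - vertex[0], q[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return math.degrees(math.atan2(abs(cross), dot))


def anb_angle(s, n, a, b):
    """ANB as SNA - SNB, both measured at Nasion (conventional definition)."""
    return angle_at(n, s, a) - angle_at(n, s, b)


def skeletal_class(anb, class_iii_below=0.0, class_ii_above=4.0):
    """Map an ANB value to a skeletal class.

    The cut-offs are illustrative assumptions, not values taken from the study.
    """
    if anb < class_iii_below:
        return "III"
    if anb > class_ii_above:
        return "II"
    return "I"


def radial_errors_mm(predicted, reference, mm_per_px):
    """Euclidean distance between paired landmarks, converted from pixels to mm."""
    return [math.dist(p, r) * mm_per_px for p, r in zip(predicted, reference)]


def mre_and_sdr(errors_mm, threshold_mm=4.0):
    """Mean radial error and the fraction of errors within `threshold_mm`."""
    mre = sum(errors_mm) / len(errors_mm)
    sdr = sum(e <= threshold_mm for e in errors_mm) / len(errors_mm)
    return mre, sdr
```

Because ANB is a difference of two angles that share the S-N line, a displacement of a few millimetres in A-point or B-point can shift the result across a class boundary, which is one way to read the clustering of discordant cases near the Class I/II threshold.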

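The concordance statistics named in the abstract can be sketched in a few lines of standard-library Python. This is a hedged illustration rather than the authors' analysis code: the per-class counts (23/24, 56/59, 47/47) are taken from the abstract, the function names are hypothetical, the Wilson score interval uses the usual closed form, and the one-sided test is written as the upper binomial tail against the 90% concordance threshold. Cohen's κ needs the full confusion matrix, which the abstract does not report, so `cohen_kappa` is defined but not applied to study data here.

```python
from math import comb, sqrt


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z defaults to ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def binomial_upper_tail(successes, n, p0):
    """Exact P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: rater 1, cols: rater 2)."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Per-class (concordant, total) counts as reported in the abstract.
per_class = {"I": (23, 24), "II": (56, 59), "III": (47, 47)}
agree = sum(k for k, _ in per_class.values())
total = sum(n for _, n in per_class.values())

concordance = agree / total
p_value = binomial_upper_tail(agree, total, 0.90)
intervals = {name: wilson_interval(k, n) for name, (k, n) in per_class.items()}
```

The per-class denominators sum to 130 classifications with 126 in agreement, which reproduces the reported 96.9% concordance, and the upper-tail probability against a 90% threshold comes out in line with the reported p = 0.003 after rounding.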