DOI: 10.3390/diagnostics16121931 ISSN: 2075-4418

Optimizing a Multimodal Large Language Model for Ultrasound-Based Thyroid Nodule Malignancy Classification: A Comparative Study of Few-Shot Learning, Prompt Engineering, and Fine-Tuning

Yu-Hsuan Li, Yu-Cheng Cheng, Chih-Yun Chang, I-Te Lee

Objectives: Multimodal large language models (MLLMs) have shown potential for medical image classification. We evaluated four optimization strategies in two MLLMs, GPT-4o (gpt-4o-2024-08-06) and Gemini 2.5 Flash-Lite, for ultrasound-based thyroid nodule malignancy classification using two public datasets and a clinical cohort of nodules with atypia of undetermined significance (AUS) cytology.
Methods: Text prompting, few-shot learning, fine-tuning, and a hybrid strategy combining fine-tuning with few-shot learning were evaluated for each model. Performance was assessed using the Digital Database of Thyroid Images (DDTI; n = 80), a 1000-image test subset of TN5000, and an institutional AUS cohort with surgical pathology (n = 84). In the AUS cohort, the best-performing strategy was compared with the consensus classification of three endocrinologists and the American Thyroid Association (ATA) ultrasound risk stratification.
Results: For GPT-4o, the hybrid strategy achieved the highest area under the receiver operating characteristic curve (AUC) in DDTI (0.866), TN5000 (0.689), and the AUS cohort (0.836). In the AUS cohort, its specificity was higher than that of endocrinologist consensus and ATA risk stratification when only high-suspicion nodules were classified as malignant (95.1% vs. 70.7% and 70.7%; p = 0.002 and p = 0.001, respectively), while sensitivity did not differ significantly (72.1% vs. 74.4% and 79.1%, respectively; both p > 0.05). However, the hybrid model misclassified 12 of 43 malignant nodules, corresponding to a false-negative rate of 27.9%. When high- and intermediate-suspicion ATA categories were classified as malignant, ATA sensitivity increased to 83.7% and specificity decreased to 56.1%; the hybrid model had a higher AUC than ATA risk stratification (0.836 vs. 0.749; p = 0.017). For Gemini 2.5 Flash-Lite, few-shot learning, fine-tuning, and the hybrid strategy did not improve AUC relative to text prompting in any dataset.
Conclusions: The hybrid strategy produced the most consistent performance gains for GPT-4o across the three datasets but did not improve Gemini 2.5 Flash-Lite. The optimized GPT-4o model achieved high specificity in the diagnostically challenging AUS cohort, although its false-negative rate limits its use as a stand-alone diagnostic tool. Further validation in larger, prospective multicenter cohorts is required before clinical use.
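
The AUS-cohort rates reported for the hybrid GPT-4o model can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the study: the function name `diagnostic_rates` is hypothetical, the benign count of 41 assumes every one of the 84 nodules was either malignant or benign on surgical pathology, and the true-negative count of 39 is back-calculated from the reported 95.1% specificity rather than stated in the abstract.

```python
# Illustrative check of the reported AUS-cohort rates for the hybrid model.
# Only the totals (84 nodules, 43 malignant, 12 missed) come from the abstract;
# the benign and true-negative counts are assumptions, labeled below.

def diagnostic_rates(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return sensitivity, specificity, and false-negative rate as fractions."""
    malignant = tp + fn
    benign = tn + fp
    return {
        "sensitivity": tp / malignant,
        "specificity": tn / benign,
        "false_negative_rate": fn / malignant,
    }

TOTAL, MALIGNANT, MISSED = 84, 43, 12  # reported in the abstract
BENIGN = TOTAL - MALIGNANT             # assumption: all other nodules are benign (41)
TRUE_NEG = 39                          # assumption: back-calculated from 95.1% specificity

hybrid = diagnostic_rates(
    tp=MALIGNANT - MISSED,   # 31 malignant nodules correctly flagged
    fn=MISSED,               # 12 malignant nodules missed
    tn=TRUE_NEG,
    fp=BENIGN - TRUE_NEG,    # 2 benign nodules flagged as malignant
)

for name, value in hybrid.items():
    print(f"{name}: {value:.1%}")
```

Under these assumptions the sketch prints a sensitivity of 72.1%, a specificity of 95.1%, and a false-negative rate of 27.9%, matching the reported figures. Sensitivity and false-negative rate sum to 100% by construction, so the 12 missed malignancies are the same finding as the 72.1% sensitivity viewed from the other side.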
