DOI: 10.3390/nu18122017 ISSN: 2072-6643

Prompt Engineering and Model Selection for LLM-Based Nutritional Estimation from Food Images: A Multi-Dataset Investigation

Shinichi Nakagawa, Akira Yamamoto

Background/Objectives: Accurate estimation of nutritional content from food images has important applications in dietary assessment and public health surveillance. While large language models (LLMs) have shown promise for this task, the effects of prompt design and model selection on estimation accuracy remain poorly characterized. Methods: We evaluated three Claude models (Haiku 4.5, Sonnet 4.6, Opus 4.6) for visual estimation of five mandatory nutritional components (energy, protein, fat, carbohydrate, and salt equivalent) across three datasets: NutriImage (691 Japanese meal photographs with dietitian-validated ground truth, after OCR-mask quality filtering), SNAPMe (1463 US meal photographs from a publicly available benchmark), and the Japan Branded Food Database (JBFD; 989–1000 packaged food product images). We systematically compared a default prompt and a visual estimation prompt explicitly instructing the model not to read any text or numbers visible in the image. Results: The visual estimation prompt substantially improved accuracy when paired with a sufficiently capable model (energy R²: 0.23 for Haiku to 0.60 for Sonnet, JBFD). Sonnet and Opus substantially outperformed Haiku across all datasets, while differences between Sonnet and Opus were small (MedAPE difference 1–3 percentage points). Packaged food images (JBFD) yielded higher R² than meal photographs. Salt equivalent showed consistently poor accuracy (MedAPE 34–64%). On SNAPMe, Sonnet achieved lower energy MAE (116.9 vs. 123.0 kcal, −4.9%) and lower MAE for protein (5.9 vs. 7.9 g, −25.7%) and fat (6.6 vs. 8.7 g, −24.5%) compared with a recent ChatGPT-5 study. Conclusions: Claude Sonnet offers the best cost-performance balance for LLM-based nutritional estimation. Prompt design substantially affects accuracy, but only when paired with a sufficiently capable model; model visual recognition capability appears to be a key determinant of performance. These findings highlight the inherent difficulty of this task and provide practical guidance for dietary assessment system development.
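
The abstract reports three accuracy metrics (MAE, MedAPE, and R²) without stating their formulas. The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. The function names and the toy energy values are illustrative assumptions, and R² is assumed here to be the coefficient of determination (the study may instead use a squared correlation).

```python
from statistics import fmean, median

def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the inputs (e.g. kcal)."""
    return fmean(abs(p - t) for t, p in zip(y_true, y_pred))

def medape(y_true, y_pred):
    """Median absolute percentage error, in percent.

    Items with a true value of zero are skipped because the percentage
    error is undefined for them (an assumption of this sketch).
    """
    return median(100 * abs(p - t) / abs(t)
                  for t, p in zip(y_true, y_pred) if t != 0)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up energy values (kcal) for four meals; not data from the study.
true_kcal = [500, 300, 800, 650]
pred_kcal = [450, 360, 800, 780]

print(f"MAE    = {mae(true_kcal, pred_kcal):.1f} kcal")
print(f"MedAPE = {medape(true_kcal, pred_kcal):.1f} %")
print(f"R^2    = {r2(true_kcal, pred_kcal):.3f}")
```

Because MedAPE is a median of relative errors, it is less sensitive to a few badly misjudged items than MAE, which may be why the abstract uses it for nutrient-level comparisons such as salt equivalent.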
