AI06 Can large language models replicate expert consensus in dermatology research prioritization?

doi:10.1093/bjd/ljag086.254

ISSN: 0007-0963

Luke Carson, Ravi Ramessur, Neil Rajan, Emanuele Trucco, Rubeta N Matin

Abstract

Expert consensus is essential for directing clinical research funding towards unmet patient needs. The Delphi method is widely used but requires multiple survey rounds and considerable clinician time. The British Association of Dermatologists and the American Academy of Dermatology undertook an eDelphi exercise to prioritize artificial intelligence (AI) research questions, which required over 100 h of expert time across 3 years. We aimed to evaluate whether current large language models (LLMs) can accurately replicate the expert consensus reached in this formal international eDelphi exercise on AI research priorities in dermatology.

We compared outputs from six leading LLMs against the unpublished eDelphi results, which served as the reference standard. The eDelphi included 110 research questions curated from 429 submissions by 101 dermatology clinicians, nurses and primary care physicians. The LLMs, including ChatGPT-4o, Gemini 2.5 Flash, DeepSeek DeepThink R1 and GPT-5.1 Auto, were each given the complete question list with a standardized prompt designed to simulate the expert panel process. The LLMs rated the questions on the same 0–5 Likert scale and generated top-10 priority lists. Performance was assessed using Spearman rank correlation and top-10 overlap analysis.

All LLMs completed the simulation. The best-performing model, Gemini 2.5 Flash, showed a moderate-to-strong correlation with expert consensus (Spearman's ρ = 0.61, P < 0.001) and identified five of the experts' top-10 priority questions in the best of three runs. Notable inter-run variability was observed despite identical prompts. The LLMs scored questions on precise clinical pathways and patient engagement lower than the experts did, suggesting that they do not fully capture dermatologists' emphasis on patient partnership and clinical context. LLMs show promise for integration into hybrid consensus models, where they could reduce expert time requirements by replacing the second Delphi round while clinicians retain oversight.
Our data suggest that LLMs could accelerate research prioritization and guideline development, enabling faster translation of research into patient benefit across dermatology.
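
The two performance measures named in the abstract, Spearman rank correlation and top-10 overlap, can be illustrated with a short Python sketch. It is not the authors' analysis code and assumes a data layout the abstract does not describe: one expert consensus score and one LLM score per question (for example, on the 0–5 Likert scale), plus ordered lists of question IDs for the priority rankings. Ties, which are frequent with Likert ratings, receive average ranks. The function names (`average_ranks`, `spearman_rho`, `top_k_overlap`, `best_run`) and all data are hypothetical. The sketch does not compute a P value; where SciPy is available, `scipy.stats.spearmanr` returns both ρ and a P value.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when every score in a sequence is identical")
    return cov / (sx * sy)


def top_k_overlap(expert_top, llm_top, k=10):
    """Count question IDs shared by the first k entries of two priority lists."""
    return len(set(expert_top[:k]) & set(llm_top[:k]))


def best_run(expert_scores, runs):
    """Return (run index, rho) for the repeated run that best matches the experts."""
    rhos = [spearman_rho(expert_scores, run) for run in runs]
    best = max(range(len(rhos)), key=lambda r: rhos[r])
    return best, rhos[best]


if __name__ == "__main__":
    # Hypothetical toy data for six questions; NOT values from the study.
    expert_scores = [4.6, 3.9, 4.1, 2.8, 3.2, 4.4]
    llm_runs = [
        [5, 4, 4, 3, 2, 4],
        [4, 4, 5, 2, 3, 3],
        [3, 5, 4, 4, 2, 5],
    ]
    for r, run in enumerate(llm_runs):
        print(f"run {r}: rho = {spearman_rho(expert_scores, run):.2f}")
    print("best run:", best_run(expert_scores, llm_runs))
    print("top-3 overlap:", top_k_overlap(["Q1", "Q6", "Q3"], ["Q6", "Q2", "Q1"], k=3))
```

Printing ρ for every run, rather than only the best one, makes inter-run variability of the kind reported above visible; selecting the best of several runs gives an optimistic estimate of agreement.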
