AI14 The performance of large language models on the dermatology Specialty Certificate Examination

DOI: 10.1093/bjd/ljag086.262 ISSN: 0007-0963

Mai Shehab, Wisam Alwan

Abstract

Large language models (LLMs) are increasingly being used for educational purposes. Previous studies evaluating the performance of LLMs on the dermatology Specialty Certificate Examination (SCE), excluding image-based questions, reported overall accuracies of 63% for ChatGPT 3.5 and 90% for ChatGPT 4 (Passby L, Jenko N, Wernham A. Performance of ChatGPT on Specialty Certificate Examination in Dermatology multiple-choice questions. Clin Exp Dermatol 2024; 49: 722–7). This study assesses the performance of four current-generation LLMs on sample dermatology SCE questions, inclusive of image-based questions. In total, 104 single-best-answer dermatology SCE sample questions, available on the Membership of the Royal College of Physicians website, were individually entered into ChatGPT-5.1, Gemini 3.0, Claude Sonnet 4.5 and Grok 4.1. Extended-reasoning models were selected where available to reflect exam technique. Incorrect answers were collected and thematically analysed to identify trends among models, hallucination patterns and reasoning errors. Accuracy was highest for Gemini (98.1%), followed by ChatGPT (97.1%), Claude (93.3%) and Grok (89.4%). Four questions included images: three histopathology slides and one spot-diagnosis. All models correctly answered the spot-diagnosis question; however, no model answered all three histopathology items correctly. One question on cutaneous allergy showed disagreement among all models, with only Gemini identifying the correct answer. Despite high overall performance, incorrect responses were supported by persuasive explanations. Qualitative analysis demonstrated recurring error patterns, particularly in items requiring ‘single-best-answer’ reasoning and in distinguishing subtly different clinical presentations or histopathological features. Current LLMs demonstrate strong performance on dermatology SCE-style questions, yet the confident delivery of incorrect responses highlights important educational considerations.
While LLMs can be a useful tool in moderating exam questions, they must be developed using robust, well-referenced training data. LLMs offer promise as adjunct revision tools; however, unrecognized errors may mislead learners who depend on openly accessible, noncurated models to learn dermatology.
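
The abstract reports accuracies as percentages of 104 questions but does not state the underlying counts. As a minimal sketch, the counts consistent with each percentage can be recovered by arithmetic, assuming each figure is the number of correct answers divided by 104 and rounded to one decimal place. The function name `implied_correct` and the dictionary layout are illustrative choices, and the resulting counts are inferred rather than reported by the authors.

```python
# Assumption: each reported accuracy is (correct / 104) rounded to one decimal place.
TOTAL_QUESTIONS = 104

# Percentages as reported in the abstract.
reported_accuracy = {
    "Gemini": 98.1,
    "ChatGPT": 97.1,
    "Claude": 93.3,
    "Grok": 89.4,
}


def implied_correct(pct, total):
    """Return every count k for which k/total rounds to pct at one decimal place."""
    return [k for k in range(total + 1) if round(100 * k / total, 1) == pct]


# Inferred (not reported) counts of correct answers per model.
implied = {
    model: implied_correct(pct, TOTAL_QUESTIONS)
    for model, pct in reported_accuracy.items()
}

for model, counts in implied.items():
    errors = [TOTAL_QUESTIONS - k for k in counts]
    print(f"{model}: correct={counts}, incorrect={errors}")
```

Under this assumption each percentage maps to a single count, so the number of incorrect answers available for thematic analysis per model follows directly by subtraction from 104.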
