P002 A systematic review comparing artificial intelligence models’ ability to diagnose dermatological conditions
Clark Keith Ejercito Chavez, Victor Turcanu
Abstract
Artificial intelligence (AI) models, especially large language models (LLMs) such as ChatGPT and Claude Opus, are increasingly being explored for their diagnostic capabilities in dermatology. However, how the accuracy and reliability of these AI systems in diagnosing dermatological conditions compare with those of clinicians remains unclear. This systematic review evaluates and compares the diagnostic performance of AI models and clinicians in dermatological diagnosis, focusing on accuracy, differential diagnosis and the ability to distinguish malignant from benign conditions. A systematic search was conducted in PubMed using keywords related to ChatGPT, Claude Opus, LLMs and dermatological diagnosis. Studies were included if they evaluated the diagnostic performance of AI using dermatological images or clinical data, and provided a comparison with other AI models or clinicians. Opinion articles, systematic reviews and nondiagnostic studies were excluded. A PICO framework was used to guide the review. Thirteen studies met the inclusion criteria. A thematic analysis across the studies found that ChatGPT had a primary diagnostic accuracy of 38–72% across different conditions. ChatGPT-4 outperformed ChatGPT-3.5 across all studies. Claude 3 Opus had comparable performance to ChatGPT. Google Lens performed worse overall than ChatGPT. SkinGPT-4 was particularly helpful in providing suggestions for investigations; however, it struggled with skin of colour. Clinicians consistently outperformed all AI models, although in some studies ChatGPT-4 scored close to them. AI models show promise as adjunct tools in dermatological diagnostics, but remain inferior to clinicians in terms of diagnostic accuracy and reliability. Several requirements may need to be met if AI systems are to be fully implemented in future clinical practice.
These include collaborative development between AI researchers and dermatologists, for example through platforms on which clinicians can provide representative datasets directly to LLM developers; streamlined bias reporting systems; and randomized controlled trials of AI models to further improve validity and reliability.