Artificial Intelligence Performance Under Different Conditions in Answering China's Standardized Training Examination for Resident Physician in Radiology: A Comparative Analysis

DOI: 10.1002/hcs2.70085 ISSN: 2771-1749

Zheng Zhu, Yanfeng Zhao, Lin Li, Xiaoyi Wang, Yongming Zhang, Xinming Zhao

ABSTRACT

Background

The capabilities of general‐purpose large language models (LLMs) on specialized medical examinations have not been systematically compared. This study evaluated the performance differences among three LLMs (DeepSeek‐R1, ChatGPT‐o1, and Gemini‐2.0) in answering questions from China's Standardized Training Examination for Resident Physicians in Radiology, and assessed the impact of different questioning conditions (with or without answer choices, and with or without the introduction of doubt) on model accuracy.

Methods

A total of 131 questions were presented to the models all at once. Each LLM completed two tasks (questions with and without answer choices). Each task included three conditions: no doubt, weak doubt, and strong doubt, with the latter two presented as follow‐up questions after the model's initial response. Subjective evaluation was conducted using a 5‐point Likert scale.

Results

Gemini‐2.0 achieved the highest accuracy in both tasks (0.763–0.809 in Task 1 and 0.595–0.679 in Task 2). All LLMs were less accurate in Task 2 than in Task 1, but the difference was statistically significant only for DeepSeek‐R1 (p = 0.014). The introduction of doubt (weak or strong) did not significantly increase accuracy in Task 1 for any model. In Task 2, only DeepSeek‐R1 showed a slight increase in accuracy under the doubt conditions. All LLMs performed significantly better on single‐choice questions than on multiple‐choice questions (p < 0.05) and performed better on case‐based questions than on knowledge‐based questions. The subjective evaluation scores were 4.46 for DeepSeek‐R1, 3.92 for ChatGPT‐o1, and 3.83 for Gemini‐2.0. ChatGPT‐o1 negotiated against answering all questions at once and left some questions unanswered.
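The abstract does not name the statistical test behind the reported p values. As a minimal sketch only, assuming each of the 131 questions is scored as correct or incorrect and that the same questions are used in both tasks, a paired comparison of one model's accuracy could be made with an exact McNemar test. The function names and the counts below are hypothetical illustrations, not the study's data or its confirmed method.

```python
from math import comb

N_QUESTIONS = 131  # number of questions stated in the Methods


def accuracy(n_correct, n_total=N_QUESTIONS):
    """Proportion of questions answered correctly."""
    return n_correct / n_total


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p value (assumed test, not named in the abstract).

    b: questions answered correctly in the first condition only
    c: questions answered correctly in the second condition only
    Under the null hypothesis, the discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, chosen only to show the arithmetic.
print(round(accuracy(100), 3))        # 100 of 131 correct -> 0.763
print(round(mcnemar_exact(1, 9), 4))  # 1 vs. 9 discordant pairs -> 0.0215
```

Only the discordant pairs (questions answered correctly under one condition but not the other) enter the test, which is why a paired design can detect a difference that a comparison of the two overall accuracies alone would not.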

Conclusions

LLMs exhibit variable proficiency in tackling radiology resident examination questions, with Gemini‐2.0 showing the highest overall accuracy. However, repeated self‐examination of LLMs through the introduction of doubt does not consistently or significantly enhance their performance on radiology‐related questions.
