Evaluating ChatGPT’s Ability to Provide Patient Care Information in Foot and Ankle Surgery: A Comparative Analysis

DOI: 10.1177/19386400261456922 ISSN: 1938-6400

Allison J. Lewis, Zinoubia Hasasna, Alexandra Krez, Geoffrey Phillips, Adam D. Bitterman

Background. Patients increasingly use the Internet and artificial intelligence (AI) platforms such as ChatGPT for medical information, raising concerns about the accuracy and clinical depth of AI-generated content. This study evaluated the reliability and clinical utility of ChatGPT (GPT-3.5 and GPT-4.0) for common foot and ankle conditions compared with patient education materials from the American Orthopaedic Foot & Ankle Society (AOFAS) FootCareMD.

Methods. Between January 20 and 26, 2025, standardized prompts were used to query GPT-3.5 and GPT-4.0 across 15 common foot and ankle conditions. ChatGPT responses were compared with AOFAS FootCareMD content based on the number of symptoms, risk factors, and treatment options provided. Two fellowship-trained foot and ankle orthopaedic surgeons independently evaluated response accuracy, categorizing outputs as <50%, 50% to 74%, 75% to 99%, or 100% accurate. Paired t-tests were used for statistical comparisons, and inter-rater reliability was assessed using Cohen’s weighted kappa.

Results. GPT-4.0 generated significantly more symptoms than AOFAS content (P = .015). In contrast, GPT-3.5 listed significantly fewer treatment options than both AOFAS and GPT-4.0 (P = .042). When addressing surgical management, both ChatGPT versions frequently provided vague or incomplete information. GPT-3.5 referenced surgery without procedural detail in 53% of responses, while GPT-4.0 lacked detailed surgical explanations or omitted them entirely in 80% of responses. Overall accuracy ratings were high, with 77% of responses judged as 75% to 99% accurate and only 3.4% rated below 50% accuracy. However, inter-rater agreement between surgeons was poor (κ = −0.02) for responses labeled as 100% accurate, highlighting subjectivity in grading AI-generated medical content.

Conclusion. ChatGPT effectively provides general information on the causes and symptoms of foot and ankle conditions, and GPT-4.0 offers more comprehensive treatment discussions than GPT-3.5. Nevertheless, its limited depth and specificity regarding surgical options restrict its clinical usefulness. Until further improvements are made, AI-generated content should serve as a supplement rather than a replacement for expert-reviewed patient education resources.
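
The combination of high accuracy ratings and a near-zero kappa can look contradictory, so a minimal sketch of how Cohen's weighted kappa behaves may help. The ratings below are hypothetical and are not the study's data. The function name `weighted_kappa` and the choice of linear weights are assumptions, since the abstract does not state the weighting scheme used.

```python
from collections import Counter

# Ordinal accuracy categories, ordered from lowest to highest.
CATEGORIES = ["<50%", "50-74%", "75-99%", "100%"]


def weighted_kappa(rater_a, rater_b, categories=CATEGORIES):
    """Cohen's kappa with linear disagreement weights (an assumed scheme)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty set of items")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    max_dist = len(categories) - 1

    # Observed disagreement: mean normalized distance between paired ratings.
    observed = sum(abs(index[a] - index[b]) for a, b in zip(rater_a, rater_b))
    observed /= n * max_dist

    # Chance-expected disagreement from each rater's marginal category counts.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        count_a[ca] * count_b[cb] * abs(index[ca] - index[cb])
        for ca in categories
        for cb in categories
    )
    expected /= n * n * max_dist

    if expected == 0:
        return 1.0  # both raters used one identical category throughout
    return 1 - observed / expected


# Hypothetical ratings for 15 responses: each rater assigns "100%" to two
# responses, but never to the same ones.
rater_a = ["100%", "100%"] + ["75-99%"] * 13
rater_b = ["75-99%", "75-99%", "100%", "100%"] + ["75-99%"] * 11

raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = weighted_kappa(rater_a, rater_b)
print(round(raw_agreement, 3), round(kappa, 3))  # prints 0.733 -0.154
```

In this hypothetical case the raters agree on 11 of 15 responses, yet kappa is slightly negative because nearly all ratings fall into one category, which makes chance-expected agreement very high. This is one way a near-zero kappa can coexist with generally high accuracy ratings.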

Level of Evidence: Level III Case Control Study
