Evaluating GPT-4o on Introductory Economics Problem Sets: A Rating-Based Benchmark of Question Type and Assessment Design
Yun Liu, Tina Wong, Tak Wai Chau, Yuchang Cao

Problem sets are widely used in economics education, but the availability of generative artificial intelligence creates new challenges for assessment validity and academic integrity. This study conducts a rating-based benchmarking evaluation of GPT-4o on introductory economics problem-set items. Using 260 rated questions and three independent instructor ratings per item, we examine how GPT-4o performance varies across discussion, numerical, graphical and mixed-modality questions. The study is descriptive rather than causal: it benchmarks GPT-4o outputs under a specified prompting, input-modality and scoring protocol. All items were submitted through the ChatGPT web interface in fresh sessions, each item was answered once, graphical items were provided through uploaded original diagram images, no follow-up prompts were used, and outputs were saved without editing. Results show that GPT-4o performs comparatively well on text-based discussion and numerical-only items, but substantially less well on graphical items, especially those requiring numerical reasoning grounded in a graph. Inter-rater reliability is high according to intraclass correlation coefficients, and pooled rater–item analyses confirm the graphical and graphical–numerical performance gap as a descriptive benchmark pattern. To improve reproducibility while respecting copyright restrictions, the revised manuscript specifies the prompting protocol, coding procedure, rater procedure and supplementary replication files. The findings suggest that economics instructors should not simply add graphical questions as an “AI-proof” device, but should design constructively aligned, accessible, mixed-format assessments that validly sample intended economic reasoning skills. The conclusions are restricted to GPT-4o, the February–June 2025 testing window, the ChatGPT web-interface protocol and the item corpus used in this study.