DOI: 10.1177/10731911261455137 ISSN: 1073-1911

Evaluating LLM-Based Coders in Psychological Assessment: A Validation Framework With Application to the Rorschach Morbid Content Variable

Ruam P. F. A. Pimentel, Gregory J. Meyer

Large language models (LLMs) are increasingly used to support psychological assessment, but standards for evaluating their scoring accuracy remain limited. This article introduces a clear, reproducible validation framework for evaluating LLM-based scoring systems. The framework separates pre-validation steps (e.g., balancing base rates, refining prompts, and comparing models) from a standardized validation phase focused on reliability and validity benchmarks. We demonstrate its application with a case study of Morbid Content (MOR) scoring in the Rorschach task, using a two-agent LLM workflow. In an independent dataset (n = 84; 2,176 responses) with natural MOR base rates, the final LLM coder showed good response-level agreement (kappa = .72–.74) and excellent protocol-level agreement (ICC = .94–.95) with assessors, near-perfect consistency with itself (ICC = .97–.99), and replicated external validity (r = .59–.71) that matched that of human coders (r = .54–.65). This article offers a practical guide for evaluating automated coders in psychological testing and discusses the associated implementation decisions and ethical considerations.
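
As an illustration of the response-level agreement statistic reported above, the following is a minimal sketch of Cohen's kappa for two coders who each assign one binary MOR code per response. It is not the authors' implementation. The function name `cohen_kappa` and the ten example codes are hypothetical and chosen only to show the arithmetic.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must code the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: proportion of items given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        return float("nan")

    return (observed - expected) / (1.0 - expected)


# Hypothetical codes for ten responses (1 = MOR scored, 0 = not scored).
human_codes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
llm_codes = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1]

kappa = cohen_kappa(human_codes, llm_codes)
print(round(kappa, 2))
```

In this made-up example the coders agree on 8 of 10 responses (observed agreement .80), while chance agreement from the marginals is .58, so kappa is (.80 − .58) / (1 − .58), or about .52. Because kappa corrects for chance agreement, it is sensitive to the base rate of the coded category, which is one reason a validation sample with natural base rates matters.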
