Evaluating Large Language Models in Turkish Short Answer Scoring: Validity, Reliability, and Fairness Perspectives

DOI: 10.35377/saucis...1835608 ISSN: 2636-8129

Abdulkadir Kara, Serkan Yıldırım

This study examines the performance of large language models (LLMs) in Turkish short-answer assessments within the framework of measurement and evaluation theory. GPT, Gemini, Gemma, and LLaMA models were evaluated under zero-shot and one-shot conditions with rubric support. The results show that LLMs have high internal consistency, but their decision reliability can vary with prompt format and with sensitivity to examples. Formulating rubrics with clear and concrete performance indicators increases model-human alignment and assessment fairness. Error-direction analyses further revealed that the models can show a systematic tendency to assign low scores. The results indicate that LLMs can support teachers in formative assessment when given properly structured rubrics, but ethical oversight and pedagogical responsibility remain indispensable in final decisions.
