DOI: 10.1142/s0129156425407594 ISSN: 0129-1564

A Multiagent Scorer for Improved AES of International Students’ Essays

Wenwen Cheng, Zhen Li

Automated Essay Scoring (AES) is critical for efficient educational assessment, but it struggles to evaluate complex, multifaceted writing criteria and to emulate nuanced human judgment. The difficulty is especially acute in international Chinese language education, where learners differ widely in language level, come from complex cultural backgrounds, and must follow norms specific to Chinese writing. Existing methods, including single Large Language Models (LLMs), often struggle with consistent rubric alignment, explainability, and capturing subtle quality differences. To address these limitations, we propose MARES (MultiAgent Rubric guided Essay Scorer for Enhanced AES), a novel framework leveraging a society of collaborating LLM agents. MARES decomposes the scoring process into three phases: (1) multidimensional analysis by specialized “Expert Agents” focusing on dimensions such as Content, Structure, Language, and Grammar, as well as cultural adaptation, Chinese character norms, and cross-cultural expressions; (2) collaborative deliberation among “Deliberator Agents” to synthesize findings against the scoring rubric; and (3) final score and feedback generation by a “Synthesizer Agent.” MARES also covers the demonstration of Chinese character writing, the analysis of cultural allusions, and cross-cultural comparisons. Extensive experiments on the benchmark ASAP-AES dataset and on international Chinese learner composition datasets show that MARES significantly outperforms traditional, deep learning, and strong single-LLM baselines. On ASAP-AES it achieves state-of-the-art performance as measured by Quadratic Weighted Kappa (QWK), and on the Chinese learner datasets it performs particularly well on Chinese-specific metrics such as Chinese character writing accuracy, appropriateness of cultural expressions, and correctness of idiom usage.
Ablation studies validate the contribution of each component, and further analyses confirm MARES’s generalizability across different base LLMs, presenting a promising direction for more accurate, nuanced, and potentially explainable AES.
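
For readers unfamiliar with the evaluation metric named above, the following is a minimal sketch of how Quadratic Weighted Kappa is commonly computed between human and system scores. It uses the standard textbook definition; the function name, argument names, and example scores are illustrative assumptions, and the abstract does not specify the exact evaluation setup (for example, per-prompt score ranges) used for MARES.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Standard QWK between two equal-length lists of integer scores.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_levels = max_score - min_score + 1
    if n_levels < 2:
        raise ValueError("need at least two distinct score levels")

    total = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    hist_human = Counter(human)            # marginal counts per rater
    hist_model = Counter(model)

    disagreement = 0.0  # weighted observed disagreement
    chance = 0.0        # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: grows with the squared score distance.
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            disagreement += weight * observed[(i, j)] / total
            chance += weight * hist_human[i] * hist_model[j] / total ** 2

    if chance == 0.0:
        # Both raters gave the same single score throughout; treated here
        # as perfect agreement by convention.
        return 1.0
    return 1.0 - disagreement / chance


# Made-up scores on a 1-4 scale, for illustration only.
human_scores = [1, 2, 3, 4]
print(quadratic_weighted_kappa(human_scores, [1, 2, 3, 4], 1, 4))
print(quadratic_weighted_kappa(human_scores, [4, 3, 2, 1], 1, 4))
```

A value of 1 indicates perfect agreement with the human rater, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the penalty is quadratic, a prediction that is off by two score points costs four times as much as one that is off by one.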