ProjectEvalPlus: An Agentic Software Engineering Benchmark with Automatic Language Extension and User Simulated Evaluation

DOI: 10.1145/3817119 ISSN: 1049-331X

Kaiyuan Liu, Youcheng Pan, Yexing Du, Lei Zhang, Daojing He, Yang Xiang

Large Language Models (LLMs) have shown impressive capabilities in software engineering (SWE), yet evaluating their performance on realistic project-level SWE remains challenging. Existing benchmarks focus on isolated snippets and lack project-level context, language extensibility, and end-user realism. We therefore introduce ProjectEvalPlus, a comprehensive benchmark based on simulated user interaction that extends automatic evaluation to multiple programming languages via Automatic Language Extensibility (ALE), supports evaluation across the stages of multi-agent scenarios, and provides detailed feedback on logical and runtime errors. Through unified testing on Python, Java, and JavaScript, we demonstrate that ProjectEvalPlus effectively assesses LLM agents’ capabilities, reveals language-specific biases, and enables iterative cross-language adaptation. Our results highlight both the strengths and limitations of current LLM agents in practical software development, offering actionable insights for future multi-language, project-level software engineering evaluation.
