Can Large Language Models Reason About Complex Execution Paths? An Empirical Study on Python
Wenhan Wang, Kaibo Liu, Zeyu Sun, An Ran Chen, Ge Li, Gang Huang, Lei Ma
Execution path reasoning is a key step towards understanding program semantics. It is crucial for generating test cases that cover specific branches or paths, and for detecting bugs triggered by particular paths without actually executing the program. Traditionally, execution path reasoning is done with symbolic execution, but existing SMT-based symbolic execution approaches struggle with complex data structures and external API calls. This challenge is even more pronounced in languages with highly flexible syntax, such as Python, resulting in a lack of widely adopted tools for reasoning about execution paths. Reasoning about execution paths with AI-based approaches has therefore become a promising direction.
In this paper, we investigate the feasibility of adopting large language models (LLMs) for execution path reasoning on Python, where traditional path-based symbolic execution tools are unavailable. We conduct an empirical study on two types of path reasoning tasks: generation tasks for test case generation and classification tasks for bug detection. We build new evaluation pipelines and benchmarks from both competition-level programs and real-world repositories. Our results show that state-of-the-art LLMs can reason correctly about execution paths and improve test coverage on real-world software, though models with stronger reasoning abilities do not always outperform weaker ones. These findings highlight the potential of using LLMs as a complementary heuristic for path-aware code reasoning, especially in programming languages that lack mature symbolic execution tools. We have released our benchmark and evaluation scripts at