Evaluating Causal Reasoning Capabilities of Large Language Models: A Systematic Analysis Across Three Scenarios
Lei Wang, Yiqing Shen

Large language models (LLMs) have shown strong capabilities in numerical and logical reasoning, yet their abilities in higher-order cognitive tasks, particularly causal reasoning, remain less explored. Current research on LLMs in causal reasoning has focused primarily on tasks such as identifying simple cause-effect relationships, answering basic “what-if” questions, and generating plausible causal explanations. However, these models often struggle with complex causal structures, confounding variables, and distinguishing correlation from causation. This work addresses these limitations by systematically evaluating LLMs’ causal reasoning abilities across three representative scenarios, namely analyzing causation from effects, tracing effects back to causes, and assessing the impact of interventions on causal relationships. These scenarios are designed to challenge LLMs beyond simple associative reasoning and to test their ability to handle more nuanced causal problems. For each scenario, we construct four paradigms and employ three prompt schemes, namely zero-shot prompting, few-shot prompting, and Chain-of-Thought (CoT) prompting, yielding a set of 36 test cases. Our findings reveal that most LLMs encounter challenges in causal cognition across all prompt schemes, which underscores the need to enhance the cognitive reasoning capabilities of LLMs to better support complex causal reasoning tasks. By identifying these limitations, our study contributes to guiding future research and development efforts toward improving LLMs’ higher-order reasoning abilities.
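As an illustrative sketch only (the abstract does not give the actual prompt templates, so the question wording, the worked example, and the function names below are hypothetical), the three prompt schemes could be instantiated for a single causal test case as follows:

```python
# Hypothetical illustration of the three prompt schemes named in the abstract
# (zero-shot, few-shot, Chain-of-Thought); not the paper's actual templates.

QUESTION = ("The lawn is wet this morning. "
            "Did it rain last night, or did the sprinkler run?")

def zero_shot(question: str) -> str:
    # Ask the question directly, with no examples or reasoning cue.
    return f"Question: {question}\nAnswer:"

def few_shot(question: str) -> str:
    # Prepend one worked example before the target question.
    example = ("Question: The patient recovered after taking the drug. "
               "Does this prove the drug caused the recovery?\n"
               "Answer: No; recovery could have occurred without the drug, "
               "so causation is not established.\n\n")
    return example + f"Question: {question}\nAnswer:"

def chain_of_thought(question: str) -> str:
    # Add an explicit cue to reason step by step before answering.
    return f"Question: {question}\nLet's think step by step.\nAnswer:"

if __name__ == "__main__":
    for build in (zero_shot, few_shot, chain_of_thought):
        print(f"--- {build.__name__} ---")
        print(build(QUESTION))
        print()
```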