DOI: 10.1145/3797148 ISSN: 2994-970X
Coding in a Bubble? Evaluating LLMs in Resolving Context Adaptation Bugs during Code Adaptation
Tanghaoran Zhang, Xinjun Mao, Shangwen Wang, Yuxin Zhao, Yao Lu, Zezhou Tang, Wenyu Xu, Longfei Sun, Changrong Xie, Kang Yang, Yue Yu
Code adaptation is a fundamental but challenging task in software development, requiring developers to modify existing code for new contexts. A key challenge is to resolve Context Adaptation Bugs (CtxBugs), which occur when code that is correct in its original context violates constraints in the target environment. Unlike isolated bugs, CtxBugs cannot be resolved through local fixes and require cross-context reasoning to identify semantic mismatches. Overlooking them may lead to critical failures in adaptation. Although Large Language Models (LLMs) show great potential in automating code-related tasks, their ability to resolve CtxBugs remains unexplored, which is a significant obstacle to their practical use in code adaptation.
To bridge this gap, we propose CtxBugGen, a novel framework for generating CtxBugs to evaluate LLMs. Its core idea is to leverage LLMs’ tendency to generate plausible but context-free code when contextual constraints are absent. The framework generates CtxBugs through a four-step process to ensure their relevance and validity: (1) Selection of four established context-aware adaptation tasks from the literature, (2) Perturbation via task-specific rules to induce CtxBugs from LLMs while ensuring their plausibility, (3) Generation of candidate variants by prompting LLMs without any context constraints, and (4) Identification of valid CtxBugs through syntactic differencing and test execution in the target context. Based on the benchmark constructed by CtxBugGen, we conduct an empirical study with four state-of-the-art LLMs. Our results reveal their unsatisfactory performance in CtxBug resolution. The best-performing LLM, Kimi-K2, achieves 55.93% on Pass@1 and resolves just 52.47% of CtxBugs. The presence of CtxBugs degrades LLMs’ adaptation performance by up to 30%. Failure analysis indicates that LLMs often overlook CtxBugs and replicate them in their outputs. This suggests that LLMs focus excessively on the local correctness of the reused code while ignoring its compatibility with the target context. Our study highlights a critical weakness in LLMs’ cross-context reasoning and emphasizes the need for new methods to enhance their context awareness for reliable code adaptation. The replication package for this paper is at https://github.com/ztwater/CtxBugGen.
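To make the identification step (4) concrete, the following is a minimal sketch and not the CtxBugGen implementation. It assumes that a candidate counts as a valid CtxBug when it differs syntactically from the correctly adapted code and fails the tests of the target context. The names `is_valid_ctxbug`, `passes_target_tests`, `REFERENCE`, and `CANDIDATE`, as well as the millisecond example, are hypothetical and chosen only for illustration.

```python
import ast
from typing import Callable


def normalized(source: str) -> str:
    """Dump the AST so that whitespace and comments do not count as differences."""
    return ast.dump(ast.parse(source))


def is_valid_ctxbug(reference: str, candidate: str,
                    passes_tests: Callable[[str], bool]) -> bool:
    """Assumed criterion: syntactically different AND failing in the target context."""
    try:
        differs = normalized(candidate) != normalized(reference)
    except SyntaxError:
        return False  # unparsable code is discarded in this sketch
    return differs and not passes_tests(candidate)


# Hypothetical target context: callers expect the timeout in milliseconds.
REFERENCE = "def timeout(seconds):\n    return seconds * 1000\n"
# Hypothetical context-free variant: plausible in isolation, wrong for the target.
CANDIDATE = "def timeout(seconds):\n    return seconds\n"


def passes_target_tests(source: str) -> bool:
    """Run one illustrative target-context test against the given source."""
    namespace = {}
    try:
        exec(source, namespace)
        return namespace["timeout"](2) == 2000
    except Exception:
        return False
```

Under these assumptions, `is_valid_ctxbug(REFERENCE, CANDIDATE, passes_target_tests)` returns `True`, while a copy of the reference that only adds a comment is rejected because its syntax tree is unchanged.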