WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements
Xiwen Teoh, Yun Lin, Duc-Minh Nguyen, Ruofei Ren, Wenjie Zhang, Jin Song Dong
Visual language model (VLM) agents show great promise in automating graphical user interface (GUI) testing against requirements in natural language. However, language models are probabilistic and inherently prone to hallucination. As a result, when an inconsistency is detected between the requirement and the web application, it is hard to tell whether it stems from a hallucination or from a real application bug. Addressing this issue presents two core technical challenges: (1) limited capability and accuracy in deriving implicit test oracles, where the agent must act as its own oracle and decide, without guidance, whether the application’s behavior is correct, and (2) limited reliability due to probabilistic inference, where an LLM’s inconsistent reasoning undermines its trustworthiness as an oracle.
We introduce WebTestPilot, a neurosymbolic LLM-based approach that addresses both challenges through symbolization. WebTestPilot detects and abstracts critical GUI elements of a web application into symbolic variables. This design improves reliability by constraining assertion generation to operations grounded in explicitly defined symbols, thereby reducing unconstrained or inconsistent reasoning. At the same time, it improves accuracy by representing application states and their relationships in a structured symbolic form, which increases the likelihood of the agent recognizing data, causal, and temporal dependencies across states. Together, these capabilities enable WebTestPilot to generate reliable and accurate test oracles that capture meaningful implicit expectations derived from test requirements. To advance research in this area, we build a benchmark of bug-injected web apps for evaluating NL-to-E2E testing. The results show that WebTestPilot achieves a task completion rate of 99%, with 96% precision and 96% recall in bug detection, outperforming the best baseline (+70 precision, +27 recall). The agent generalizes across diverse natural language inputs (i.e., those containing typos, grammatical errors, redundant sentences, stylistic restyling, or abbreviations) and model scales (3B–72B). In a real-world deployment with a no-code platform, WebTestPilot discovered 8 bugs during development, including data binding, UI, and navigation issues.
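To make the idea of symbol-grounded oracles concrete, the following is a minimal sketch, not WebTestPilot's actual implementation. It assumes that each GUI element of interest has already been abstracted into a named symbol, and that an oracle is a predicate over the symbols of a pre-action and a post-action state. All names here (`Symbol`, `snapshot`, `check_oracle`, `cart_count_increments`) and the shopping-cart scenario are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Symbol:
    """Hypothetical abstraction of one GUI element as a named variable."""
    name: str
    value: str


# A state maps symbol names to the symbols captured at one test step.
State = Dict[str, Symbol]
Oracle = Callable[[State, State], bool]


def snapshot(**elements: str) -> State:
    """Build a symbolic state from element-name/value pairs (illustrative)."""
    return {name: Symbol(name, value) for name, value in elements.items()}


def check_oracle(pre: State, post: State, oracle: Oracle) -> bool:
    """Evaluate an oracle over two states.

    The oracle can only read declared symbols; referring to an undeclared
    one raises KeyError instead of silently guessing a value.
    """
    return oracle(pre, post)


def cart_count_increments(pre: State, post: State) -> bool:
    """Assumed implicit expectation: adding an item raises the count by one."""
    return int(post["cart_count"].value) == int(pre["cart_count"].value) + 1


# Hypothetical states around an "add to cart" action.
before = snapshot(cart_count="2", banner="")
after_ok = snapshot(cart_count="3", banner="Added to cart")
after_bug = snapshot(cart_count="2", banner="Added to cart")

ok = check_oracle(before, after_ok, cart_count_increments)
bug = check_oracle(before, after_bug, cart_count_increments)
```

In this sketch the second post-state shows a success banner while the count is unchanged, so the oracle fails. Expressing the check as a relation between two symbolic states is one way to capture the kind of data dependency across states described above.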