Look Before You Leap: Context-Sensitive GUI Grounding for Boosting Automated Extended Reality (XR) Testing
Shuqing Li, Binchang Li, Yepang Liu, Cuiyun Gao, Jianping Zhang, Shing-Chi Cheung, Michael R. Lyu
In recent years, Extended Reality (XR) has emerged as a transformative technology, offering users immersive and interactive experiences across diverse virtual or virtual-real environments. Users interact with XR applications (apps) through interactable GUI elements (IGEs) on the stereoscopic three-dimensional (3D) graphical user interface (GUI). IGEs are the fundamental elements of the XR GUI and carry rich semantic information. Accurately recognizing and understanding these IGEs is the foundation of GUI grounding, which can facilitate downstream tasks, including automated XR testing. A straightforward XR test generator interacts randomly within the app’s 3D environment, so it can become trapped in non-interactable space, which makes testing ineffective and inefficient. In contrast, a more intelligent test generator, informed by the accurate locations and semantics of IGEs, can make wiser decisions about interaction targets and their order, forming test sequences that cover more functionalities faster. The most recent IGE detection approaches in software engineering (SE) are designed for 2D mobile apps and typically train a supervised object detection model on a large-scale, manually labeled GUI dataset, usually with a pre-defined set of clickable GUI element categories such as buttons and spinners. Such approaches can hardly be applied to IGE detection in XR apps because of several challenges: open-vocabulary and heterogeneous IGE categories, context-sensitive interactability, and the need for precise spatial perception and visual-semantic alignment to obtain accurate detection results. IGE research tailored to XR apps is therefore necessary.
In this paper, we propose Orienter, the first zero-shot, context-sensitive interactable GUI element detection framework for XR apps. Rather than relying on generic visual grounding, which fails in 3D environments, Orienter introduces a structured workflow tailored to XR constraints. It first synthesizes XR-specific semantic contexts (e.g., global interaction paradigms and 3D spatial properties) before performing detection. To overcome the severe spatial hallucinations inherent in large multimodal models (LMMs), the detection process is iterated within an XR-constrained reflection loop. Specifically, Orienter contains three components, including (1)