Leveraging GPT-4o to Evaluate the Usability and Accessibility of Flutter App Interfaces
Shristi Shrestha, Anas Mahmoud

Over the past few years, Flutter has become a leading framework for cross-platform mobile application (app) development. By enabling developers to build apps for multiple operating systems from a single codebase, Flutter can substantially reduce development time and costs, particularly for early-stage ventures seeking broad user reach. However, thoroughly testing UIs across multiple platforms and devices while maintaining compliance with platform guidelines can be a resource-intensive process. To address this challenge, we explore the potential of GPT-4o, a general-purpose LLM, to automatically assess the quality of Flutter apps. We hypothesize that the declarative nature of Flutter code enables GPT-4o to infer UI layouts directly from source code and reason about their design attributes. To test this hypothesis, we conducted a developer study in which 15 participants reviewed GPT-4o's assessment of 15 Flutter app screens. Our results show that a generic prompt can guide the model to accurately identify usability and accessibility issues in Flutter code, suggest actionable fixes, and flag potential violations of app store policies and guidelines. Our approach is implemented in Fuel, a tool for Flutter UI Evaluation using LLMs. An exit survey with our study participants revealed both the potential of, and the barriers to, using Fuel as a UI quality assurance resource in agile, cross-platform development.