Vocal-Eyes: AI-Powered Smart Glasses for the Blind Using Transformer-Based Architecture and Scene Graph Generation
Amna Shabbir, Uzma Afsheen, Muhammad Faizan Shirazi, Abdul Rauf, Syed Muhammad Meesam Abbas, Shahid Saeed, Abdul Samad Khan, Safdar Rizvi, Nurashikin Saaludin

Visually impaired individuals face significant challenges in autonomous mobility and situational awareness. Most existing assistive technologies address isolated tasks, such as object recognition or text reading, while failing to capture broader environmental context. This work addresses this limitation by proposing a scene-sensitive, low-cost assistive system that delivers holistic situational information. We present Vocal-Eyes, an intelligent smart glasses platform that provides periodic audio descriptions of the surrounding environment. The system employs a cloud-based neural processing pipeline in which visual features are extracted using a Transformer-based architecture. Relational context is modeled through scene graph generation, and scene graphs are translated into natural language via a graph-to-text module. A lightweight hardware prototype captures visual data locally, while computationally intensive processing is offloaded to the cloud to reduce power consumption. The experimental results show that relational, scene-based narration produces more coherent and informative descriptions than object-centric approaches while maintaining acceptable periodic latency. Cost analysis further indicates that Vocal-Eyes is significantly more affordable than comparable commercial smart glasses solutions. These results demonstrate that Transformer-based scene understanding with cloud-assisted processing is an effective and practical approach for developing accessible, context-aware assistive technologies for visually impaired users.
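
To illustrate the final stage of the pipeline described above, the following is a minimal sketch of how scene-graph triples could be turned into a short narration. It is an assumption for illustration only: the system uses a graph-to-text module whose details are not given here, whereas this sketch substitutes a simple fixed template. The `Triple` type, the `narrate` function, and the example triples are all hypothetical names and data.

```python
from typing import List, Tuple

# Hypothetical representation of one scene-graph edge:
# (subject, predicate, object), e.g. ("person", "sitting on", "bench").
Triple = Tuple[str, str, str]


def narrate(triples: List[Triple]) -> str:
    """Render scene-graph triples as plain sentences using a fixed template.

    This is a template-based stand-in, not a learned graph-to-text model.
    """
    if not triples:
        return "No objects detected."
    sentences = [f"A {subj} is {pred} a {obj}." for subj, pred, obj in triples]
    return " ".join(sentences)


if __name__ == "__main__":
    # Illustrative input only; not taken from the system's output.
    example = [("person", "sitting on", "bench"), ("dog", "next to", "person")]
    print(narrate(example))
```

Unlike a list of detected labels ("person, bench, dog"), the relational form keeps who is doing what and where, which is the distinction between scene-based and object-centric narration drawn above.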