DOI: 10.1145/3822602 ISSN: 1551-6857

QG-STR: Training-Time Optimized Question-Guided Scene Text Recognition via Visual Question Answering

Quanxing Xu, Ling Zhou, Xian Zhong, Feifei Zhang, Rubing Huang

Scene Text Spotting (STS) aims to transcribe text embedded in natural images, typically encompassing Scene Text Detection (STD) and Scene Text Recognition (STR). Advances in image understanding have made end-to-end text spotting increasingly viable. Concurrently, multimodal research has highlighted the potential of vision-language reasoning tasks, such as Visual Question Answering (VQA). To leverage multimodal reasoning for STR, we propose a training-time question-guided STR framework that integrates VQA, termed Question-Guided Scene Text Recognition (QG-STR). The framework unifies STR, Visual Question Generation (VQG), and VQA within a single architecture, enabling multimodal reasoning to enhance text-spotting performance. Specifically, visual understanding and logical reasoning are used as supervisory signals during training to improve text recognition accuracy and boost end-to-end text spotting. QG-STR is model-agnostic and compatible with diverse STR and VQA architectures, employing question guidance solely as a training-time supervision mechanism. During inference, the STR module functions independently without requiring external questions. Extensive experiments on Total-Text, ICDAR2015, ICDAR2013, and CTW1500 validate the effectiveness of QG-STR.
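The split between training-time supervision and question-free inference can be illustrated with a minimal sketch. The abstract does not state how the STR, VQG, and VQA objectives are combined, so the weighted sum below is an assumption, and every name in it (`LossWeights`, `training_loss`, `recognize`, and the weight values) is hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical weights for the three objectives (not from the paper)."""
    str_weight: float = 1.0
    vqg_weight: float = 0.5
    vqa_weight: float = 0.5


def training_loss(str_loss: float, vqg_loss: float, vqa_loss: float,
                  weights: LossWeights = LossWeights()) -> float:
    """Combine the recognition loss with the auxiliary VQG and VQA losses.

    Assumption: a plain weighted sum. The auxiliary terms act only as extra
    supervision while training; they are never evaluated at inference.
    """
    return (weights.str_weight * str_loss
            + weights.vqg_weight * vqg_loss
            + weights.vqa_weight * vqa_loss)


def recognize(str_module: Callable[[Any], str], image: Any) -> str:
    """Inference path: only the STR module runs, and no question is supplied."""
    return str_module(image)
```

In this sketch, a stand-in recognizer such as `lambda image: "EXIT"` can be passed to `recognize` without any VQG or VQA component being present, which mirrors the claim that the STR module functions independently at inference.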
