RAG-Enhanced Vision–Language Framework and Dataset for Railway Signal Cognition and Safety Reasoning
Qunbo Wang, Shiyi Xiong, Jiawei Li, Weiliang Li, Chu Huang, Sen Zhang, Xize Guo, Chao Fan, Wenjun Wu

Railway scene understanding is critical for ensuring train operational safety and advancing intelligent railway systems. Existing railway vision methods focus mainly on perception and classification and lack regulation-guided semantic reasoning capabilities in complex environments. To address these limitations, this paper proposes a retrieval-augmented generation (RAG)-enhanced vision–language framework for railway signal cognition and safety reasoning. The proposed method integrates railway signal perception, regulatory knowledge retrieval, and multi-modal reasoning to improve factual consistency, reasoning reliability, and operational interpretability. In addition, a dedicated railway signal dataset comprising 500 standardized railway scene images with structured QA annotations is constructed to support regulation-oriented multi-modal recognition evaluation. Experimental results show that the proposed framework improves reasoning accuracy from 28.40% to 67.20%, with an average end-to-end inference latency of 11.31 s per sample. Inference speed can be further improved by adjusting the experimental configuration, trading off efficiency against accuracy. These results demonstrate the potential of RAG-enhanced architectures as a foundational step toward reliable multi-modal cognition in intelligent railway systems.
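
The retrieve-then-reason flow summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: the rule texts, the `Rule` type, the token-overlap scoring, and the functions `retrieve` and `build_prompt` are all hypothetical stand-ins chosen for illustration. A real system would presumably take the perceived signal aspect from a vision model and pass the assembled prompt to a vision–language model, neither of which is shown here.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    rule_id: str
    text: str


# Hypothetical placeholder rules for illustration only,
# not quotations from any actual railway regulation.
RULES = [
    Rule("R1", "A red signal aspect means stop before the signal."),
    Rule("R2", "A green signal aspect means proceed at permitted speed."),
    Rule("R3", "A yellow signal aspect means proceed with caution and prepare to stop."),
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, rules, k=1):
    """Rank rules by token overlap with the query (a toy stand-in for a
    real retriever) and return the top k."""
    q = tokenize(query)
    ranked = sorted(rules, key=lambda r: len(q & tokenize(r.text)), reverse=True)
    return ranked[:k]


def build_prompt(perceived_aspect, question, rules, k=1):
    """Combine the perceived signal state, retrieved rules, and the question
    into one prompt string for a downstream reasoning model."""
    query = f"{perceived_aspect} signal aspect {question}"
    hits = retrieve(query, rules, k)
    context = "\n".join(f"[{r.rule_id}] {r.text}" for r in hits)
    return (
        f"Perceived signal: {perceived_aspect}\n"
        f"Retrieved rules:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_prompt("red", "What should the driver do?", RULES))
```

Grounding the prompt in retrieved rule text, rather than relying on the model's parametric knowledge alone, is the general mechanism by which RAG is expected to improve factual consistency; the overlap scoring here would normally be replaced by a dense or hybrid retriever.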