DOI: 10.3390/math14122197 ISSN: 2227-7390

VISA-Agent: A Visual Symbolic Agent for Reasoning-Intensive Multimodal Retrieval

Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Mostafa Farouk Senussi, Abdelrahman Abdallah, Hyun Soo Kang

Reasoning-intensive multimodal retrieval suffers from a counter-intuitive bottleneck: on MM-BRIGHT multimodal-to-text (Query+Image → Documents), the strongest dense multimodal encoder reaches only 27.6 nDCG@10 and the rest of the dense vision–language retrievers cluster between 10.0 and 23.0. The visual signal, encoded as a dense vector, adds noise rather than evidence; even augmenting strong text retrievers with raw image captions degrades performance by up to 12.0 points. We propose VISA, a Visual Symbolic Agent that recasts multimodal-to-text as text retrieval over three parallel streams. A Vision LLM is dispatched in three roles via separate prompts: a zero-shot router that classifies the query image into up to three parser types from a fixed taxonomy of nine (chart, circuit, equation, screenshot, code, figure, diagram, map, photograph); typed parsers that extract structured text per type; and a holistic captioner. The agent constructs three text streams (raw query, query ⊕ symbolic, query ⊕ caption), scores each with a single frozen 4B-parameter retrieval LLM, and fuses the per-document scores via Reciprocal Rank Fusion or a confidence-weighted linear combination. The whole agent contains no trainable parameters. The key novelty is a change of substrate: rather than projecting the query image into a dense multimodal vector that competes with text, VISA is, to our knowledge, the first retrieval system to convert the image into typed symbolic text and keep retrieval entirely text-side, so that a frozen text retriever can match the literal tokens (axis values, variable names, function signatures) that answering documents actually contain. Across all 29 MM-BRIGHT multimodal-to-text domains, VISA achieves 32.4 nDCG@10, an absolute improvement of +4.8 over the strongest dense multimodal encoder and substantially larger margins over the remaining six dense vision–language baselines. Per-domain analysis shows VISA maintains its margin across STEM and software domains where image content is structure-heavy. In practical terms, VISA is training-free and model-agnostic: it requires no fine-tuning, reuses any off-the-shelf vision LLM and text retriever, caches all per-image parsing so re-runs cost only three query encodes, and can therefore be dropped into an existing text-retrieval stack to add reasoning-intensive multimodal capability without building or training a multimodal encoder.
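
To illustrate the fusion step named in the abstract, the following is a minimal sketch of standard Reciprocal Rank Fusion over three ranked lists, one per text stream. It is not the authors' implementation: the function name, the document ids, and the smoothing constant k = 60 (a common default in the RRF literature) are assumptions, since the abstract does not specify them.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids with standard RRF.

    k=60 is a commonly used default, assumed here; the abstract does not
    state which value VISA uses.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from the three streams (illustrative ids only).
raw_stream = ["d3", "d1", "d2"]        # raw query
symbolic_stream = ["d1", "d3", "d4"]   # query + symbolic text
caption_stream = ["d1", "d2", "d3"]    # query + caption

fused = reciprocal_rank_fusion([raw_stream, symbolic_stream, caption_stream])
print(fused)
```

Because RRF uses only rank positions, the three streams need no score calibration against each other; the confidence-weighted linear combination mentioned in the abstract would instead operate on the retriever scores themselves.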
