DOI: 10.3390/metrics3030011 ISSN: 3042-5042

Benchmarking Multimodal Mathematical Reasoning: Prompt Effects, Modality Gaps, and Failure Modes

Gökan Görer, Maria Osipenko, Thomas Knispel

Large language models and vision–language models already achieve strong results on reasoning tasks, but their reliability under controlled assessment-style conditions remains insufficiently characterized. This paper presents a benchmark study of multimodal multiple-choice mathematical reasoning using 324 Austrian Mathematical Kangaroo competition problems (2022–2024), including both text-only and diagram-dependent items. We evaluate five state-of-the-art models under a controlled protocol that isolates two factors: input modality and prompt format. We compare a strict short-answer condition requiring a single option label (one_liner) with a structured condition eliciting step-by-step reasoning and an explicit final answer (full), while enforcing deterministic decoding and rule-based answer extraction in both. Performance is assessed using accuracy, abstention rates, and contest-style scoring, supported by paired and unpaired statistical analyses and a structured error taxonomy. The results show that prompt format is the primary driver of performance: structured prompting yields substantial gains across all models, particularly on text-only items. In contrast, visual-text problems remain consistently harder, with a robust performance gap across prompting conditions that indicates persistent limitations in visual grounding. Model comparisons are additionally influenced by response strategies, especially abstention behavior under strict output constraints. An error analysis reveals systematic failure modes, including constraint violations, inappropriate strategy selection, and diagram misinterpretation, alongside structured biases in multiple-choice selection under constrained prompting. Overall, the findings demonstrate that measured performance is highly sensitive to the interaction between prompt format and input modality. This underscores the importance of treating prompting, decoding, and answer extraction as integral components of evaluation in assessment-oriented settings, where reliability and reproducibility are central.
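To make the evaluation pipeline described above more concrete, the following is a minimal sketch of rule-based answer extraction combined with accuracy, abstention rate, and contest-style scoring. It is not the authors' implementation. The option labels A to E, the "Answer: X" pattern, the function names `extract_label` and `score`, and the scoring rule (full points for a correct answer, a fractional penalty for a wrong one, zero for an abstention) are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed convention: the final answer appears as "Answer: X" with X in A-E.
ANSWER_RE = re.compile(r"\banswer\s*:\s*\(?([A-E])\)?\b", re.IGNORECASE)
# Assumed short-answer form: the whole response is a single option label.
BARE_LABEL_RE = re.compile(r"\(?([A-E])\)?\.?", re.IGNORECASE)


def extract_label(response: str) -> Optional[str]:
    """Return the chosen option label, or None if no label can be extracted."""
    matches = ANSWER_RE.findall(response)
    if matches:
        # Use the last explicit answer so earlier reasoning steps are ignored.
        return matches[-1].upper()
    bare = BARE_LABEL_RE.fullmatch(response.strip())
    if bare:
        return bare.group(1).upper()
    return None  # counted as an abstention


def score(items, penalty_fraction: float = 0.25) -> dict:
    """Score (predicted, correct, points) triples.

    penalty_fraction is a hypothetical parameter: the share of an item's
    points deducted for a wrong answer. Abstentions (None) score zero.
    """
    n = len(items)
    correct = sum(1 for pred, key, _ in items if pred == key)
    abstained = sum(1 for pred, _, _ in items if pred is None)
    contest = 0.0
    for pred, key, points in items:
        if pred is None:
            continue
        contest += points if pred == key else -penalty_fraction * points
    return {
        "accuracy": correct / n,
        "abstention_rate": abstained / n,
        "contest_score": contest,
    }


if __name__ == "__main__":
    responses = ["Step 1 ... Answer: B", "C", "I cannot determine this."]
    keys = ["B", "D", "A"]
    points = [3, 4, 5]
    preds = [extract_label(r) for r in responses]
    print(score(list(zip(preds, keys, points))))
```

Under a rule of this kind, a model that abstains on uncertain items can outscore one that always guesses even when its accuracy is lower, which is one way response strategy can influence model comparisons.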
