DOI: 10.1145/3820781 ISSN: 1551-6857

A Reciprocal Interaction Framework for Collaborative Temporal Grounding and Question Answering in Egocentric Videos

Jiaxu Wang, Tianshan Liu, Bing-Kun Bao

Collaborative Temporal Grounding and Question Answering (CTGQA) in egocentric videos enables users to inquire about past visual experiences and obtain corresponding temporal segments and answers. Existing CTGQA methods typically treat Video Temporal Grounding (VTG) and Video Question Answering (VQA) as separate tasks, overlooking their inherent semantic and temporal complementarity. As a result, VQA models often generate ambiguous answers due to the lack of precise temporal cues, while VTG models fail to fully exploit the high-level semantic information embedded in the answers. To address these limitations, we propose a Reciprocal Interaction Framework (RIF). RIF employs a two-branch interaction structure to enhance the performance of both VTG and VQA. RIF consists of two modules: Localization-Guided Answering (LGA) and Answer-Enhanced Temporal Grounding (AETG). The LGA module assists the VQA model in generating high-quality answers by highlighting relevant segments while minimizing the influence of irrelevant content. To mitigate model overconfidence, we propose a progressive feature fusion strategy that dynamically adjusts the weights of relevant segments, thus preventing localization errors. The AETG module leverages additional information embedded in the generated answer to improve VTG performance. Moreover, we employ a perplexity-based filtering strategy to ensure the reliability of the answer. Extensive experiments show that our framework performs well on the QAEGO4D and Ego4D-NLQ benchmarks.
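To illustrate the general idea behind perplexity-based filtering, the following is a minimal sketch and not the authors' implementation. It assumes the answer generator exposes per-token natural-log probabilities. The function names `perplexity` and `keep_answer` and the threshold `max_perplexity=5.0` are hypothetical and chosen only for illustration. Perplexity is computed in the standard way, as the exponential of the negative mean token log-probability, so a lower value indicates a more confident answer.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # exp of the negative mean log-probability
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def keep_answer(token_logprobs, max_perplexity=5.0):
    """Return True if the answer is confident enough to be reused.

    max_perplexity is a hypothetical threshold, not a value from the paper.
    """
    return perplexity(token_logprobs) <= max_perplexity


# Toy log-probabilities (made up for illustration, not model output).
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
uncertain = [math.log(0.2), math.log(0.1), math.log(0.3)]

print(keep_answer(confident))  # low perplexity, answer is kept
print(keep_answer(uncertain))  # high perplexity, answer is filtered out
```

Under this sketch, only answers that pass the filter would be handed to a grounding module as additional input, while rejected answers would be ignored.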
