Keyframe Selection and Multimodal Fusion for Product Recognition in E-Commerce Live Streaming
Yichuan Zheng, Jin Shi, Wei Shen
Product recognition in e-commerce live streaming is hindered by rapid viewpoint changes, occlusions, motion blur, and inconsistencies between visual and spoken information. Existing approaches typically focus on individual components such as detection, OCR, or speech recognition, which limits their effectiveness in end-to-end structured product understanding. To address this problem, we propose an integrated framework that combines task-oriented keyframe selection with multimodal semantic fusion. The framework first uses D-FINE to localize product regions and then selects informative frames through one of two alternative strategies. Strategy A combines detection confidence with Laplacian-based sharpness, while Strategy B combines detection confidence with a learned quality score estimated by an EfficientNetV2-M regression model. OCR, visual-semantic recognition, and ASR are then applied to extract complementary evidence, and a Qwen3.5-27B large language model structures and fuses this multimodal evidence into standardized product outputs, including brand, product name, and category. Experiments on an in-house e-commerce live-streaming dataset demonstrate substantial gains over a last-frame baseline. Strategy B achieves the best overall result, improving the Perfect Match Rate from 0.609 to 0.775 and the Semantic Similarity from 0.697 to 0.802. Ablation studies further show that the full multimodal framework consistently outperforms unimodal and dual-modality variants under both frame selection strategies. In addition, Top-K analysis indicates that single-frame inference provides a practical balance between OCR evidence completeness and efficiency. Efficiency analysis shows that the per-video API monetary cost remains low under the pricing configuration used in this study, while API latency is dominated by the Qwen3.5-27B LLM calls for evidence structuring and final fusion. Overall, the proposed framework offers an effective and extensible solution for structured product recognition in complex live-streaming scenarios.
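
As an illustration of the kind of scoring Strategy A describes, the following is a minimal sketch that ranks candidate frames by a weighted sum of detection confidence and Laplacian-variance sharpness. The abstract does not specify how the two terms are combined, so the function names, the weight `alpha`, and the min-max normalization are all assumptions made for this example, not the authors' formulation. The detector is also assumed to have already produced a grayscale product crop and a confidence score for each frame.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian response; higher means sharper."""
    g = gray.astype(np.float64)
    # Discrete Laplacian on interior pixels: up + down + left + right - 4 * center.
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def select_keyframe(crops, confidences, alpha=0.5):
    """Return (best_index, scores) for candidate frames.

    crops:       list of 2-D grayscale arrays (detected product regions)
    confidences: detector confidence per frame, assumed to lie in [0, 1]
    alpha:       hypothetical weight on confidence (not from the paper)
    """
    sharp = np.array([laplacian_variance(c) for c in crops])
    # Assumption: min-max normalise sharpness across the candidate set so it
    # is on the same [0, 1] scale as the detection confidence.
    span = sharp.max() - sharp.min()
    sharp_norm = (sharp - sharp.min()) / span if span > 0 else np.zeros_like(sharp)
    scores = alpha * np.asarray(confidences, dtype=np.float64) + (1.0 - alpha) * sharp_norm
    return int(np.argmax(scores)), scores

# Usage: a flat (blurry-looking) crop with slightly higher confidence
# versus a high-contrast checkerboard crop with slightly lower confidence.
flat = np.full((16, 16), 128.0)
checker = (np.indices((16, 16)).sum(axis=0) % 2) * 255.0
best, scores = select_keyframe([flat, checker], [0.9, 0.8])
print(best, scores)
```

With equal weights, the sharper crop wins despite its lower detection confidence. This is the trade-off such a combined score is meant to capture. Strategy B would replace the Laplacian term with the learned quality estimate.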