Game-Theoretic Multi-LLM Collaboration for Attribute-Aware Open-Vocabulary Object Detection
Risen Sheng, Jinming Pan, Zhuo Zeng, Hao Chen, Wenzhi Cao

Open-vocabulary object detection (OVD) fails at attribute-level discrimination: when instances share a class label yet differ in color, material, or texture, category names provide no appearance-specific cues. Prior attempts to enrich text inputs with LLM-generated descriptions are limited by single-model distribution bias, producing coverage gaps and unstable attribute quality. We propose a Concept Expander framework built on cooperative multi-LLM game theory. Three heterogeneous LLMs generate candidate attributes in parallel; a cooperative Nash equilibrium then selects the final subset by maximizing each model’s minimum utility gain, jointly enforcing semantic quality and cross-source diversity without amplifying any single model’s bias. The resulting Concept Repository contains approximately 5000 discriminative visual priors. A lightweight retrieval module injects the top-k matched attributes into region-level visual features via residual fusion, preserving CLIP’s pretrained alignment while enriching instance representations with fine-grained semantic priors. A semantic consistency loss anchors enhanced features to ground-truth class semantics throughout training. On LVIS, rare-category APr rises from 22.2 to 28.5; on RefCOCO, attribute-conditioned localization accuracy reaches 54.8, confirming that structured multi-LLM semantic priors improve discrimination across long-tail and high-confusion benchmarks.
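
The two mechanisms named in the abstract, max-min attribute selection and residual fusion of top-k attributes, can be illustrated with a minimal sketch. The abstract does not specify the utility function, the equilibrium solver, or the fusion weights, so everything below is an assumption made for illustration: a greedy leximin heuristic stands in for the cooperative equilibrium computation, each model's utility is taken to be the summed quality of its own selected attributes, and the names `leximin_select`, `residual_fuse`, `quality`, `alpha`, and `k` are hypothetical.

```python
import math


def leximin_select(candidates, sources, budget):
    """Greedy stand-in for max-min selection (assumed, not the paper's solver).

    Each candidate is a dict with hypothetical keys "source" and "quality".
    A source's utility is assumed to be the summed quality of its selected
    attributes. At each step we add the candidate whose resulting utility
    vector, sorted ascending, is lexicographically largest, so the
    worst-off source is raised first.
    """
    selected, remaining = [], list(candidates)
    utility = {s: 0.0 for s in sources}

    def sorted_utilities_if_added(c):
        trial = dict(utility)
        trial[c["source"]] += c["quality"]
        return tuple(sorted(trial.values()))

    while remaining and len(selected) < budget:
        best = max(remaining, key=sorted_utilities_if_added)
        utility[best["source"]] += best["quality"]
        selected.append(best)
        remaining.remove(best)
    return selected, utility


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def residual_fuse(region, attr_embs, k=2, alpha=0.1):
    """Add the scaled mean of the top-k most similar attribute embeddings.

    With alpha = 0 the region feature is returned unchanged, which is the
    sense in which a residual path leaves the pretrained feature intact.
    """
    top = sorted(attr_embs, key=lambda e: cosine(region, e), reverse=True)[:k]
    if not top:
        return list(region)
    mean = [sum(col) / len(top) for col in zip(*top)]
    return [r + alpha * m for r, m in zip(region, mean)]


# Toy data: three hypothetical LLM sources with made-up quality scores.
candidates = [
    {"text": "glossy red paint", "source": "llm_a", "quality": 0.9},
    {"text": "matte red paint", "source": "llm_a", "quality": 0.8},
    {"text": "brushed metal surface", "source": "llm_b", "quality": 0.6},
    {"text": "woven fabric texture", "source": "llm_c", "quality": 0.5},
]
selected, utility = leximin_select(candidates, ["llm_a", "llm_b", "llm_c"], budget=3)
fused = residual_fuse([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
```

In this toy run the second-best attribute from `llm_a` is passed over in favor of lower-scoring attributes from the other two sources, because adding it would not raise the utility of the worst-off source. That is the diversity effect a max-min objective is meant to produce; a pure quality ranking would instead pick both `llm_a` attributes.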