Spatially Aware Pair Proposal for Panoptic Scene Graph Generation
Hanzhu Dai, Qiang Zhang, Binghao Wang, Mai Liu

Images captured by vision sensors provide visual evidence for scene understanding, including object appearances, pixel-level regions, and spatial relations among entities. Panoptic Scene Graph Generation (PSG) constructs structured scene representations by grounding visual entities with panoptic masks and predicting relationships among objects and regions. In pair-then-relation PSG pipelines, subject–object pair recall is critical to final triplet recall. However, existing pair proposal approaches mainly score candidate subject–object pairs based on object–query feature matching, while mask-derived spatial cues such as object locations, relative geometry, and local layouts remain underexplored. Consequently, ground-truth subject–object pairs may be excluded from the Top-Kr proposals before relation decoding. To address this problem, this paper proposes a Spatially Aware Pair Proposal Model (SAPPM), which incorporates mask-derived soft centroids, relative geometry, and local-neighborhood context into pair scoring. SAPPM uses Grouped Vector Attention (GVA) to model local spatial interactions and introduces a spatially adaptive gating module to calibrate spatial-branch contributions. Experiments on the PSG dataset under the Scene Graph Detection (SGDet) protocol show that SAPPM achieves competitive performance, reaching 32.53 R@20 and 27.36 mR@20. These results indicate that SAPPM improves PSG performance by enhancing ground-truth pair coverage in the candidate proposal set.
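The spatial cues and the pair-coverage criterion named above can be illustrated with a minimal sketch. It is not the SAPPM implementation. It assumes that a soft centroid is the probability-weighted mean pixel position of a mask, that relative geometry is the offset, distance, and angle between two centroids, and that pair recall is the fraction of ground-truth pairs surviving a top-k cut. The function names `soft_centroid`, `relative_geometry`, and `pair_recall_at_k` are hypothetical and chosen for illustration only.

```python
import math
from typing import Dict, Sequence, Set, Tuple

Pair = Tuple[int, int]  # (subject index, object index)


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_centroid(mask_logits: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Probability-weighted mean (x, y) of an H x W mask, normalized to [0, 1].

    Assumption: "soft" means weighting each pixel centre by its sigmoid mask
    probability instead of thresholding the mask first.
    """
    h, w = len(mask_logits), len(mask_logits[0])
    total = cx = cy = 0.0
    for i, row in enumerate(mask_logits):
        for j, logit in enumerate(row):
            p = _sigmoid(logit)
            total += p
            cx += p * (j + 0.5) / w
            cy += p * (i + 0.5) / h
    if total == 0.0:  # degenerate mask: fall back to the image centre
        return 0.5, 0.5
    return cx / total, cy / total


def relative_geometry(
    subj: Tuple[float, float], obj: Tuple[float, float]
) -> Tuple[float, float, float, float]:
    """Offset (dx, dy), Euclidean distance, and angle from subject to object."""
    dx, dy = obj[0] - subj[0], obj[1] - subj[1]
    return dx, dy, math.hypot(dx, dy), math.atan2(dy, dx)


def pair_recall_at_k(scores: Dict[Pair, float], gt_pairs: Set[Pair], k: int) -> float:
    """Fraction of ground-truth pairs kept among the k highest-scoring pairs."""
    if not gt_pairs:
        raise ValueError("gt_pairs must be non-empty")
    kept = set(sorted(scores, key=scores.get, reverse=True)[:k])
    return len(kept & gt_pairs) / len(gt_pairs)


if __name__ == "__main__":
    # Toy 2 x 2 masks: one concentrated top-left, one bottom-right.
    subj = soft_centroid([[10.0, -10.0], [-10.0, -10.0]])
    obj = soft_centroid([[-10.0, -10.0], [-10.0, 10.0]])
    print(relative_geometry(subj, obj))

    # Toy pair scores: a ground-truth pair ranked below the cut is lost
    # before relation decoding, which lowers pair recall.
    scores = {(0, 1): 0.9, (0, 2): 0.5, (1, 0): 0.2}
    print(pair_recall_at_k(scores, {(0, 1), (1, 0)}, k=2))
```

In this toy setting, the pair `(1, 0)` is a ground-truth pair but falls outside the top two proposals, so it can never be recovered by the relation decoder. This is the failure mode that spatially informed pair scoring is intended to reduce.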