A Two-Stage Coarse-to-Fine Framework for Sparse Crowd Density Prediction in Digital Twin-Based Safety Monitoring
Younghwan Jeong, SoHyeon Kim, Jinyoung Lee, Donghoon Lee, Taemin Hwang, Won Gi Choi

Crowd-related disasters in dense public spaces can turn hazardous within seconds, and they have repeatedly shown that reactive response alone is insufficient to minimize damage. This has intensified the need for monitoring systems that forecast congestion before it reaches a critical level. Digital twin platforms address this need by providing an operational substrate that represents crowd states on a unified bird’s-eye-view (BEV) grid, on which a predictive module can forecast where congestion will emerge. However, conventional AI-based single-stage dense prediction models are intrinsically ill-suited to this role: although crowd congestion is sparse in both space and time, these models apply uniform high-resolution computation across the entire BEV domain, which wastes computation and biases optimization toward the dominant background regions. In this paper, we propose a two-stage coarse-to-fine framework that serves as the predictive module of the digital twin and explicitly exploits the spatio-temporal sparsity of crowd congestion. The first stage, CoarseSTFormer, performs efficient global screening on a low-resolution BEV input to identify a set of density-critical candidate regions. The second stage, SparseQueryDecoder, reconstructs high-resolution responses only on the identified candidates rather than uniformly upsampling the entire BEV grid. In simulation environments with up to 20,000 pedestrian agents, the proposed framework matches the strongest dense baseline in reconstruction quality while delivering the most balanced variance profile across grid scales. At inference, it reduces GPU energy consumption by 1.9× to 5.0× and computational cost (FLOPs) by 3.8× to 54×, demonstrating its practicality as a resource-efficient predictive module that meets both accuracy and efficiency requirements.
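The control flow of the two stages can be illustrated with a minimal NumPy sketch. It is not the CoarseSTFormer or SparseQueryDecoder architecture: the function names `select_candidates` and `refine_sparse`, the top-k selection rule, the nearest-neighbour background fill, and the `refine_fn` callback (a stand-in for a learned decoder) are all assumptions made here for illustration. The sketch only shows how high-resolution work can be restricted to the candidate cells found on a coarse grid.

```python
import numpy as np

def select_candidates(coarse_scores, k):
    """Stage 1 (assumed rule): indices of the k highest-scoring coarse cells."""
    flat = np.argsort(coarse_scores, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, coarse_scores.shape)
    return list(zip(rows.tolist(), cols.tolist()))

def refine_sparse(coarse_scores, candidates, scale, refine_fn):
    """Stage 2 (assumed rule): refine only the candidate patches."""
    # Cheap nearest-neighbour upsampling serves as the background estimate.
    fine = np.kron(coarse_scores, np.ones((scale, scale)))
    for r, c in candidates:
        rs, cs = r * scale, c * scale
        # refine_fn is a hypothetical stand-in for a learned decoder.
        fine[rs:rs + scale, cs:cs + scale] = refine_fn(r, c, scale)
    return fine

# Toy 4x4 coarse grid with two congested cells; all other cells are empty.
coarse = np.zeros((4, 4))
coarse[1, 2] = 0.9
coarse[3, 0] = 0.7

cands = select_candidates(coarse, k=2)
fine = refine_sparse(coarse, cands, scale=4,
                     refine_fn=lambda r, c, s: np.full((s, s), 1.0))

# Fraction of the high-resolution grid that received refinement.
refined_fraction = len(cands) / coarse.size
```

In this toy setting only 2 of 16 coarse cells are refined, so the high-resolution step touches one eighth of the grid. The savings reported in the abstract depend on the actual models and workloads and are not reproduced by this sketch.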