Collaborative Vision-and-Language Navigation for UAVs in Low-Altitude Urban Space Leveraging Embodied Multi-Agent Systems
Dongyang Wang, Jiankun Shi, Yantao Lu, Jinchao Chen, Chenglie Du

Large vision–language models have advanced embodied navigation by integrating visual perception with natural-language reasoning. However, vision-and-language navigation (VLN) for unmanned aerial vehicles (UAVs) in low-altitude urban airspace remains challenging due to occluded views, dynamic layouts, limited communication bandwidth, and partial observability. Existing methods mainly focus on single-agent egocentric navigation and lack explicit modeling of uncertainty and inter-agent dependencies in collaborative multi-UAV settings. We propose Collaborative Low-Altitude Space Navigation (Co-LASN), a dynamic Bayesian network-based framework for collaborative VLN in embodied multi-agent systems. Co-LASN jointly models environmental dynamics, linguistic constraints, and inter-agent dependencies in a unified probabilistic representation, allowing each UAV to update its belief state and incorporate information from neighboring agents when making navigation decisions. Experiments on a low-altitude subset of the HaL-13k benchmark show that, under the evaluated simulation protocol, Co-LASN achieves better navigation metrics than single-agent and partially collaborative baselines. In the 3-agent setting, Co-LASN increases the any-success rate (ASR) from 12.37% to 15.23% and reduces the minimum navigation error (MNE) from 99.86 to 89.46. These results demonstrate the relative effectiveness of belief-aware collaboration within the evaluated simulation setting.
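To make the idea of a belief update that also incorporates neighboring agents' information concrete, the following is a minimal illustrative sketch of one discrete Bayesian filtering step for a single UAV. It is not the Co-LASN model itself: the grid-cell state space, the function names (`predict`, `update`, `fuse`, `step`), the log-linear pooling rule used for fusion, and the `weight` parameter are all assumptions introduced here for illustration only.

```python
import numpy as np


def normalize(p):
    """Rescale a non-negative vector so it sums to 1."""
    p = np.asarray(p, dtype=float)
    total = p.sum()
    if total <= 0:
        raise ValueError("belief has no probability mass")
    return p / total


def predict(belief, transition):
    """Prediction step: transition[i, j] = P(next state j | current state i)."""
    return normalize(np.asarray(belief, dtype=float) @ transition)


def update(belief, likelihood):
    """Measurement step: likelihood[j] = P(observation | state j)."""
    return normalize(np.asarray(belief, dtype=float) * likelihood)


def fuse(own, neighbors, weight=0.5):
    """Combine the agent's belief with neighbor beliefs.

    Assumption: log-linear (weighted geometric) pooling, chosen here only
    because it is simple; the paper's fusion rule may differ.
    """
    eps = 1e-12  # avoids log(0) for states with zero probability
    log_b = np.log(np.asarray(own, dtype=float) + eps)
    for nb in neighbors:
        log_b += weight * np.log(np.asarray(nb, dtype=float) + eps)
    # Subtract the maximum before exponentiating for numerical stability.
    return normalize(np.exp(log_b - log_b.max()))


def step(belief, transition, likelihood, neighbors):
    """One filtering step: predict, update with own observation, then fuse."""
    return fuse(update(predict(belief, transition), likelihood), neighbors)


# Hypothetical example: three candidate target cells, a static target,
# one local observation, and one neighbor that favors the middle cell.
prior = np.array([1 / 3, 1 / 3, 1 / 3])
transition = np.eye(3)
likelihood = np.array([0.2, 0.6, 0.2])
neighbor_belief = np.array([0.1, 0.8, 0.1])

posterior = step(prior, transition, likelihood, [neighbor_belief])
print(posterior)
```

In this toy setting the neighbor's belief agrees with the local observation, so the fused posterior concentrates more mass on the middle cell than the local update alone would. Setting `weight` to 0, or passing an empty neighbor list, recovers a purely single-agent filter.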