Revealing digitally invisible groups through a machine learning approach using multi-source data
Wenlan Zhang, Chen Zhong, Faith Taylor, Yan Liu, Mark Pelling

Big data has emerged as a critical instrument for urban planning and development decision-making. However, its utility is constrained by its reliability and representativeness. The availability of big data varies significantly across space, time and socio-demographic groups, particularly in the Global South. This gives rise to digitally invisible groups (those who can neither contribute to nor benefit from digital data-informed decisions), which deepens existing inequalities and further marginalises already excluded populations. This study presents an example application, land use classification with data from different sources in a developing country context, to explore how certain community groups may be systematically underrepresented or overlooked in specific data and applications. We combine traditional geospatial data (satellite imagery, nighttime light imagery, building footprints) with large-scale, digitally generated data sources (geotagged Twitter posts, street view imagery), and apply a stepwise data integration approach using a random forest classifier. We focus on class-specific changes in performance to infer patterns of uneven data representation. By comparing model outputs across different data combinations, we assess how the inclusion or exclusion of specific datasets influences classification performance. Results indicate that informal settlement areas are underrepresented in geotagged Twitter data, and inaccessible neighbourhoods are poorly captured by street view imagery. Our findings show that reliance on a single data source can reinforce biases, while integrating complementary datasets can partially mitigate these gaps when guided by systematic evaluation. We recommend targeted primary data collection and participatory mapping to address persistent blind spots and improve the inclusiveness of data-informed urban decision-making.
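
The class-specific comparison described above can be illustrated with a minimal sketch. It is not the study's implementation: the function names `per_class_f1` and `f1_change`, the choice of F1 as the metric, the class labels, and the toy predictions are all assumptions made for illustration. The predictions stand in for the outputs of a classifier (such as a random forest) trained on two different data combinations.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {class: F1}, computed one-vs-rest for each class in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for c in sorted(set(y_true)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores

def f1_change(y_true, pred_base, pred_extended):
    """Per-class F1 of the extended data combination minus the baseline."""
    base = per_class_f1(y_true, pred_base)
    ext = per_class_f1(y_true, pred_extended)
    return {c: ext[c] - base[c] for c in base}

# Hypothetical toy data: land use labels and predictions from two models,
# one trained without and one with an additional (assumed) data source.
y_true    = ["formal", "formal", "informal", "informal", "commercial", "commercial"]
pred_base = ["formal", "formal", "formal",   "informal", "commercial", "formal"]
pred_ext  = ["formal", "formal", "formal",   "informal", "commercial", "commercial"]

changes = f1_change(y_true, pred_base, pred_ext)
for land_use, delta in changes.items():
    print(f"{land_use}: {delta:+.3f}")
```

In this toy case the added source improves the commercial and formal classes but leaves the informal class unchanged. Under the reasoning in the abstract, a pattern like this may suggest that the added source carries little information about that class, although a real analysis would need repeated runs and held-out data before drawing such a conclusion.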