DOI: 10.3390/buildings16132517 ISSN: 2075-5309

Explainable Multi-Factor Cost Overrun Prediction Using an Integrated Construction Dataset: A SHAP-Based Analysis of Cross-Domain Interactions

Joosung Lee, Wonjun Park

Cost overrun remains a pervasive issue in building construction projects, yet most predictive studies operate within a single data domain, ignoring the systemic interactions across project, schedule, resource, quality, and safety dimensions. This study quantifies the incremental predictive value of integrating these five construction data domains and identifies the cross-domain interaction patterns that explain prediction accuracy. As a simulation-based methodological study, an integrated dataset of 100,000 records was synthesised with theory-grounded causal structures derived from the construction management literature; no real project data were used. Gradient Boosting (GB), Random Forest (RF), and Linear Regression were evaluated on an 80/20 hold-out test split, with robustness verified through alternative domain orderings and hyperparameter sensitivity. SHAP analysis, including exact interaction values, was used to interpret feature importance and cross-domain synergies. The full five-domain GB model achieved R² ≈ 0.97 and MAPE ≈ 6%, a 220% relative R² improvement over the Project-domain baseline (R² rising from 0.305 to 0.975), and this result was robust across three ordering schemes. Schedule and Quality contributed the largest marginal gains (ΔR² = +0.312 and +0.255, respectively), whereas Resource integration yielded approximately one thirty-first of Schedule's return. Because the dataset is synthetic, the results are interpreted as a methodological demonstration rather than empirical evidence from real projects; they provide a reusable framework for prioritising data-integration investment and show that, within the simulated causal structure, cross-domain interactions (particularly Schedule × Risk and Project Type × Change Cost) carry predictive information that single-domain analyses cannot recover. Validation on real, partially integrated datasets is identified as essential future work.
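The headline figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function name `relative_improvement` and the constant names are illustrative assumptions, and the only inputs are the values quoted in the abstract. The Resource gain is back-calculated from the "one thirty-first" statement and is therefore approximate.

```python
def relative_improvement(baseline: float, full: float) -> float:
    """Return the gain of `full` over `baseline` as a fraction of the baseline."""
    return (full - baseline) / baseline


# Values quoted in the abstract (names are hypothetical, chosen for this sketch).
R2_PROJECT_ONLY = 0.305    # Project-domain baseline
R2_FIVE_DOMAIN = 0.975     # full five-domain Gradient Boosting model
DELTA_R2_SCHEDULE = 0.312  # marginal gain from adding the Schedule domain

# Relative improvement of the five-domain model over the baseline.
rel_gain = relative_improvement(R2_PROJECT_ONLY, R2_FIVE_DOMAIN)

# Absolute gain in R2 across all four added domains.
absolute_gain = R2_FIVE_DOMAIN - R2_PROJECT_ONLY

# Assumption: "approximately one thirty-first of Schedule's return" is read
# literally as DELTA_R2_SCHEDULE / 31; the exact Resource value is not quoted.
approx_delta_r2_resource = DELTA_R2_SCHEDULE / 31

print(f"relative R2 improvement: {rel_gain:.0%}")
print(f"absolute R2 gain: {absolute_gain:.3f}")
print(f"approximate Resource delta R2: {approx_delta_r2_resource:.3f}")
```

Under these assumptions, the relative improvement is consistent with the reported 220%, and the implied Resource contribution is on the order of 0.01 in R².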
