DOI: 10.1177/20552076261462744 ISSN: 2055-2076

From reporting gaps to hospital cost drivers to enhance digital health decision making: A machine learning-assisted analysis of national hospital data

Jiahui Ma, Ian Laga, Elizabeth Johnson

Objective

Hospitals across the United States face growing operational and financial strain, resulting in closures that threaten healthcare access and system resilience. This study aimed to identify significant predictors of hospital total facility expenditures and to evaluate the performance of multiple imputation methods for incomplete data in the American Hospital Association (AHA) Annual Survey Database.

Methods

The de-identified 2022–2023 AHA survey data (n=12,359), comprising 34 financial, structural, and operational features, were analyzed. Missing data were addressed using the Multivariate Imputation by Chained Equations (MICE) framework, comparing regression-based and machine learning–based algorithms. Random Forest (RF) imputation was selected for its superior accuracy based on fivefold cross-validation. Linear regression models were fitted on five RF-imputed datasets to identify key determinants of total facility expenditure (EXPTOT).
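The abstract does not state how the five regression fits were combined. A common choice in multiple imputation is Rubin's rules, which average the coefficient estimates and add the between-imputation variance to the average within-imputation variance. The sketch below assumes that approach; the function name pool_rubin and the numeric inputs are hypothetical illustrations, not values from the study.

```python
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one regression coefficient across m imputed datasets (Rubin's rules)."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    # Average of the squared standard errors: within-imputation variance.
    within = mean(se ** 2 for se in std_errors)
    # Sample variance of the estimates (divisor m - 1): between-imputation variance.
    between = variance(estimates)
    # Total variance inflates the between-imputation part by (1 + 1/m).
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total ** 0.5


# Hypothetical estimates of one coefficient from five imputed datasets.
coefs = [1.0, 1.2, 0.9, 1.1, 0.8]
ses = [0.10, 0.10, 0.10, 0.10, 0.10]
pooled_coef, pooled_se = pool_rubin(coefs, ses)
print(pooled_coef, pooled_se)
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates disagree across imputations, which is how the uncertainty introduced by the missing values carries into the final inference.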

Results

RF-based imputation achieved the lowest error and highest consistency across variable types. Regression results identified full-time registered nurses (FTRNTF), facility size (GFEET), and property and equipment costs (PLNTA) as the strongest predictors of hospital expenditure (p<0.001). Hospitals with community designations, oncology or research services, and Joint Commission accreditation had significantly higher expenditures, whereas rural and community trauma centers reported lower costs. Geographic visualization revealed substantial disparities in hospital resources and expenditures, especially in rural areas.

Conclusion

Machine learning–based multiple imputation improves data completeness and modeling accuracy for hospital operations research. Findings highlight critical cost drivers and geographic inequities, informing data-driven policymaking and resource allocation in health system management.
