Machine learning-based property price prediction in an emerging market: evidence from Islamabad, Pakistan
Hassan Raza
Purpose
This study develops and evaluates machine-learning (ML) models to predict residential property prices in Islamabad, Pakistan, addressing the acute information asymmetry and valuation opacity that characterize real estate markets in emerging economies. It also offers a theoretical refinement of the hedonic pricing framework for markets with within-neighborhood quality heterogeneity.
Design/methodology/approach
A data set of 116,033 property listings was collected from an online real-estate portal. After preprocessing, which included standardizing locally formatted price notations (Crore/Lakh) and heterogeneous area units (Kanal/Marla), a final sample of 59,035 records was analyzed. Six regression algorithms were trained and evaluated in Python (scikit-learn and the XGBoost package): Linear Regression, Support Vector Regression (SVR), Decision Tree, Random Forest, Gradient Boosting and Extreme Gradient Boosting (XGBoost). The best-performing model was further refined through grid-search hyperparameter optimization, and feature attribution was analyzed using both tree-split importance and Shapley Additive exPlanations (SHAP). Predictive accuracy was assessed using mean absolute error (MAE) and root mean squared error (RMSE) on a held-out test set; goodness-of-fit was summarized using the coefficient of determination (R²).
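The unit standardization step described above can be sketched as follows. This is a minimal illustration, not the study's actual preprocessing code: the function names and the assumed input format (a number followed by a unit word, such as "2.5 Crore" or "10 Marla") are hypothetical. The conversion factors themselves are standard: 1 Crore = 10,000,000, 1 Lakh = 100,000 and 1 Kanal = 20 Marla. Areas are normalized to Marla rather than to square feet because the square-foot size of a Marla varies by locality.

```python
import re

CRORE = 10_000_000      # 1 Crore = 10^7 PKR
LAKH = 100_000          # 1 Lakh  = 10^5 PKR
MARLA_PER_KANAL = 20    # 1 Kanal = 20 Marla

_PRICE_UNITS = {"crore": CRORE, "lakh": LAKH}
_AREA_UNITS = {"kanal": MARLA_PER_KANAL, "marla": 1}

# Assumed listing format: "<number> <unit>", e.g. "2.5 Crore" or "10 Marla".
_PATTERN = re.compile(r"\s*([\d.,]+)\s*([A-Za-z]+)\s*")


def _parse(text, units):
    """Return the numeric value scaled by its unit, or None if unparseable."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        return None
    number, unit = match.groups()
    factor = units.get(unit.lower())
    if factor is None:
        return None
    try:
        return float(number.replace(",", "")) * factor
    except ValueError:
        # Malformed numbers such as "1.2.3" are treated as missing.
        return None


def parse_price_pkr(text):
    """Hypothetical helper: convert a Crore/Lakh price string to PKR."""
    return _parse(text, _PRICE_UNITS)


def parse_area_marla(text):
    """Hypothetical helper: convert a Kanal/Marla area string to Marla."""
    return _parse(text, _AREA_UNITS)
```

Returning None for unparseable strings keeps the decision about dropping or imputing such records separate from the conversion itself.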
Findings
Tuned XGBoost delivered the lowest test-set error (MAE = PKR 6.54m; RMSE = PKR 23.37m; R² = 0.698, cross-validated R² = 0.711 ± 0.028) and substantially outperformed the linear baseline (test MAE = PKR 19.03m; R² = 0.251). Tuning reduced MAE by approximately 13% relative to the untuned XGBoost benchmark. Feature-attribution analysis yielded a more nuanced picture than the tree-split importance measure alone. Under tree-split importance, bathroom count (31.9%) appears to dominate; under SHAP, listing purpose (25.6%), property area (24.1%) and bathroom count (20.6%) emerge as three roughly comparable joint drivers, with location (14.2%) and bedrooms (10.4%) playing meaningful secondary roles. The authors therefore interpret bathroom count as one of three coequal price drivers rather than as a dominant predictor, consistent with a quality-proxy effect in a market where construction quality varies sharply within neighborhoods.
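For readers less familiar with the three metrics reported above, the sketch below computes them from their standard definitions in plain Python. The function names and the toy values in the usage note are illustrative only and are not drawn from the study's data; the study itself reports using scikit-learn.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

On toy inputs `y_true = [10, 20, 30, 40]` and `y_pred = [12, 18, 33, 37]`, these return an MAE of 2.5, an RMSE of about 2.55 and an R² of 0.948. Because RMSE squares each error before averaging, it is never smaller than MAE and grows faster when a few predictions miss badly, so a large gap between the two usually indicates a small number of large errors.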
Research limitations/implications
The model uses six features available from online listings; the absence of property age, construction quality and precise geospatial coordinates constrains explanatory power. Label encoding of the location variable does not preserve spatial structure, so the estimated importance of location is a lower bound. The 49.1% sample attrition due to missing values may introduce selection bias. The analysis is cross-sectional and confined to Islamabad.
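The attrition figure follows directly from the two sample sizes reported in the design section, as this short check shows. The variable names are illustrative.

```python
collected = 116_033   # listings collected from the portal
analyzed = 59_035     # records remaining after preprocessing

dropped = collected - analyzed
attrition = dropped / collected

print(f"{dropped:,} listings dropped ({attrition:.1%})")
# prints: 56,998 listings dropped (49.1%)
```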
Practical implications
With a tuned MAE of approximately 22% of the mean listing price, the model is positioned as a screening-grade automated valuation tool rather than as a stand-alone appraisal. It is suitable for flagging mispriced listings, supporting buyer due diligence and providing a starting point for human appraisers. Three feasible policy steps are proposed: a standardized listing schema at the portal level; linking Federal Board of Revenue valuation tables to a regularly updated ML reference; and opening anonymized transaction data from provincial property registries to qualified researchers.
Originality/value
To the best of the authors’ knowledge, this paper provides the first large-scale, six-algorithm benchmark for property valuation in Pakistan. Its theoretical contribution is to refine the Rosen (1974) hedonic framework for quality-heterogeneous emerging markets, where quality-proxy attributes assume a role normally absorbed into the location premium.