A Two-Stage Hurdle Gradient-Boosting Framework for Zero-Inflated Customer Lifetime Value Prediction and Segmentation
Chung-Yi Lin, Yuh-Min Chen, Chia-Chen Kuo, Chun-En Yen, Yu-Yao Lo

This study proposes a two-stage Hurdle machine-learning framework for Customer Lifetime Value (CLV) prediction under zero-inflated non-contractual retail settings, where conventional single-stage approaches may suffer from prediction instability and retransformation issues when zero and non-zero spending are jointly modeled. Using the UCI Online Retail II dataset, comprising 4026 customers with a 62.5% zero-spending rate, Stage 1 employs XGBoost to estimate purchase occurrence probability, while Stage 2 applies gradient-boosting regressors to predict conditional spending intensity. The inverse hyperbolic sine (arcsinh) transformation handles 59 customers with negative net spending from product returns. The Two-stage CatBoost model achieves a coefficient of determination of 0.522, outperforming the best single-stage mean-squared-error (MSE) model (0.385), the default Tweedie-loss baseline in the main 30-seed comparison (0.309), and the Beta-Geometric/Negative Binomial Distribution (BG/NBD) baseline (0.395). The contribution combines architectural innovation with a comprehensive validation protocol (5 × 2 CV paired t-tests, out-of-time validation, and SHAP interpretability) and confirms that purchase frequency drives occurrence probability while monetary value dominates spending magnitude. A dual-dimension segmentation based on purchase probability (P) and conditional spending intensity (E) identifies 96 Dormant, High-E customers with only a 26% purchase rate despite high expected spending, demonstrating that high conditional spending does not guarantee purchase occurrence.
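
The way the two stages and the arcsinh transformation fit together can be illustrated with a minimal sketch. It is not the authors' implementation: the stage outputs are passed in as plain numbers rather than produced by XGBoost or CatBoost, the back-transformation is a naive `sinh` with no retransformation correction, and every name (`HurdlePrediction`, `segment`), threshold (`p_cut`, `e_cut`), and segment label other than "Dormant, High-E" is a hypothetical choice made for illustration.

```python
import math
from dataclasses import dataclass


def to_arcsinh(spend: float) -> float:
    """Map net spending to the arcsinh scale; defined for zero and negative values."""
    return math.asinh(spend)


def from_arcsinh(z: float) -> float:
    """Naive inverse of the arcsinh transform (assumption: no bias correction)."""
    return math.sinh(z)


@dataclass
class HurdlePrediction:
    p_purchase: float  # Stage 1 output: estimated probability of any purchase (P)
    z_spend: float     # Stage 2 output: predicted conditional spend on the arcsinh scale

    @property
    def conditional_spend(self) -> float:
        """Conditional spending intensity (E), back on the original monetary scale."""
        return from_arcsinh(self.z_spend)

    @property
    def expected_clv(self) -> float:
        """Hurdle combination: P(purchase) times spend given a purchase."""
        return self.p_purchase * self.conditional_spend


def segment(pred: HurdlePrediction, p_cut: float = 0.5, e_cut: float = 500.0) -> str:
    """Dual-dimension label from P and E; both cut-offs are hypothetical."""
    p_label = "Active" if pred.p_purchase >= p_cut else "Dormant"
    e_label = "High-E" if pred.conditional_spend >= e_cut else "Low-E"
    return f"{p_label}, {e_label}"


# Illustrative values only (not taken from the dataset).
example = HurdlePrediction(p_purchase=0.26, z_spend=to_arcsinh(1000.0))
print(segment(example), round(example.expected_clv, 2))
```

Unlike a log transform, arcsinh accepts the negative net spending that product returns create, so those customers stay in the Stage 2 training set. Keeping P and E as separate fields, instead of storing only their product, is what lets the segmentation separate a customer with high conditional spending from one who is also likely to purchase.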