Hypothesis-Informed Feature Stability Scoring for High-Dimensional ETL Pipelines
Konstantin Piryankov, Iveta Grigorova, Aleksandar Karamfilov, Aleksandar Efremov

High-dimensional financial Extract–Transform–Load (ETL) pipelines often contain heterogeneous variables whose statistical properties may change between recurring data deliveries, affecting feature reliability before downstream machine learning models are trained. This study extends a previously proposed Canberra-based data drift monitoring framework by introducing a hypothesis-informed feature stability component for automated feature assessment and prioritization. Unlike the prior descriptive framework, which relied on univariate and bivariate exploratory metrics, the proposed extension adds an inferential layer and evaluates how this layer changes feature ranking relative to the original score and alternative marginal drift measures. The method combines univariate deviations in summary statistics, bivariate deviations in dependency-related metrics, and hypothesis-based evidence from Anderson–Darling, Mann–Whitney U, and Levene tests. The resulting p-values are aggregated using a Landau-calibrated harmonic mean p-value formulation and transformed into a bounded hypothesis score, which is integrated into a composite variable-level stability ranking. The framework operates on precomputed exploratory data analysis (EDA) outputs, enabling scalable comparison between a validated reference dataset and a current ETL delivery. The proposed extension provides an interpretable and computationally efficient mechanism for identifying unstable features and supporting feature review, exclusion, or prioritization in automated machine learning pipelines.
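
The two numerical ingredients named above, a Canberra-type deviation between summary statistics and a harmonic mean of per-test p-values, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the averaging of Canberra terms, the equal default weights, and the mapping `1 - p` used for the bounded score are all assumptions made for illustration. In particular, the sketch computes the raw harmonic mean p-value and omits the Landau calibration, and it takes the p-values as precomputed inputs rather than running the Anderson–Darling, Mann–Whitney U, or Levene tests itself.

```python
from typing import Optional, Sequence


def canberra_deviation(reference: Sequence[float], current: Sequence[float]) -> float:
    """Mean Canberra term between two equally long vectors of summary statistics.

    Each term |r - c| / (|r| + |c|) lies in [0, 1]; a pair where both values are
    zero contributes 0. Averaging the terms (an assumption of this sketch) keeps
    the result in [0, 1].
    """
    if len(reference) != len(current):
        raise ValueError("reference and current must have the same length")
    if len(reference) == 0:
        return 0.0
    total = 0.0
    for r, c in zip(reference, current):
        denom = abs(r) + abs(c)
        if denom > 0:
            total += abs(r - c) / denom
    return total / len(reference)


def harmonic_mean_p(p_values: Sequence[float],
                    weights: Optional[Sequence[float]] = None) -> float:
    """Raw weighted harmonic mean p-value (no Landau calibration applied)."""
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    if weights is None:
        weights = [1.0 / n] * n  # equal weights: an assumption of this sketch
    if len(weights) != n:
        raise ValueError("weights and p_values must have the same length")
    return sum(weights) / sum(w / p for w, p in zip(weights, p_values))


def hypothesis_score(p_values: Sequence[float]) -> float:
    """Hypothetical bounded score in [0, 1]; larger means stronger drift evidence.

    The mapping 1 - p is an illustrative choice, not taken from the source.
    """
    return 1.0 - min(1.0, harmonic_mean_p(p_values))


if __name__ == "__main__":
    # Hypothetical precomputed EDA outputs for one feature.
    ref_stats = [10.0, 2.0, 0.0]   # e.g. mean, std, skewness (reference)
    cur_stats = [12.0, 2.0, 0.0]   # same statistics in the current delivery
    test_p = [0.01, 0.2, 0.5]      # placeholder p-values from three tests
    print(canberra_deviation(ref_stats, cur_stats))
    print(harmonic_mean_p(test_p))
    print(hypothesis_score(test_p))
```

The harmonic mean is dominated by the smallest p-values, so a single strongly significant test pulls the combined value down even when the other tests are inconclusive. When all inputs are equal, for example three p-values of 0.5, the combined value equals that common value.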