The Data Gradient: Quantifying Dataset Influence in End-to-End Machine Learning Systems

DOI: 10.70389/pjcs.100014 ISSN: 2977-5973

Khadija Kamene

Training data quality is increasingly recognized as a bottleneck in machine learning system performance, yet principled methods for quantifying the contribution of individual samples remain computationally prohibitive or poorly integrated into practical workflows. We present the data gradient framework, which estimates per-sample influence by combining sample-level gradients with inverse-Hessian-vector products, computed via the LiSSA stochastic approximation for large models and conjugate gradient (CG) for smaller ones. The resulting vector $S = H_\theta^{-1} \nabla_\theta L_{\mathrm{val}}$ is computed once per refinement iteration and reused across all training samples; projecting it onto each per-sample gradient $g_i$ yields a scalar influence score $I_i = -g_i^{\top} S$ for each training instance. While the core estimator builds on the influence function formulation of Koh and Liang, the data gradient framework distinguishes itself through three contributions: (i) an end-to-end pipeline architecture that embeds influence estimation into iterative dataset refinement loops without offline post-processing; (ii) the dataset contribution score (DCS), a normalized aggregate metric for tracking dataset quality across refinement iterations; and (iii) empirical validation across image, text, and tabular modalities with multiple model families, including neural networks, gradient boosted trees (GBTs), and logistic regression.
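The estimator described above can be illustrated with a minimal NumPy sketch for L2-regularized logistic regression, one of the model families the abstract mentions. This is not the paper's implementation: the function names, the regularization strength `lam`, and the hand-written CG solver are assumptions made for illustration, and the LiSSA path for large models is not shown. The sketch solves $H_\theta S = \nabla_\theta L_{\mathrm{val}}$ once using only Hessian-vector products, then reuses $S$ for every training sample.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_grads(theta, X, y):
    """Gradient of the logistic loss for each row of X; shape (n, d)."""
    return (sigmoid(X @ theta) - y)[:, None] * X

def hvp(theta, X, v, lam):
    """Hessian-vector product of the mean L2-regularized logistic loss.

    The Hessian is never formed explicitly; lam (an assumed value) keeps it
    positive definite so that CG applies.
    """
    p = sigmoid(X @ theta)
    return X.T @ (p * (1.0 - p) * (X @ v)) / len(X) + lam * v

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=200):
    """Solve A x = b for symmetric positive-definite A given only x -> A x."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A x, with x = 0 initially
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def influence_scores(theta, X_train, y_train, X_val, y_val, lam=1e-2):
    """Return I_i = -g_i^T S for every training sample."""
    grad_val = per_sample_grads(theta, X_val, y_val).mean(axis=0)
    # S = H^{-1} grad_val, computed once and shared by all training samples.
    S = conjugate_gradient(lambda v: hvp(theta, X_train, v, lam), grad_val)
    G = per_sample_grads(theta, X_train, y_train)
    return -G @ S
```

Under the upweighting interpretation of Koh and Liang, a positive score suggests that upweighting the sample would increase validation loss; how the framework thresholds these scores during refinement is not stated in the abstract.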

Experiments on CIFAR-10, IMDB Reviews, and the UCI Adult dataset show that influence-guided removal of harmful samples improves classification accuracy by 1.3%–2.3% and F1-score by 0.03–0.04 points relative to unrefined baselines, with effect sizes (Cohen’s d) ranging from 0.41 to 1.12; all comparisons survive Wilcoxon signed-rank tests with Benjamini–Hochberg false discovery rate (FDR) correction across five seeds. Scalability analysis shows that stochastic Hessian-vector approximations keep the overhead of influence computation at approximately 0.44× training time for datasets of up to 100,000 samples (empirically validated on a V100 GPU). A sensitivity analysis over three refinement iterations confirms that DCS and test accuracy track each other, with no evidence of validation-set overfitting within this range. These results suggest that systematic, gradient-based dataset refinement offers a practical complement to model-centric optimization in data-centric machine learning workflows.
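For readers unfamiliar with the Benjamini–Hochberg correction mentioned above, the step-up procedure can be sketched as follows. The function name and the default `alpha=0.05` are assumptions for illustration; the abstract does not state the FDR level used or how the p-values were obtained.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: find the largest rank whose p-value meets its
        # threshold, then reject that hypothesis and all smaller p-values.
        k = np.nonzero(below)[0].max()
        reject[order[:k + 1]] = True
    return reject
```

Because the rule is step-up, a p-value that misses its own rank threshold can still be rejected when a larger p-value meets its threshold.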
