Improving Data Leakage Detection in Machine Learning Notebooks through Static Slicing and Structured LLM Prompts

DOI: 10.1145/3808199 ISSN: 2994-970X

Taha Draoui, Mohamed Wiem Mkaouer, Christian Newman

Data leakage remains a critical yet under-diagnosed issue in machine learning pipelines, leading to inflated results and unreliable deployments. Existing detection approaches rely on static rules that often miss open-coded manipulations and fail to capture the diversity of real-world notebooks. This paper introduces a novel methodology that integrates static slicing with large language models (LLMs) to improve leakage detection. We use a Datalog-based static analysis that isolates compact, provenance-aware slices corresponding to model training and evaluation pairs, and we pair these with structured LLM prompts that guide step-by-step reasoning about potential leakage for each isolated slice. Evaluated on a curated benchmark of Python notebooks from Kaggle and GitHub, our approach achieves state-of-the-art performance in both preprocessing and overlap leakage detection, improving F1 scores over the previous state-of-the-art by 22% for preprocessing leakage and 15% for overlap leakage. Beyond these improvements, our slicing-based methodology substantially outperforms end-to-end prompting, demonstrating that precise program slicing is key to enabling LLMs to reliably detect leakage. Our findings highlight the effectiveness of combining program slicing and prompt engineering for data leakage detection and establish the first systematic LLM-based solution for detecting data leakage in machine learning code.
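To make the two leakage categories named above concrete, the following toy sketch contrasts a leaky and a leak-free preprocessing step and shows a simple train/test overlap check. It is an illustration only and is not taken from the paper: the data, the variable names, and the helper `overlapping_rows` are hypothetical, and the paper's actual pipeline (Datalog-based slicing plus structured LLM prompts) is not reproduced here.

```python
from statistics import mean

# Toy data (hypothetical): a single numeric feature, split 4 / 2.
values = [1.0, 2.0, 3.0, 4.0, 100.0, 200.0]
train, test = values[:4], values[4:]

# Preprocessing leakage: the centering statistic is computed on ALL rows,
# so information from the test rows flows into the training features.
leaky_center = mean(values)
leaky_train = [v - leaky_center for v in train]

# Leak-free variant: the statistic comes from the training rows only and is
# then reused, unchanged, on the test rows.
clean_center = mean(train)
clean_train = [v - clean_center for v in train]
clean_test = [v - clean_center for v in test]


def overlapping_rows(train_rows, test_rows):
    """Return the test rows that also occur in the training rows."""
    seen = set(train_rows)
    return [row for row in test_rows if row in seen]


# Overlap leakage: a row used for evaluation also appears in the training set.
overlap_train = values[:5]  # accidentally includes 100.0, which is a test row
overlap = overlapping_rows(overlap_train, test)
```

In the leaky variant the training features depend on the test values (the center shifts when the test rows change), which is the kind of dependency a leakage detector has to trace from the evaluation call back to the preprocessing step.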
