Data-Driven Defect Prediction for Manufacturing Quality Monitoring Under Class Imbalance and Missing Data: A Performance–Efficiency Trade-Off Analysis
Jung Kyu Park, Youngmi Baek

Manufacturing equipment logs are an important source of information for quality monitoring, but building reliable defect prediction models from such logs is still difficult in practice. Defective samples are rare, and many process variables are missing because measurements are recorded only under certain sensing or process conditions. These properties make defect prediction difficult and limit the usefulness of accuracy-based evaluation. This paper evaluates defect prediction using the Bosch Production Line Performance dataset, with a supplementary validation experiment on the semiconductor manufacturing process (SECOM) dataset. Two feature configurations are compared: a baseline representation using imputed numerical variables and a missingness-aware representation that adds feature-wise missing indicators and a sample-level missing ratio. Logistic Regression, Random Forest, and LightGBM are evaluated using validation-based threshold selection. To examine the effect of imputation choice, zero, median, and KNN imputation are also compared in the SECOM experiment. In the Bosch experiment, explicitly representing missingness improves PR-AUC for all tested model configurations. The supplementary SECOM experiment shows a more mixed pattern, suggesting that the usefulness of missingness-aware features depends on the dataset, imputation strategy, and model family. The latency analysis further shows a practical trade-off: Random Forest with missingness-aware features gives the highest PR-AUC on Bosch but has the highest inference latency, while LightGBM provides a more balanced choice when prediction performance and response time are considered together.
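
As an illustration of the two ideas named above, the following minimal sketch builds a missingness-aware representation (imputed values, feature-wise missing indicators, and a sample-level missing ratio) and selects a decision threshold on a validation split. It is not the authors' implementation. The function names `add_missingness_features` and `select_threshold` are hypothetical, the constant fill value of zero is only one of the imputation options mentioned, and the use of F1 as the selection criterion is an assumption, since the criterion is not stated here.

```python
import numpy as np


def add_missingness_features(X, fill_value=0.0):
    """Return [imputed values | per-feature missing indicators | row missing ratio].

    Assumption: missing entries are encoded as NaN and are filled with a
    constant (zero by default); median or KNN imputation could be swapped in.
    """
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)                                 # True where a value is missing
    indicators = mask.astype(float)                    # one 0/1 column per feature
    missing_ratio = mask.mean(axis=1, keepdims=True)   # fraction missing per sample
    X_imputed = np.where(mask, fill_value, X)
    return np.hstack([X_imputed, indicators, missing_ratio])


def select_threshold(y_true, scores):
    """Pick the score threshold that maximizes F1 on a validation split.

    Assumption: F1 is the selection criterion; any other validation metric
    could be substituted in the loop below.
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):                        # candidate thresholds, ascending
        pred = scores >= t
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


# Toy data: 2 samples, 3 process variables, NaN marks an unrecorded measurement.
X_toy = [[1.0, np.nan, 3.0],
         [np.nan, np.nan, 2.0]]
features = add_missingness_features(X_toy)             # shape (2, 7)

# Toy validation labels and model scores.
threshold, f1 = select_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With three input variables, the missingness-aware matrix has seven columns: three imputed values, three indicators, and one missing ratio. The threshold chosen on the validation split would then be applied unchanged to the test split, so that the operating point is not tuned on test data.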