DOI: 10.1002/stvr.70026 ISSN: 0960-0833

Rebalancing Software Defect Datasets via Mutation: Performance Insights From Prediction Models Based on Software Measures

Dinçer Güner, Görkem Giray, Onur Demirörs

ABSTRACT

Highly imbalanced training datasets considerably degrade the performance of software defect predictors. Software defect datasets tend to have defective samples as the minority class. We rebalance training datasets by creating additional defects using mutation operators, an approach we call the mutation‐based approach (MBA). We aim to assess the performance of defect predictors built using software measures (such as cyclomatic complexity) obtained from the training datasets rebalanced by MBA. We conducted experiments using 27 releases of nine open‐source projects obtained from the PROMISE repository. We mutated these datasets using MBA. We built predictors using training datasets that were left unchanged (baseline), mutated using MBA, or balanced using one of five other oversampling methods. We compared the performance of these predictors, built using software measures, across the inter‐release defect prediction (IRDP) and cross‐project defect prediction (CPDP) scenarios. No single combination of ML algorithm and rebalancing method consistently performed best across all performance measures in either the IRDP or the CPDP scenario. In the IRDP scenario, MBA achieved the highest median recall for four of the seven datasets evaluated, while in the CPDP scenario it consistently achieved the highest recall across all datasets, outperforming the baseline across all ML algorithms except naïve Bayes. However, these recall gains were accompanied in both scenarios by increased false alarm rates and reduced precision relative to the baseline, representing a shift along the recall–precision trade‐off curve rather than an improvement in overall prediction quality. MBA did not outperform the baseline or other sampling techniques on MCC and AUC‐ROC for most datasets. The performance profile of any rebalancing approach therefore depends substantially on the dataset, the ML algorithm and the performance measure prioritized, making the choice among methods an inherently context‐dependent decision.
Although MBA did not consistently outperform the other methods across all performance measures, our hypothesis of rebalancing training datasets through mutations merits further investigation. Future work includes experiments using more diverse sets of mutation operators and measures, datasets coded in different programming languages and other code representation schemes, such as Abstract Syntax Trees and images.
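
To illustrate the trade‐off described above, the following minimal sketch computes recall, precision, false alarm rate and MCC from confusion‐matrix counts. The counts are hypothetical and chosen only for illustration; they are not results from this study, and the function name `measures` is an assumption rather than part of the authors' tooling.

```python
import math


def measures(tp, fp, fn, tn):
    """Return recall, precision, false alarm rate and MCC for one confusion matrix."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    false_alarm = fp / (fp + tn) if (fp + tn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "false_alarm": false_alarm,
        "mcc": mcc,
    }


# Hypothetical test set: 100 modules, 20 of them defective.
# "baseline" stands for a predictor trained on the unchanged data,
# "rebalanced" for one trained on an oversampled training set.
baseline = measures(tp=8, fp=4, fn=12, tn=76)
rebalanced = measures(tp=16, fp=32, fn=4, tn=48)

for name, result in (("baseline", baseline), ("rebalanced", rebalanced)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

With these assumed counts, the rebalanced predictor finds more of the defective modules but raises more false alarms, so recall rises while precision and MCC fall. This is the kind of shift along the trade‐off curve, as opposed to an overall improvement, that the abstract reports.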
