Enhancing Early Academic Outcome Prediction in Small Educational Datasets Through Data Augmentation Techniques
Said El Kafhali, Zakaria Soufiane Hafdi

Early prediction of academic outcomes is vital for enabling timely intervention, supporting at-risk students, and improving educational planning and institutional performance. However, this task becomes particularly challenging when data availability is limited, such as in small or graduate-level programs. This study explores the potential of data augmentation techniques, specifically the Synthetic Minority Oversampling Technique (SMOTE), to enhance the performance of machine learning models applied to such constrained educational datasets. We conduct a comparative analysis using four datasets derived from prior research, each representing a distinct educational use case: one focused on predicting academic success in graduate programs, another on student dropout in virtual learning environments, a third on dissertation performance prediction, and a fourth addressing multi-class performance prediction in undergraduate coding courses. By applying the same machine learning methods to both the original and the augmented datasets, we systematically evaluate the impact of data augmentation on classification performance using accuracy, precision, recall, and the F1 score. The results demonstrate marked improvements, with accuracy increases of up to 21% and precision gains exceeding 25% in some models, notably k-nearest neighbors (KNN) and the multilayer perceptron (MLP). While not all algorithms benefit equally, our findings highlight data augmentation as a practical and impactful strategy for improving early prediction capabilities in Educational Data Mining (EDM). By leveraging multiple datasets and diverse educational contexts, this contribution provides robust evidence supporting the broader goal of enhancing decision-making and personalized support in digital learning environments.
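
The augmentation and evaluation steps summarized above can be illustrated with a minimal sketch of SMOTE-style interpolation and the four reported metrics. This is not the authors' implementation: the function names `smote` and `classification_metrics`, the default parameters, and the toy data are assumptions chosen for illustration, and the sketch uses only the Python standard library rather than any particular machine learning package.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples (hypothetical helper).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data (assumed, not taken from the study): 4 minority samples to be
# balanced against a majority class of 10 samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
n_majority = 10
synthetic = smote(minority, n_new=n_majority - len(minority), k=3)
balanced_minority = minority + synthetic

# Toy labels and predictions, only to exercise the metric computation.
metrics = classification_metrics([1, 0, 1, 1], [1, 0, 0, 1])
```

In a comparison of the kind described, oversampling would normally be applied to the training split only, so that the test set contains real records exclusively and the reported metrics are not inflated by synthetic samples.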