Balancing Dropout Candidate Coverage and Counseling Burden in University Student Dropout Prediction
Kwan Woo Kim, Cheolgi Kim, Hyeon Gyu Kim

Many universities have developed machine learning models for student dropout prediction, and these models are commonly evaluated using accuracy-oriented metrics such as the F1 score. However, the model with the highest F1 score does not necessarily provide the most useful dropout candidate list when the institutional objective is to identify actual dropout students under limited counseling resources. This study investigates the trade-off between dropout candidate coverage and counseling burden in university student dropout prediction. We discuss how to determine the target coverage and model-specific classification thresholds required to find a solution model that provides a balanced candidate list. The proposed method uses the model with the highest recall to determine the target coverage and selects the model with the highest precision after threshold adjustment to provide a smaller candidate list under the target coverage. To reduce the risk of test-set-based threshold optimization, the target coverage, model-specific thresholds, and final solution model are determined using only the training/validation data, while the test dataset is reserved strictly for final evaluation. The method was validated using 28 candidate models implemented with various machine learning algorithms and sampling settings. In the main experimental split, the Light Gradient Boosting Machine model with the Synthetic Minority Oversampling Technique provided a candidate list with a dropout candidate coverage of 0.939 and a candidate-list size of 569. In repeated-seed analysis, the proposed method maintained dropout candidate coverage comparable to the threshold-tuned highest-F1 baseline while reducing false positives and candidate-list size. The results suggest that the proposed method provides a practical model selection strategy for improving candidate-list efficiency while reducing the risk of test-set-based threshold optimization.
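
The selection procedure summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`recall_precision`, `threshold_for_coverage`, `select_model`), the default threshold of 0.5 used to measure the initial recall, and the threshold search over observed validation scores are all assumptions made for illustration. The sketch takes validation labels and per-model predicted dropout probabilities, sets the target coverage to the highest recall among the models, lowers each model's threshold until it reaches that coverage, and then picks the model with the highest precision, which for a fixed coverage corresponds to the smallest candidate list.

```python
import numpy as np

def recall_precision(y_true, scores, threshold):
    """Return (recall, precision, candidate-list size) at a given threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores) >= threshold
    tp = int(np.sum(pred & (y_true == 1)))
    fn = int(np.sum(~pred & (y_true == 1)))
    fp = int(np.sum(pred & (y_true == 0)))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision, int(pred.sum())

def threshold_for_coverage(y_true, scores, target):
    """Highest observed score that, used as threshold, reaches the target recall.

    Assumption: candidate thresholds are the distinct validation scores.
    """
    scores = np.asarray(scores)
    for t in np.sort(np.unique(scores))[::-1]:
        recall, _, _ = recall_precision(y_true, scores, t)
        if recall >= target:
            return float(t)
    return float(scores.min())

def select_model(y_val, val_scores, default_threshold=0.5):
    """Pick the model with the highest precision at the target coverage.

    val_scores maps a (hypothetical) model name to its validation scores.
    Only validation data is used here; test data is never touched.
    """
    # Target coverage: best recall among all models at the default threshold.
    target = max(
        recall_precision(y_val, s, default_threshold)[0]
        for s in val_scores.values()
    )
    results = {}
    for name, s in val_scores.items():
        t = threshold_for_coverage(y_val, s, target)
        recall, precision, size = recall_precision(y_val, s, t)
        results[name] = {
            "threshold": t,
            "recall": recall,
            "precision": precision,
            "list_size": size,
        }
    best = max(results, key=lambda k: results[k]["precision"])
    return target, best, results

# Toy validation data (made up for illustration): 4 dropouts, 6 non-dropouts.
y_val = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
val_scores = {
    "A": np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.55, 0.3, 0.2, 0.1, 0.05]),
    "B": np.array([0.95, 0.85, 0.45, 0.40, 0.3, 0.2, 0.1, 0.1, 0.05, 0.02]),
}
target, best, results = select_model(y_val, val_scores)
print(target, best, results[best])
```

In this toy case, model A has the higher recall at the default threshold and therefore sets the target coverage, but model B reaches the same coverage with fewer false positives once its threshold is lowered, so B is selected. The chosen model and threshold would then be applied once to the held-out test set for final evaluation.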