Restoring grokking under structured label noise: audit efficiency and optimization-history dependence
Dylan Ashraf

Label noise is a pervasive problem in real-world machine learning, yet most theoretical and empirical treatments assume noise is distributed uniformly across the training set, an assumption rarely satisfied in practice. This study examined two questions about grokking (delayed generalization) under structured label noise: whether grokking recovery depends on the trajectory of training rather than only on the final corrected dataset, and how efficiently a fixed annotation budget can be spent to restore grokking. The modular addition task (predicting (a + b) mod p for p = 97) served as a controlled testbed. The central result is that grokking recovery is path-dependent. In temporal experiments, when corrupted labels were corrected midway through training and optimization continued from the existing state, models grokked 10,000–16,000 epochs faster than models trained on the identically corrected data but reinitialized and retrained from scratch; this held across all seeds and both noise types. Because the corrected dataset was identical in both cases, recovery speed cannot be explained by the data alone, indicating that optimization history influences subsequent generalization, a hysteresis effect not previously reported in the grokking literature. In parallel, an audit-budget framework was used to compare annotation-review policies, each selecting B examples to review from the full training set without prior knowledge of which labels were wrong. Loss-based triage achieved a ~45% hit rate at a budget of just 4% of the training set (a ~15× improvement over random auditing) and was the only policy that reliably restored grokking (3/3 seeds at B = 400); its advantage held across corruption rates η ∈ {0.01, 0.04, 0.08, 0.12} and generalized to a softmax-entropy scorer. Region-targeted auditing helped only when its assumed error geometry matched the true noise distribution, and conditions were found in which the identity of the corrected examples mattered more than their count. These results extend the study of grokking beyond whether label noise suppresses generalization to how and when generalization can be restored, and introduce optimization history as a variable in grokking dynamics.
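To make the audit-budget setup concrete, the following is a minimal sketch of the modular addition dataset and of a loss-based triage policy as described above: rank training examples by their current per-example loss, review the top B, and measure the hit rate as the fraction of reviewed examples whose labels were actually wrong. The function names (`make_modular_addition`, `loss_based_audit`, `hit_rate`) and the toy loss values are illustrative assumptions, not the study's code or data.

```python
def make_modular_addition(p):
    """All (a, b, label) triples for the task label = (a + b) mod p."""
    return [(a, b, (a + b) % p) for a in range(p) for b in range(p)]


def loss_based_audit(losses, budget):
    """Return the indices of the `budget` highest-loss examples.

    `losses` holds one per-example training loss per index; the policy
    does not know which labels are wrong (hypothetical helper).
    """
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:budget]


def hit_rate(audited, corrupted):
    """Fraction of audited indices that are truly corrupted."""
    if not audited:
        return 0.0
    return sum(1 for i in audited if i in corrupted) / len(audited)


# Toy illustration with made-up losses (not experimental data):
# examples 1 and 3 carry wrong labels and happen to have the largest loss.
toy_losses = [0.1, 2.3, 0.05, 1.7, 0.2]
toy_corrupted = {1, 3}
audited = loss_based_audit(toy_losses, budget=2)
print(audited, hit_rate(audited, toy_corrupted))
```

In a real run, the per-example losses would come from the partially trained model. A random-audit baseline would instead draw B indices uniformly, so its expected hit rate equals the corruption rate η.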