Training PBertKla on an Integrated Multi-Source Dataset with a Machine-Learning Layer for Lysine Lactylation Site Prediction
Seung Beom Jin, Junghee Park, Summer Dabin Lee, Ji Hye Han, Seung-Hyun Myung, Kichul Park, Jisoo Yun
Lysine lactylation (Kla) is a recently discovered post-translational modification implicated in energy metabolism, cellular reprogramming, and disease progression. Here, we train the existing ProteinBERT-based predictor PBertKla on an integrated multi-source dataset and augment it with a lightweight machine-learning (ML) layer over sequence-derived features to predict Kla sites. On a common blind test set, the resulting model (PBertKla + ML) reaches an area under the receiver operating characteristic curve (AUROC) of 0.9126 on the integrated set and is statistically indistinguishable from the strongest available tool (Auto-Kla, DeLong p = 0.74) while significantly exceeding a recent ProtBert-based method (PCBert-Kla, p = 4 × 10⁻¹⁵). Two elements support this result. First, to train and benchmark the model, we assembled and released the largest curated Kla dataset to date, Multi (26,034 samples compiled from nine published sources through a 9-step quality-control pipeline), as a community resource. Second, we validated the model under a leakage-controlled protocol: re-training the complete pipeline under protein-level, 40%-identity homology, and leave-one-study-out splits (each verified to have zero train–test overlap) maintained ≈0.90 AUROC, only 0.6–1.5 percentage points (pp) below the random-split value, confirming genuine generalization rather than memorization. Ablation and SHapley Additive exPlanations (SHAP) analyses locate the predictive signal primarily in the ProteinBERT metafeature, with the ML layer adding a modest but real increment (+0.63 pp over PBertKla alone on Multi; no significant gain on the smaller hepatocellular carcinoma (HCC) set).
Finally, an exploratory AlphaFold-based structural case study of FAM210A illustrates how predicted Kla sites distribute across ordered and disordered regions, without claiming a quantitative structure–probability relationship. All trained weights and code are publicly available.
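The leakage-controlled protocol described above rests on one idea: all samples sharing a group key (a protein, a homology cluster, or a source study) must fall on the same side of the split, and the absence of overlap should be checked explicitly. The following is a minimal sketch of the protein-level case only, not the authors' pipeline. The function names `protein_level_split` and `protein_overlap`, the `(protein_id, position, label)` tuple layout, the toy identifiers, and the 20% test fraction are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def protein_level_split(samples, test_fraction=0.2, seed=0):
    """Split (protein_id, position, label) tuples so that every protein
    contributes samples to exactly one of the two sets.

    Hypothetical helper: the sample layout and the default test
    fraction are illustrative assumptions, not values from the paper.
    """
    # Group candidate lysine sites by their parent protein.
    by_protein = defaultdict(list)
    for sample in samples:
        by_protein[sample[0]].append(sample)

    # Shuffle whole proteins, not individual sites, with a fixed seed.
    proteins = sorted(by_protein)
    random.Random(seed).shuffle(proteins)

    # Reserve at least one protein for the test set.
    n_test = max(1, round(len(proteins) * test_fraction))
    test_ids, train_ids = proteins[:n_test], proteins[n_test:]

    train = [s for p in train_ids for s in by_protein[p]]
    test = [s for p in test_ids for s in by_protein[p]]
    return train, test


def protein_overlap(train, test):
    """Return the set of protein IDs that occur in both sets."""
    return {s[0] for s in train} & {s[0] for s in test}


# Toy data: 10 made-up proteins with 3 candidate lysine positions each.
samples = [
    (f"P{i:03d}", pos, (i + pos) % 2)
    for i in range(10)
    for pos in (15, 42, 88)
]

train, test = protein_level_split(samples)

# Explicit leakage check: no protein may appear on both sides.
assert not protein_overlap(train, test)
```

The same grouping logic would apply to the other two split types by swapping the group key, for example a homology-cluster identifier produced by an external clustering tool, or a source-study label for leave-one-study-out evaluation. Those identifiers would have to come from the dataset itself and are not reproduced here.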