OTalign: Optimal Transport Alignment for Remote Protein Homologs Using Protein Language Model Embeddings
Minsoo Kim, Hanjin Bae, Gyeongpil Jo, Kunwoo Kim, Jejoong Yoo, Keehyoung Joo
Abstract
Motivation
Protein sequence alignment is a crucial task in bioinformatics, yet aligning remote homologs with low sequence identity remains a longstanding challenge, particularly due to the difficulty of handling gaps. We introduce a new method that applies Optimal Transport (OT) theory to sequence alignment, providing a mathematically principled framework for modeling residue matches and gaps.
Results
OTalign formulates sequence alignment as an entropy-regularized unbalanced optimal transport (UOT) problem over embeddings derived from protein language models (PLMs). Unlike traditional methods, it introduces position-specific gap penalties that adapt to each sequence pair. On challenging remote-homolog benchmarks (SABmark, MALIDUP, MALISAM), OTalign consistently outperforms baselines (Needleman-Wunsch, HHalign) and recent PLM-based methods (PLMAlign, DeepBLAST), achieving F1 scores of 0.594 on SABmark Superfamily and 0.358 on SABmark Twilight. Furthermore, OTalign provides a quantitative and interpretable metric of how effectively PLM embeddings represent sequence similarity relationships. Finally, its differentiable nature enables end-to-end fine-tuning of PLMs, establishing a framework for learning embeddings explicitly optimized for alignment tasks.
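To make the formulation concrete, the following is a minimal sketch of a generic entropy-regularized unbalanced OT solver (Sinkhorn-style scaling iterations with KL-relaxed marginals) applied to per-residue embeddings. It is not the OTalign implementation: the cosine cost, the uniform marginals, the parameter names `eps` and `tau`, and the random arrays standing in for PLM embeddings are all illustrative assumptions, and the position-specific gap penalties described above are not modeled here.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cosine distance between two sets of residue embeddings.

    Assumption: cosine distance is used only for illustration; the cost
    actually used by OTalign is not stated in this abstract.
    """
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - xn @ yn.T


def unbalanced_sinkhorn(cost, a, b, eps=0.1, tau=1.0, n_iter=200):
    """Entropy-regularized unbalanced OT with KL-relaxed marginals.

    eps: entropic regularization strength (hypothetical default).
    tau: weight of the KL penalty on marginal violations (hypothetical default).
    Returns the transport plan, whose row and column sums need not equal a and b.
    """
    kernel = np.exp(-cost / eps)      # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    rho = tau / (tau + eps)           # damping exponent from the KL relaxation
    for _ in range(n_iter):
        u = (a / (kernel @ v)) ** rho
        v = (b / (kernel.T @ u)) ** rho
    return u[:, None] * kernel * v[None, :]


# Random stand-ins for PLM embeddings of two sequences (lengths 5 and 7).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(5, 16))
emb_b = rng.normal(size=(7, 16))

# Uniform mass per residue (an assumption).
a = np.full(5, 1.0 / 5)
b = np.full(7, 1.0 / 7)

plan = unbalanced_sinkhorn(cosine_cost(emb_a, emb_b), a, b)

# Difference between the prescribed mass and the mass actually transported
# from each residue of the first sequence.
mass_gap_a = a - plan.sum(axis=1)
```

Because the marginal constraints are only penalized rather than enforced, a residue with no good counterpart can transport little mass. Under the assumptions above, `mass_gap_a` is one way such residues could be flagged as gap candidates, which is the intuition behind using unbalanced rather than balanced OT for alignment.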
Availability and implementation
The code is available at https://github.com/DeepFoldProtein/OTalign.
Supplementary information
Supplementary data are available at Bioinformatics online.