DOI: 10.1177/1088467x261456652 ISSN: 1088-467X

GRAND-HC: Graph-refined author name disambiguation via harmony contrastive learning

Yuanhao Sun, Zhouyang Jin, Yi Xu, Luoyi Fu, Jiaxin Ding, Xiaoying Gan, Xinbing Wang, Chenghu Zhou

From-Scratch Name Disambiguation (SND), a core Author Name Disambiguation (AND) task, aims to partition the papers that share an ambiguous author name into clusters, one per distinct real-world author. Existing SND methods suffer from two critical limitations. First, the paper distribution is inherently long-tailed: most papers belong to a few prolific authors, which severely biases representation learning, yields poorly discriminative embeddings, and causes tail authors to be over-merged. Second, cluster number estimation methods are unreliable and scale poorly to long sequences, which restricts real-world deployment. To address these issues, we propose GRAND-HC, an end-to-end SND framework with three components. We first construct a heterogeneous paper graph based on co-author, co-organization, and co-venue relations, and adopt a graph attention network as the backbone. Harmony contrastive learning (HCL) then dynamically reweights the loss to suppress overfitting to prolific authors and learn highly discriminative embeddings. On this basis, a graph-refined distance matrix (GRDM) leverages graph topology to optimize pairwise distances, preventing over-merging of tail authors. Meanwhile, a lightweight Paper Compression Module (PCM) estimates the cluster number accurately across varying scales, avoiding the long-sequence modeling problem. Finally, Hierarchical Agglomerative Clustering produces the final clusters from the optimized distance matrix and the estimated cluster number. Extensive experiments demonstrate that GRAND-HC outperforms state-of-the-art models on macro F1 score. Furthermore, GRAND-HC has been deployed in a billion-scale academic database.
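
To make the final clustering step concrete, the following is a minimal sketch of hierarchical agglomerative clustering over a precomputed pairwise distance matrix with a given target cluster number. It is not the authors' implementation: the function name `hac_average_linkage`, the choice of average linkage, and the toy distance matrix are all assumptions made for illustration, since the abstract does not specify the linkage criterion or how GRDM and PCM produce their outputs.

```python
def hac_average_linkage(dist, k):
    """Group n items into k clusters from a precomputed n x n distance matrix.

    Illustrative sketch only. Average linkage is an assumption; the abstract
    does not state which linkage criterion GRAND-HC uses.
    """
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")

    # Start with every paper in its own cluster.
    clusters = [[i] for i in range(n)]

    def avg_dist(a, b):
        # Mean pairwise distance between the members of clusters a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    # Repeatedly merge the two closest clusters until k remain.
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = avg_dist(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    # Convert the cluster membership lists into one label per paper.
    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical distances for five papers: 0-2 are mutually close, 3-4 are close.
toy_dist = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.15, 0.85, 0.9],
    [0.2, 0.15, 0.0, 0.8, 0.95],
    [0.9, 0.85, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.95, 0.1, 0.0],
]
labels = hac_average_linkage(toy_dist, k=2)
print(labels)
```

In the pipeline described above, `dist` would stand in for the graph-refined distance matrix and `k` for the estimated cluster number; this pure-Python version is cubic or worse in the number of papers and is meant only to show the control flow, not to scale.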
