Understanding Code Similarity across Instruction Set Architectures: An Empirical Study
Haonan Yu, Jiaxin Zhu, Yingying Zheng, Yuwei Zhang, Wei Wang, Jun Wei, Tao Huang

Software interacts with hardware through Instruction Set Architectures (ISAs), such as x86, ARM, and RISC-V. Although many developers may be unaware of ISA heterogeneity, ISA-specific code is pervasive in the foundational software systems that underpin the digital infrastructure of human society. Foundational software projects that support multiple ISAs commonly maintain a separate implementation for each ISA, which may introduce substantial additional effort. Meanwhile, these separate ISA-specific implementations frequently exhibit code similarities across ISAs. While prior code similarity research has largely focused on general-purpose clones or cross-language settings, similarity in ISA-specific implementations remains underexplored. To understand ISA-specific code and its similarities, and to gain insights for better management, we conducted an empirical study of 20 open-source foundational projects that support multiple ISAs. We confirmed the need for separate ISA-specific implementations by identifying the roles and characteristics of large-scale ISA-specific code, with assistance from large language models (LLMs). Our analysis of the ISA-specific code revealed a weighted average similarity of 21.7% across ISAs. We also observed cross-ISA co-change and cross-ISA participation patterns in the development and maintenance of ISA-specific code. By centering on ISA-specific implementations rather than general-purpose clones, this study provides a dedicated empirical characterization of a practically important but underexplored code-similarity setting, yielding evidence that can inform both researchers and practitioners working on ISA-related software engineering.