DOI: 10.1145/3808119 ISSN: 2994-970X

Understanding Code Similarity across Instruction Set Architectures: An Empirical Study

Haonan Yu, Jiaxin Zhu, Yingying Zheng, Yuwei Zhang, Wei Wang, Jun Wei, Tao Huang

Software interacts with hardware through Instruction Set Architectures (ISAs), such as x86, ARM, and RISC-V. Although many developers may be unaware of ISA heterogeneity, ISA-specific code is pervasive in the foundational software systems that underpin the digital infrastructure of human society. Such projects commonly maintain separate implementations to support multiple ISAs, which may introduce substantial additional effort. Meanwhile, these separate ISA-specific implementations frequently exhibit code similarities across ISAs. While prior code similarity research has largely focused on general-purpose clones or cross-language settings, similarity in ISA-specific implementations remains underexplored. To understand ISA-specific code and its similarities, and to gain insights for better management, we conducted an empirical study of 20 open-source foundational projects that support multiple ISAs. We confirmed the need for separate ISA-specific implementations by identifying the roles and characteristics of large-scale ISA-specific code, with assistance from large language models (LLMs). Our analysis of the ISA-specific code revealed a weighted average similarity of 21.7% across ISAs. We also observed cross-ISA co-change and cross-ISA participation patterns in the development and maintenance of ISA-specific code. By centering on ISA-specific implementations rather than general-purpose clones, this study provides a dedicated empirical characterization of a practically important but underexplored code-similarity setting, yielding evidence that can inform both researchers and practitioners working on ISA-related software engineering.
