DOI: 10.1145/3715776 ISSN: 2994-970X

Clone Detection for Smart Contracts: How Far Are We?

Zuobin Wang, Zhiyuan Wan, Yujing Chen, Yun Zhang, David Lo, Difan Xie, Xiaohu Yang

In smart contract development, practitioners frequently reuse code to reduce development effort and avoid reinventing the wheel. This reused code, whether identical or similar to its original source, is referred to as a code clone. Unintentional code cloning can propagate flaws and vulnerabilities, potentially undermining the reliability and maintainability of software systems. Previous studies have identified a significant prevalence of code clones in Solidity smart contracts on the Ethereum blockchain. To mitigate the risks posed by code clones, clone detection has emerged as an active field of research and practice in software engineering. Recent studies have extended existing techniques or proposed novel techniques tailored to the unique syntactic and semantic features of Solidity. Nonetheless, the evaluations of existing techniques, whether conducted by their original authors or independent researchers, involve codebases in various programming languages and utilize different versions of the corresponding tools. The resulting inconsistency makes direct comparisons of the evaluation results impractical, and hinders the ability to derive meaningful conclusions across the evaluations. There remains a lack of clarity regarding the effectiveness of these techniques in detecting smart contract clones, and whether it is feasible to combine different techniques to achieve scalable yet accurate detection of code clones in smart contracts. To address this gap, we conduct a comprehensive empirical study that evaluates the effectiveness and scalability of five representative clone detection techniques on 33,073 verified Solidity smart contracts, along with a benchmark we curate, in which we manually label 72,010 pairs of Solidity smart contracts with clone tags. 
Moreover, we explore the potential of combining different techniques to achieve optimal performance of code clone detection for smart contracts, and propose SourceREClone, a framework designed for the refined integration of different techniques, which achieves a 36.9% improvement in F1 score compared to a straightforward combination of the state of the art. Based on our findings, we discuss implications, provide recommendations for practitioners, and outline directions for future research.
