Towards Automated Smart Contract Generation: Evaluation, Benchmarking, and Retrieval-Augmented Repair
Zaoyu Chen, Haoran Qin, Nuo Chen, Xiangyu Zhao, Lei Xue, Xiapu Luo, Xiao-Ming Wu
Smart contracts, predominantly written in Solidity and executed on blockchains such as Ethereum, are immutable, making functional correctness paramount: once deployed, bugs and vulnerabilities become permanent. Despite rapid progress in transformer-based code LLMs, existing evaluations of Solidity code completion rely heavily on surface-form metrics (e.g., BLEU, CrystalBLEU) or hand-grading, which correlate poorly with functional correctness. Unlike Python, Solidity lacks large-scale, execution-based benchmarks, hindering systematic assessment and optimization of LLMs for smart contract development.
To bridge this research gap, we introduce SolBench, a comprehensive benchmark and automated testing pipeline for Solidity, designed to emphasize functional correctness via differential fuzzing. SolBench contains 28,825 functions from 7,604 contracts collected from Etherscan (genesis to 2024), spanning 10 popular domains. We benchmark 14 diverse LLMs (open and closed, 1.3B to 671B parameters, general-purpose and code-specific, with and without reasoning). The dominant failure mode is that models lack crucial details from the intra-contract context (e.g., type definitions, state variables). Providing full-contract context mitigates this and improves code completion accuracy.
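To illustrate the idea behind differential fuzzing, the following is a minimal Python sketch, not the SolBench pipeline itself. It assumes that plain Python callables stand in for compiled Solidity functions and that a raised exception stands in for an EVM revert. The names `differential_fuzz`, `safe_add`, `safe_add_variant`, and `wrapping_add` are hypothetical and chosen only for this example. A candidate is treated as functionally correct when it agrees with the reference on every sampled input, including whether the call reverts.

```python
import random

UINT256_MAX = 2**256 - 1


def run(fn, args):
    """Execute fn and record either its return value or the fact that it reverted."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:  # an exception models a Solidity revert (assumption)
        return ("revert", type(exc).__name__)


def sample_uint256(rng):
    """Draw a uint256, biased towards boundary values that often expose bugs."""
    edge_cases = [0, 1, UINT256_MAX - 1, UINT256_MAX]
    if rng.random() < 0.3:
        return rng.choice(edge_cases)
    return rng.randint(0, UINT256_MAX)


def differential_fuzz(reference, candidate, arity, trials=1000, seed=0):
    """Return the first input on which the two functions disagree, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = tuple(sample_uint256(rng) for _ in range(arity))
        if run(reference, args) != run(candidate, args):
            return args
    return None


def safe_add(a, b):
    """Reference: checked addition that reverts on overflow."""
    total = a + b
    if total > UINT256_MAX:
        raise OverflowError("addition overflow")
    return total


def safe_add_variant(a, b):
    """Candidate with different surface form but identical behavior."""
    if a > UINT256_MAX - b:
        raise OverflowError("addition overflow")
    return a + b


def wrapping_add(a, b):
    """Buggy candidate: silently wraps around instead of reverting."""
    return (a + b) % (2**256)
```

Here `differential_fuzz(safe_add, safe_add_variant, arity=2)` finds no disagreement even though the two bodies share little surface text, while `differential_fuzz(safe_add, wrapping_add, arity=2)` returns a counterexample input. This is the kind of distinction that surface-form metrics cannot make.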
However, full-context inference can be prohibitively expensive in practice. Generating outputs with large context windows using state-of-the-art models often incurs significant costs, rendering naive context scaling economically impractical. Crucially, most of a contract is irrelevant to implementing a given function; only a small subset of details is needed. To exploit this, we propose Retrieval-Augmented Repair (RAR), which integrates retrieval into code repair: it uses the executor's error messages to extract only the most relevant snippets from the full contract. RAR sharply reduces input length for function completion, improving accuracy while significantly cutting computational cost. We further analyze retrieval and code repair strategies within RAR, showing substantial improvements in accuracy and efficiency. SolBench and our RAR framework enable principled evaluation and cost-effective improvement of Solidity code generation. Dataset and code are available at https://github.com/ZaoyuChen/SolBench.
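The retrieval step can be pictured with a small Python sketch. This is a simplified stand-in under stated assumptions, not the RAR implementation: the error message is a hypothetical, simplified string rather than real executor output, the retriever ranks contract lines by plain identifier overlap with that message, and the names `retrieve_snippets` and `build_repair_prompt` are invented for illustration.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokens(text):
    """Split text into the set of identifier-like tokens it contains."""
    return set(IDENTIFIER.findall(text))


def retrieve_snippets(contract_source, error_message, top_k=2):
    """Rank contract lines by identifier overlap with the error message."""
    query = tokens(error_message)
    scored = []
    for line_no, line in enumerate(contract_source.splitlines()):
        stripped = line.strip()
        overlap = len(query & tokens(stripped))
        if overlap:
            scored.append((overlap, line_no, stripped))
    # Highest overlap first; ties keep source order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:top_k]]


def build_repair_prompt(failed_code, error_message, snippets):
    """Assemble a compact repair prompt from the retrieved context only."""
    context = "\n".join(snippets)
    return (
        "Relevant contract context:\n" + context + "\n\n"
        "Previous attempt:\n" + failed_code + "\n\n"
        "Error:\n" + error_message + "\n\n"
        "Rewrite the function so that it compiles and passes the tests."
    )


# Hypothetical contract and error message, used only for this illustration.
CONTRACT = """contract Vault {
    address public owner;
    uint256 public feeRate;
    mapping(address => uint256) public balances;
    event Deposited(address indexed user, uint256 amount);
}"""
FAILED_CODE = "function setFee(uint256 r) external { feeRate = r; }"
ERROR = 'DeclarationError: Undeclared identifier "feeRate" in function setFee'

SNIPPETS = retrieve_snippets(CONTRACT, ERROR)
PROMPT = build_repair_prompt(FAILED_CODE, ERROR, SNIPPETS)
```

In this toy case only the `feeRate` declaration is retrieved, so the repair prompt carries one line of contract context instead of the whole contract. A practical retriever would likely need a stronger similarity measure than token overlap.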