DOI: 10.1145/3808104 ISSN: 2994-970X

SmartCoder-R1: Towards Secure and Explainable Smart Contract Generation with Security-Aware Group Relative Policy Optimization

Lei Yu, Jingyuan Zhang, Xin Wang, Li Yang, Fengjun Zhang, Jiajia Ma

Smart contracts automate the management of high-value assets, where vulnerabilities can lead to catastrophic financial losses. When Large Language Models (LLMs) are used to generate smart contracts automatically, this risk is amplified by two interconnected failures: first, the models operate as unauditable "black boxes" that produce no transparent reasoning process, and second, as a consequence, they generate code riddled with critical security vulnerabilities. To address both issues, we propose SmartCoder-R1, a novel framework for secure and explainable smart contract generation built on Qwen2.5-Coder-7B. It begins with Continual Pre-training (CPT) to specialize the base model on the nuances of smart contract code. To construct the data for subsequent stages, we first prompt the DeepSeek model to generate reasoning-and-code samples from verified on-chain contracts; each sample is then manually reviewed by security experts for compilability, functionality, security, and reasoning completeness. We then apply Long Chain-of-Thought Supervised Fine-Tuning (L-CoT SFT) on 7,998 of these expert-validated samples to train the model to emulate human security analysis. Finally, to directly mitigate vulnerabilities, we employ Security-Aware Group Relative Policy Optimization (S-GRPO), a reinforcement learning phase that refines the generation policy on 1,691 samples by optimizing a weighted reward signal for compilation success, security compliance, and format correctness. Evaluated against 18 state-of-the-art baselines on a challenging benchmark of 756 real-world functions from 289 deployed contracts, SmartCoder-R1 establishes a new state of the art by achieving top performance across five key metrics: a ComPass of 87.70%, a VulRate of 8.60%, a SafeAval of 80.16%, a FuncRate of 53.84%, and a FullRate of 50.53%. This FullRate marks a 45.79% relative improvement over the strongest baseline, DeepSeek-R1. Crucially, its generated reasoning also excels in human evaluations, achieving high-quality ratings for Functionality (82.7%), Security (85.3%), and Clarity (90.7%).
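
The weighted reward and group-relative optimization described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the weights, the `Sample` fields, and the function names are hypothetical, and the advantage normalization follows the commonly described GRPO form (each reward minus the group mean, divided by the group standard deviation), which may differ in detail from S-GRPO.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Sample:
    """Outcome of checking one generated contract (hypothetical fields)."""
    compiles: bool        # did the code compile?
    secure: bool          # did it pass the security check?
    well_formatted: bool  # does the output follow the expected format?


# Hypothetical weights; the abstract does not state the actual values.
WEIGHTS = {"compile": 0.3, "security": 0.5, "format": 0.2}


def reward(sample: Sample) -> float:
    """Weighted sum of the three binary signals named in the abstract."""
    return (
        WEIGHTS["compile"] * float(sample.compiles)
        + WEIGHTS["security"] * float(sample.secure)
        + WEIGHTS["format"] * float(sample.well_formatted)
    )


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group.

    All rewards are assumed to come from completions of the same prompt.
    The population standard deviation is used here as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four completions sampled for the same prompt.
group = [
    Sample(compiles=True, secure=True, well_formatted=True),
    Sample(compiles=True, secure=False, well_formatted=True),
    Sample(compiles=False, secure=False, well_formatted=True),
    Sample(compiles=False, secure=False, well_formatted=False),
]
rewards = [reward(s) for s in group]
advantages = group_relative_advantages(rewards)
```

In this sketch a completion is pushed up or down only relative to its siblings in the same group, so a group in which every completion earns the same reward contributes zero advantage.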
