Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading
Hanzhen Lu, Lishui Fan, Jiachi Chen, Qiuyuan Chen, Zhao Wei, Zhongxin Liu

Line-level code completion aims to complete the current line in real time as developers type. Low latency is crucial to maintaining a seamless and uninterrupted coding experience, enabling developers to remain in a productive flow. However, existing approaches face a fundamental trade-off: large language models (LLMs) provide high-quality suggestions but require expensive computational resources to ensure acceptable inference latency. In contrast, static-analysis-based methods and small language models respond quickly but often generate suboptimal completions. To fill this gap, our idea is to rely on the small model by default and escalate to the large model only when necessary, achieving a better latency-accuracy trade-off. Based on this idea, we propose MCCom (Model-Cascading-based code Completion), a framework that cascades a local small model with a high-performance cloud large model for code completion. Realizing effective model cascading requires answering two non-trivial questions: when to invoke the large model, and how to enable effective collaboration between the small and large models. For the first question, we leverage a valuable but easily overlooked signal, namely user actions during code completion, to accurately identify failed completions. This deferral decision allows us to invoke the large model only when necessary, reducing both latency and cloud-side computation costs. For the second question, MCCom employs a two-stage speculative decoding strategy and an iterative retrieval mechanism that together accelerate completions and improve their quality. Because high-quality small models for code completion are lacking, we also train a lightweight model with only 121M parameters to implement MCCom. The small model achieves an average of 73.8% of the performance of the state-of-the-art 7B model.
We evaluate MCCom on the RepoEval benchmark and a new benchmark, StmtEval, collected from real-world projects. Experimental results show that our approach not only reduces inference latency by up to 47.9% and cuts down LLM usage by an average of 46.3%, but also improves the exact match rate of the large model by an average of 8.9%.
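
The deferral idea described above (answer with the small model first, and call the large model only when a user action suggests the suggestion failed) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `UserAction`, `CascadeResult`, `cascade_complete`, and `observe_user` are hypothetical, the two-valued action signal is a simplification, and passing the rejected draft to the large model is only a placeholder for the speculative decoding and iterative retrieval that the abstract mentions but does not detail.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class UserAction(Enum):
    """Hypothetical user signals observed after a suggestion is shown."""
    ACCEPT = "accept"            # the user took the suggestion
    KEEP_TYPING = "keep_typing"  # the user ignored it and typed on


@dataclass
class CascadeResult:
    completion: str
    used_large_model: bool


def cascade_complete(
    prefix: str,
    small_model: Callable[[str], str],
    large_model: Callable[[str, str], str],
    observe_user: Callable[[str], UserAction],
) -> CascadeResult:
    """Serve the local small model first; escalate only on a failure signal."""
    draft = small_model(prefix)
    if observe_user(draft) is UserAction.ACCEPT:
        # Fast path: no cloud call is made.
        return CascadeResult(draft, used_large_model=False)
    # Assumption: any non-accepting action marks the draft as failed.
    # The rejected draft is handed over as a hint (placeholder for the
    # collaboration mechanisms, which are not modeled here).
    return CascadeResult(large_model(prefix, draft), used_large_model=True)
```

With stub callables standing in for the two models, an accepted draft returns without touching `large_model`, while a rejected one is routed to it. Under this routing, the share of requests reaching the cloud equals the share of drafts the user does not accept, which is the quantity the deferral decision aims to keep small.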