DOI: 10.1145/3821433 ISSN: 1049-331X

Streamlining Repository Tasks with Effective Snippet Retrieval

Tangzhi Xu, Cong Li, Zhaogui Xu, Yanyan Jiang, Yuan Yao, Xiaorui Zhu, Feng Xu, Peng Di, Chang Xu, Zhendong Su

Repository-level software engineering tasks are increasingly automated using repo-level retrieval-augmented generation (RLRAG), where a retriever selects relevant snippets from a repository to assist a language model (LM) in completing tasks. However, existing retrievers often lack effective designs to support LMs of varying capacities. To bridge this gap, we introduce RepoET, a novel retriever for RLRAG. RepoET organizes LMs and tools into an agentic workflow that closely mimics human search logic: “search-files \(\to\) filter-files \(\to\) search-snippets \(\to\) filter-snippets”. In our evaluation, RepoET outperformed state-of-the-art retrievers by accurately retrieving more relevant snippets in a better order across two widely used datasets, SWE-bench Lite and RepoQA, using four LMs of different capabilities and sizes (as small as 3B). RepoET improved recall by over 12% and precision by over 20%, yielding a 21% improvement in Acc@k, along with superior snippet ordering. These retrieval improvements led to significant gains in downstream tasks: (1) we resolved 136 issues (45.33%) in SWE-bench Lite with GPT-4o without any additional information, and further improved the success rate by \(\sim\)12% with multiple attempts; (2) we improved the accuracy of searching for designated functions by over 23% in RepoQA.
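The four-stage search logic named in the abstract can be illustrated with a minimal sketch. This is not RepoET's implementation: the abstract does not describe the tools, prompts, or ranking used, so every function name, the `Snippet` type, the in-memory `repo` dictionary, and the `keep` cut-offs below are hypothetical. In particular, the two filter stages, which an LM would presumably perform, are replaced here by simple keyword-overlap heuristics so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Snippet:
    path: str  # file the snippet came from
    text: str  # snippet body


def _terms(query: str) -> List[str]:
    return query.lower().split()


def search_files(repo: Dict[str, str], query: str) -> List[str]:
    """Stage 1 (assumed): recall-oriented search, keep files mentioning any query term."""
    terms = _terms(query)
    return [p for p, src in repo.items() if any(t in src.lower() for t in terms)]


def filter_files(repo: Dict[str, str], paths: List[str], query: str, keep: int = 2) -> List[str]:
    """Stage 2 (assumed): stand-in for an LM judgment, rank files by distinct terms matched."""
    terms = _terms(query)
    return sorted(paths, key=lambda p: sum(t in repo[p].lower() for t in terms), reverse=True)[:keep]


def search_snippets(repo: Dict[str, str], paths: List[str], query: str) -> List[Snippet]:
    """Stage 3 (assumed): split surviving files on blank lines, keep blocks mentioning a term."""
    terms = _terms(query)
    found = []
    for p in paths:
        for block in repo[p].split("\n\n"):
            if block.strip() and any(t in block.lower() for t in terms):
                found.append(Snippet(p, block.strip()))
    return found


def filter_snippets(snippets: List[Snippet], query: str, keep: int = 3) -> List[Snippet]:
    """Stage 4 (assumed): stand-in for an LM judgment, order snippets by term occurrences."""
    terms = _terms(query)
    return sorted(snippets, key=lambda s: sum(s.text.lower().count(t) for t in terms), reverse=True)[:keep]


def retrieve(repo: Dict[str, str], query: str) -> List[Snippet]:
    """Compose the stages: search-files -> filter-files -> search-snippets -> filter-snippets."""
    files = filter_files(repo, search_files(repo, query), query)
    return filter_snippets(search_snippets(repo, files, query), query)
```

Calling `retrieve(repo, "parse_config")` on a dictionary that maps file paths to source text returns an ordered list of `Snippet` objects. The point of the staged design, as sketched here, is that each search stage is broad and cheap while each filter stage narrows and orders the candidates, so the final list handed to the downstream LM is short and ranked.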
