DOI: 10.1145/3821433 ISSN: 1049-331X
Streamlining Repository Tasks with Effective Snippet Retrieval
Tangzhi Xu, Cong Li, Zhaogui Xu, Yanyan Jiang, Yuan Yao, Xiaorui Zhu, Feng Xu, Peng Di, Chang Xu, Zhendong Su
Repository-level software engineering tasks are increasingly automated using repo-level retrieval-augmented generation (RLRAG), where a retriever selects relevant snippets from a repository to assist a language model (LM) in completing tasks. However, existing retrievers often lack effective designs to support LMs of varying capacities. To bridge this gap, we introduce RepoET, a novel retriever for RLRAG. RepoET organizes LMs and tools into an agentic workflow that closely mimics human search logic: “search-files \(\to\) filter-files \(\to\) search-snippets \(\to\) filter-snippets”. In our evaluation, RepoET outperformed state-of-the-art retrievers by accurately retrieving more relevant snippets in a better order across two widely used datasets, SWE-bench Lite and RepoQA, utilizing four LMs of different capabilities and sizes (as small as 3B). RepoET improved recall by over 12% and precision by over 20%, yielding a 21% improvement in Acc@k, along with superior snippet ordering. These retrieval improvements led to significant gains in downstream tasks: (1) we resolved 136 issues (45.33%) in SWE-bench Lite with GPT-4o without any additional information, and further improved the success rate by \(\sim\)12% with multiple attempts; (2) we improved the accuracy of searching for designated functions by over 23% in RepoQA.
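
The four-stage workflow named above can be illustrated with a minimal sketch. This is not RepoET's implementation: the abstract states that the stages are carried out by LMs and tools, whereas the sketch below substitutes simple keyword overlap for every stage so that it runs on its own. All names here (`Repo`, `tokens`, `search_files`, `filter_files`, `search_snippets`, `filter_snippets`, `retrieve`, and the `keep` parameters) are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, List, Set, Tuple

Repo = Dict[str, str]  # hypothetical representation: file path -> file contents


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; a crude stand-in for LM understanding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_files(repo: Repo, query: str) -> List[str]:
    """Stage 1 (search-files): collect files sharing any token with the query."""
    q = tokens(query)
    return [path for path, src in repo.items() if q & tokens(path + " " + src)]


def filter_files(repo: Repo, query: str, paths: List[str], keep: int = 2) -> List[str]:
    """Stage 2 (filter-files): rank candidates by token overlap, keep the top few."""
    q = tokens(query)
    ranked = sorted(
        paths,
        key=lambda p: len(q & tokens(p + " " + repo[p])),
        reverse=True,
    )
    return ranked[:keep]


def search_snippets(repo: Repo, paths: List[str]) -> List[Tuple[str, str]]:
    """Stage 3 (search-snippets): split kept files into blank-line-separated chunks."""
    snippets = []
    for path in paths:
        for chunk in repo[path].split("\n\n"):
            if chunk.strip():
                snippets.append((path, chunk.strip()))
    return snippets


def filter_snippets(
    query: str, snippets: List[Tuple[str, str]], keep: int = 3
) -> List[Tuple[str, str]]:
    """Stage 4 (filter-snippets): drop non-matching chunks, order the rest by overlap."""
    q = tokens(query)
    scored = [(len(q & tokens(text)), path, text) for path, text in snippets]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)  # stable sort keeps ties in order
    return [(path, text) for _, path, text in scored[:keep]]


def retrieve(repo: Repo, query: str) -> List[Tuple[str, str]]:
    """Chain the four stages: search-files, filter-files, search-snippets, filter-snippets."""
    files = search_files(repo, query)
    files = filter_files(repo, query, files)
    snippets = search_snippets(repo, files)
    return filter_snippets(query, snippets)
```

The point of the sketch is only the coarse-to-fine shape: each stage narrows the candidate set before the next, finer-grained stage runs, and the final stage also determines the order in which snippets are returned.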