DOI: 10.1145/3821423 ISSN: 1049-331X

Retrieval-Augmented Unit Test Suggestion Generation

Hongyan Li, Weifeng Sun, Meng Yan

Unit testing is a cornerstone of software quality assurance, yet writing high-quality tests remains labor-intensive and cognitively demanding, especially when developers must infer testing objectives directly from code. Recent advances in large language models (LLMs) have enabled automated unit test generation, but these methods take a focal method as input and generate executable test code without explicitly specifying what behavior is being tested or how its correctness should be assessed. This mismatch with real-world developer workflows introduces cognitive overhead, hampers maintainability, and reduces trust in generated tests. In this paper, we propose ReaTSG, a framework for generating natural language test suggestions that explicitly describe, from the developer's perspective, what behavior should be tested and how it can be validated. ReaTSG contains two key components. First, the Test Suggestion Oracle Construction module leverages LLM-based self-verification to automatically generate high-quality test suggestions for given focal methods. Second, the Retrieval-Augmented Suggestion Learning module couples a dense retriever and a salience-aware generator within a joint training strategy, enabling the retriever to select examples that are not only relevant but also beneficial for generation, and guiding the generator to incorporate TF-IDF-based salience weights into its attention mechanism to emphasize informationally important tokens. We construct the first large-scale dataset comprising 55,467 pairs of focal methods and their corresponding suggestions. Extensive experiments across diverse LLMs demonstrate that ReaTSG achieves the best overall performance. Relative to the fine-tuned baselines, it yields average improvements of 17.37% (C-BLEU), 13.15% (S-BLEU), 5.14% (METEOR), 8.21% (ROUGE-L), and 93.04% (CIDEr). Furthermore, we demonstrate that the test suggestions generated by ReaTSG can significantly improve the performance of multiple downstream testing tasks.
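The abstract does not say how the TF-IDF salience weights enter the generator's attention, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a smoothed IDF, per-token weights normalized to a maximum of 1, and an additive log-salience bias on the attention scores before the softmax. The function names `tfidf_salience` and `salience_attention` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_salience(tokens, corpus):
    """Return one TF-IDF weight per token in `tokens`, scaled so the max is 1.

    `corpus` is a list of tokenized documents used to estimate document
    frequencies. The smoothed IDF formula is an assumption of this sketch.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each token once per document

    term_freq = Counter(tokens)
    raw = []
    for tok in tokens:
        idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
        raw.append((term_freq[tok] / len(tokens)) * idf)

    peak = max(raw)
    return [w / peak for w in raw]


def salience_attention(scores, salience, eps=1e-6):
    """Softmax over attention scores biased by log-salience.

    Adding log(salience) to each score is equivalent to multiplying the
    unnormalized attention weight by the salience and renormalizing.
    """
    biased = [s + math.log(w + eps) for s, w in zip(scores, salience)]
    top = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - top) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Toy corpus of tokenized focal-method signatures (hypothetical data).
corpus = [
    ["public", "int", "add"],
    ["public", "void", "reset"],
    ["public", "int", "divide"],
]
tokens = ["public", "int", "divide"]

salience = tfidf_salience(tokens, corpus)
attention = salience_attention([0.0, 0.0, 0.0], salience)
```

With uniform raw scores, the resulting attention favors `divide` (rare in the corpus) over `int` and `public` (common), which is the intended effect of emphasizing informationally important tokens.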
