DOI: 10.1145/3797071 ISSN: 2994-970X

Mining Long Tail Bugs: Identifying Rare and Overlooked Issues in Code

Wentao Liang, Yanjun Wu, Xiang Ling, Tianyue Luo, Dinghao Liu, Haotian Zhang, Jingzheng Wu

Using data mining to extract frequent code patterns for bug detection has proven effective. However, prior studies have overlooked the prevalence of infrequent (rare) patterns, even though violations of such patterns can also lead to bugs.

In this paper, we present LTMiner, which mines rare patterns from large-scale projects and detects potential bugs by checking for violations of these patterns. In practice, rare patterns far outnumber frequent ones and lack strong statistical support. Consequently, we face a pattern explosion, and many rare patterns and their violations are uninteresting. LTMiner addresses this by using instance-based ranking and filtering to prioritize violations of rare patterns. It further employs a large language model (LLM) as a domain expert to audit top-ranked violations; mined information supports in-context learning, and task decomposition and self-reflection mitigate possible hallucinations. This pipeline effectively curbs pattern explosion and false positives, uncovering previously unknown bugs in large-scale projects at an acceptable cost.
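To make the idea of mining rare patterns and ranking their violations concrete, the following is a minimal sketch and not LTMiner's actual algorithm. It assumes each function has already been abstracted into a set of facts (for example, the calls it makes), treats a pattern as an implication "if a occurs, b occurs too", and ranks violations by a simple confidence score. The function `mine_rare_violations`, its thresholds, and the fact names `dev_get` and `dev_put` are all hypothetical, and the LLM audit stage is not modeled.

```python
from collections import Counter
from itertools import permutations


def mine_rare_violations(functions, min_support=2, max_support=5,
                         min_confidence=0.6):
    """Toy rare-pattern miner (illustrative assumption, not LTMiner itself).

    functions: dict mapping a function name to a set of abstract facts.
    A pattern (a, b) means "functions containing a also contain b".
    Returns violations as (confidence, support, function, a, b), best first.
    """
    occurs = Counter()    # number of functions containing fact a
    together = Counter()  # number of functions containing both a and b
    for facts in functions.values():
        for a in facts:
            occurs[a] += 1
        for a, b in permutations(facts, 2):
            together[(a, b)] += 1

    findings = []
    for (a, b), support in together.items():
        # Keep only rare patterns: low, but non-trivial, support.
        if not (min_support <= support <= max_support):
            continue
        confidence = support / occurs[a]
        # Skip weak patterns and patterns that nothing violates.
        if confidence < min_confidence or confidence == 1.0:
            continue
        for name, facts in functions.items():
            if a in facts and b not in facts:
                findings.append((confidence, support, name, a, b))

    # Rank: higher confidence first, then higher support, then name.
    findings.sort(key=lambda f: (-f[0], -f[1], f[2]))
    return findings


# Hypothetical input: three functions pair dev_get with dev_put, one does not.
functions = {
    "f1": {"dev_get", "dev_put"},
    "f2": {"dev_get", "dev_put"},
    "f3": {"dev_get", "dev_put"},
    "f4": {"dev_get"},
}
findings = mine_rare_violations(functions)
```

Here the pattern "dev_get implies dev_put" has a support of only 3, so a frequency-oriented miner with a high support threshold would discard it, yet `f4` is reported as its single violation. In a real code base, far more such low-support patterns survive the thresholds, which is the pattern explosion that the ranking, filtering, and auditing stages are meant to contain.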

Applied to Linux kernel 6.12.1, LTMiner identified 42 previously unknown bugs, 27 of which have been confirmed by developers. These results indicate that, although rare-pattern bugs are sparse, a considerable number remain and exhibit a non-negligible long tail. We believe that rare-pattern bugs constitute a promising blue ocean for bug detection.
