DOI: 10.1145/3820884 ISSN: 1049-331X

AL-Bench: A Benchmark for Automatic Logging

Boyin Tan, Junjielong Xu, Zhouruixing Zhu, Pinjia He

Logging, the practice of inserting log statements into source code, is critical for improving software reliability. Recently, language-model-based techniques have been developed to automate log statement generation based on input code. While these tools show promising results in prior studies, their results cannot be compared fairly because each study uses its own ad hoc dataset. In addition, existing evaluation approaches rely solely on code similarity metrics and therefore fail to capture how code differences affect runtime logging behavior: even minor code modifications can make programs fail to compile and introduce substantial discrepancies in log output semantics. To enhance the consistency and reproducibility of logging evaluation, we introduce AL-Bench, a comprehensive benchmark designed specifically for automatic logging tools. AL-Bench includes a large-scale, high-quality, and diverse dataset collected from 10 widely recognized projects with varying logging requirements. Moreover, it introduces a novel dynamic evaluation methodology that provides a runtime perspective on logging quality in addition to the traditional static evaluation at the source-code level. Specifically, AL-Bench evaluates not only the similarity between the oracle and predicted log statements in source code, but also the difference between the runtime log files produced by those statements. AL-Bench reveals significant limitations in existing static evaluation: all evaluated logging tools show average accuracy drops of 37.49%, 23.43%, and 15.80% in predicting log position, level, and message, respectively, compared to their reported results. Furthermore, with dynamic evaluation, AL-Bench reveals that 20.1%–83.6% of the generated log statements cause compilation failures. Moreover, the best-performing tool achieves only 21.32% cosine similarity between the runtime log files produced by the oracle and generated log statements. These results underscore substantial opportunities to advance the development of automatic logging tools. We believe this work establishes a foundation for future research on automatic logging.
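
To make the runtime comparison concrete, the following is a minimal sketch of how cosine similarity between two log files could be computed. The abstract does not say how AL-Bench vectorizes log files, so this sketch assumes plain term-frequency vectors over whitespace-separated tokens; the function name `log_cosine_similarity` and the sample log lines are hypothetical and are not taken from the benchmark.

```python
import math
from collections import Counter


def log_cosine_similarity(oracle_log: str, generated_log: str) -> float:
    """Cosine similarity between two log texts.

    Assumption: each log is represented as a term-frequency vector over
    whitespace-separated tokens. AL-Bench may use a different representation.
    """
    oracle_tf = Counter(oracle_log.split())
    generated_tf = Counter(generated_log.split())

    # Dot product over the tokens that the two logs share.
    shared = oracle_tf.keys() & generated_tf.keys()
    dot = sum(oracle_tf[tok] * generated_tf[tok] for tok in shared)

    norm_oracle = math.sqrt(sum(c * c for c in oracle_tf.values()))
    norm_generated = math.sqrt(sum(c * c for c in generated_tf.values()))

    # An empty log has no direction; treat its similarity as 0.
    if norm_oracle == 0.0 or norm_generated == 0.0:
        return 0.0
    return dot / (norm_oracle * norm_generated)


# Hypothetical runtime output from the oracle and from a generated log statement.
oracle = "INFO connection opened host=db1\nWARN retrying request id=7"
generated = "INFO connection opened host=db1\nDEBUG request sent id=7"
score = log_cosine_similarity(oracle, generated)
```

Under this assumption, identical log files score 1 (up to floating-point rounding), logs with no tokens in common score 0, and a generated statement that fails to compile and therefore yields an empty log file also scores 0.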
