DOI: 10.1145/3808198 ISSN: 2994-970X

One Size Does Fit All: Exploring Model Fusion for Software Engineering Tasks

Yinggang Qiu, Yihao Qin, Mingyang Geng, Shangwen Wang, Dezun Dong

Large language models (LLMs) have achieved remarkable performance in software engineering (SE), and fine-tuning LLMs for specific SE tasks has gradually become a new paradigm. However, storing fine-tuned checkpoints for multiple tasks incurs heavy storage costs and deployment complexity. Model fusion, which operates on fine-tuned parameters, offers excellent parameter compression and scalability, yet its effectiveness in the SE domain remains underexplored. Investigating it is essential for guiding the development of fusion techniques customized for the SE domain. To bridge this gap, we conduct a systematic study of model fusion in SE contexts and reveal the following major findings: (1) when fusing programming languages (PLs) within the same task, model fusion usually works well and can enhance the performance of PLs with less data when the PLs share similar features; (2) when fusing SE tasks of the same category within the same PL, all methods except TALL-Masks generally suffer substantial performance degradation on specific tasks; (3) when fusing SE tasks of different categories across different PLs, all existing model fusion methods exhibit significant performance degradation on certain tasks. In our evaluation, TALL-Masks, which introduces a mask for each task to extract the most relevant dimensions from the fused parameters, achieves promising performance. However, during parameter fusion, weak features (i.e., small variations in fine-tuned parameters) are easily overshadowed by strong ones (i.e., large variations in fine-tuned parameters), causing the constructed masks to fail to extract the most relevant parameters. To overcome this limitation, we propose an improved version of TALL-Masks, called Scaling-Masks. The key idea is to amplify weak features to prevent them from being overshadowed by strong ones, which is achieved by scaling the value range of weak features to match that of strong features.
Experimental results demonstrate that Scaling-Masks can significantly improve fusion performance for tasks with extremely weak features without affecting other tasks, with normalized accuracy improved by 63.49% for vulnerability detection when fusing SE tasks of different categories and 24.02% for PHP when fusing PLs in the code repair task.
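
The scaling idea described above can be illustrated with a small sketch. This is not the paper's implementation: the function names, the summation-based fusion, the max-absolute-value definition of "value range", the mask rule (keep a coordinate when the task's own change is at least `lam` times the change contributed by the other tasks), and the toy numbers are all assumptions chosen only to show how a weak task vector can be overshadowed and how rescaling may help.

```python
from typing import Dict, List

Vector = List[float]


def task_vector(finetuned: Vector, pretrained: Vector) -> Vector:
    """Parameter change introduced by fine-tuning (assumed definition)."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def scale_factors(task_vectors: Dict[str, Vector]) -> Dict[str, float]:
    """Per-task factor that stretches each vector's max |value| to the largest one."""
    ranges = {name: max(abs(v) for v in tv) for name, tv in task_vectors.items()}
    target = max(ranges.values())
    return {name: (target / r if r > 0 else 1.0) for name, r in ranges.items()}


def fuse(task_vectors: Dict[str, Vector], factors: Dict[str, float]) -> Vector:
    """Fuse by summing the (scaled) task vectors coordinate-wise."""
    dim = len(next(iter(task_vectors.values())))
    merged = [0.0] * dim
    for name, tv in task_vectors.items():
        for i, v in enumerate(tv):
            merged[i] += factors[name] * v
    return merged


def build_mask(scaled_tv: Vector, merged: Vector, lam: float = 1.0) -> List[bool]:
    """Assumed mask rule: keep coordinates where this task dominates the rest."""
    return [abs(t) >= lam * abs(m - t) for t, m in zip(scaled_tv, merged)]


# Hypothetical toy task vectors: one strong task, one weak task.
vectors = {
    "strong": [2.0, 0.5, -1.0, 0.1],
    "weak": [0.01, 0.02, 0.0, -0.01],
}

# Without scaling, every factor is 1.0 and the weak task is overshadowed.
no_scaling = {name: 1.0 for name in vectors}
merged_plain = fuse(vectors, no_scaling)
weak_mask_plain = build_mask(vectors["weak"], merged_plain)

# With scaling, the weak vector is stretched to the strong vector's range.
factors = scale_factors(vectors)
merged_scaled = fuse(vectors, factors)
weak_scaled = [factors["weak"] * v for v in vectors["weak"]]
weak_mask_scaled = build_mask(weak_scaled, merged_scaled)

print("weak mask without scaling:", weak_mask_plain)
print("weak mask with scaling:   ", weak_mask_scaled)
```

In this toy setting the unscaled mask for the weak task selects no coordinates, while the scaled mask selects the two coordinates where the weak task changed most relative to the strong one. How the selected parameters are mapped back to the weak task's original scale at inference time is left out here, since the abstract does not specify it.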
