DOI: 10.1145/3820047 ISSN: 1539-9087

FaStrA: Fast and Efficient Network Structure-Aware Layer Fusion on Deep Learning Accelerators

Jungwoo Choi, Woohyun Joo, Albert Gafiyatullin, Sergei Karpukhin, Kihyoun Kwon, MinSeong Kim

With the growing demand for on-device AI, neural processing units (NPUs) have become essential components in mobile system-on-chips (SoCs). Layer fusion is a key technique that minimizes off-chip memory access by grouping multiple layers and forwarding intermediate data through on-chip scratchpad memory. As deep learning models continue to grow in size and complexity, efficient search algorithms are necessary to explore the vast combinatorial space of layer fusion optimizations. Existing methods often rely on localized heuristics to reduce this search space, which can lead to suboptimal solutions by failing to consider the global network structure. In this paper, we propose FaStrA, a novel layer fusion algorithm that effectively finds optimal fused layer groups by leveraging the structural information of neural networks. Instead of exhaustively exploring all layer combinations, FaStrA constructs the search space based on skip connection structures, leveraging the dependencies introduced by skip connections during fused layer group formation. Additionally, to minimize DRAM access at the boundaries between fused layer groups, the algorithm prioritizes layers with smaller output feature map sizes during the search process. To further accelerate the search, FaStrA terminates the tile-size search early once pipelined execution yields no additional performance gain. Experimental results on real-world models show that the proposed method achieves a 32.9% speedup and a 40.5% reduction in off-chip memory traffic on a commercial mobile NPU while maintaining acceptable compilation time.
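The three ideas in the abstract can be illustrated with a minimal sketch. It is not the FaStrA implementation: the abstract does not give the algorithm's details, so the code assumes a simple linear chain of layers, and every name in it (`candidate_cuts`, `rank_cuts`, `search_tile_size`, `latency_of`) is hypothetical. It shows (1) restricting fused-group boundaries to positions that do not sever a skip connection, (2) preferring boundaries where the output feature map is small, and (3) stopping a tile-size search as soon as a larger tile no longer improves latency.

```python
from typing import Callable, List, Sequence, Tuple


def candidate_cuts(num_layers: int, skips: Sequence[Tuple[int, int]]) -> List[int]:
    """Return cut positions i (a group boundary right after layer i).

    Assumption: layers form a chain 0..num_layers-1, and a skip (src, dst)
    means the output of layer src is consumed again by layer dst. A cut at i
    with src <= i < dst would separate src from dst, so it is excluded.
    """
    cuts = []
    for i in range(num_layers - 1):
        if not any(src <= i < dst for src, dst in skips):
            cuts.append(i)
    return cuts


def rank_cuts(cuts: Sequence[int], out_sizes: Sequence[int]) -> List[int]:
    """Order cuts so that layers with smaller output feature maps come first.

    A smaller output at the boundary means less data crosses between
    fused groups through off-chip memory.
    """
    return sorted(cuts, key=lambda i: (out_sizes[i], i))


def search_tile_size(
    tile_sizes: Sequence[int], latency_of: Callable[[int], float]
) -> Tuple[int, float]:
    """Try tile sizes in order and stop at the first one that does not improve.

    latency_of is a stand-in for a cost model or measurement (hypothetical).
    """
    best_tile, best_latency = -1, float("inf")
    for tile in tile_sizes:
        latency = latency_of(tile)
        if latency >= best_latency:
            break  # no additional gain: terminate the search early
        best_tile, best_latency = tile, latency
    return best_tile, best_latency


# Toy network: 6 layers, one skip connection from layer 1 to layer 3.
out_sizes = [64, 64, 32, 32, 16, 16]
cuts = candidate_cuts(6, [(1, 3)])
ranked = rank_cuts(cuts, out_sizes)

# Toy latency table: gains flatten out at tile size 8.
toy_latency = {1: 10.0, 2: 7.0, 4: 6.0, 8: 6.0, 16: 9.0}
evaluated: List[int] = []


def latency_of(tile: int) -> float:
    evaluated.append(tile)
    return toy_latency[tile]


best = search_tile_size([1, 2, 4, 8, 16], latency_of)
```

In this toy setup the cuts after layers 1 and 2 are excluded because they fall inside the skip span, and the tile search never evaluates size 16. A real compiler would replace the toy latency table with its own cost model and would handle general graphs rather than a chain.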
