Lightweight CNN–Transformer Hybrid Network for Efficient Face Super-Resolution
Ao-Lin Liu, Yi-Han Xu, Wen Zhou

Face super-resolution (FSR) aims to reconstruct high-quality, high-resolution face images from low-resolution inputs. CNN–Transformer hybrid models have shown promising performance by jointly modeling local textures and global dependencies, but their large parameter counts and high computational costs hinder deployment in resource-constrained settings such as mobile devices and embedded systems. Meanwhile, existing lightweight SR models usually reduce complexity by cutting network depth, narrowing channel dimensions, or simplifying convolutional operations, which can weaken feature representation and leave fine facial structures insufficiently recovered. To address these issues, this paper proposes HCTIUNet, a lightweight CNN–Transformer hybrid network based on an inverted U-shaped architecture. The network integrates lightweight CNN branches for local facial texture extraction with Transformer branches for global dependency modeling, and introduces a multi-scale feature interaction strategy and a global feature refinement module to enhance facial structural details. Experimental results on the FFHQ, CelebA, and Helen datasets show that HCTIUNet achieves competitive performance under the ×8 face super-resolution setting, obtaining PSNR/SSIM/LPIPS values of 27.55 dB/0.765/0.225, 27.63 dB/0.761/0.212, and 27.53 dB/0.777/0.213, respectively. Moreover, HCTIUNet contains 10.5 M parameters, requires 9.9 G FLOPs, and achieves an inference time of 0.021 s. These results indicate that the proposed method offers a favorable trade-off among reconstruction accuracy, perceptual quality, and computational efficiency, making it suitable for efficient face super-resolution applications.
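For readers less familiar with the PSNR figures quoted above, the metric can be sketched as follows. This is a minimal illustration of the standard definition, PSNR = 10 log10(MAX² / MSE), and not the authors' evaluation code. The function name `psnr`, the nested-list image representation, and the 8-bit peak value of 255 are assumptions made for this example; the abstract does not state whether the reported values were computed on RGB or luminance channels.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Return the peak signal-to-noise ratio in dB between two images.

    Both images are nested lists (rows of pixel intensities) of the same
    size. max_value is the assumed peak intensity (255 for 8-bit images).
    """
    ref_flat = [p for row in reference for p in row]
    est_flat = [p for row in estimate for p in row]
    if not ref_flat or len(ref_flat) != len(est_flat):
        raise ValueError("images must be non-empty and of equal size")

    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(ref_flat, est_flat)) / len(ref_flat)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 1x2 images whose pixels each differ by 10 intensity levels.
value = psnr([[100, 100]], [[110, 90]])
print(f"{value:.2f} dB")
```

Higher PSNR indicates lower pixel-wise error. SSIM and LPIPS, the other two reported metrics, capture structural and perceptual similarity and are not covered by this sketch.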