TR-ABFT: Tile-Resilient Fault Detection for Neural Processing Units
Yang Hua, Yunhong Bai, Bo Wang, Wei Zhuang, Yuanfu Zhao

Spaceborne neural processing units (NPUs) increasingly support real-time deep-learning inference, but their dense multiply-accumulate arrays are vulnerable to radiation-induced soft errors. Conventional radiation-hardening methods improve reliability through hardware redundancy, but they incur substantial area, performance, and compiler-mapping overheads. This paper proposes tile-resilient algorithm-based fault tolerance (TR-ABFT), a software-scheduled, detection-oriented scheme for quantized NPU inference. TR-ABFT generates checksum information at tile granularity and maps checking tasks onto the original processing element (PE) array without changing the hardware topology. To make ABFT compatible with INT8 datapaths, we design two checksum-coding strategies: checksum decomposition and modulo-239 checksum coding. The modulo-239 scheme removes structural missed detections for two-bit flips with bit-position spacings in (1, 31), while preserving compatibility with signed INT8 inputs. Evaluations on ResNet, YOLOv8, and RT-DETR show that, on a 16×16 array, TR-ABFT introduces only 6.37% to 24.61% additional computational overhead. By converting spatial redundancy into schedulable temporal redundancy, TR-ABFT preserves systolic-array regularity and provides a low-overhead reliability-enhancement mechanism for space-grade neural-network accelerators.
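
As a rough illustration of the tile-level checking idea, the sketch below applies a classic column-checksum ABFT test to one integer tile product, with all checksum arithmetic reduced modulo 239. It is a minimal sketch under stated assumptions, not the scheme as implemented in TR-ABFT: the function names are hypothetical, the 32-bit two's-complement accumulator width is assumed, and the centered residue range [-119, 119] is our own choice to show how a modulo-239 checksum operand can stay within signed INT8. The checksum-decomposition strategy and the mapping of checking tasks onto the PE array are not modeled.

```python
M = 239  # checksum modulus named in the abstract


def centered_mod(x, m=M):
    """Residue of x in [-(m - 1) // 2, (m - 1) // 2]; for m = 239 this
    is [-119, 119], which fits a signed INT8 operand (our assumption)."""
    r = x % m
    return r - m if r > m // 2 else r


def matmul(a, b):
    """Plain integer tile product C = A @ B on nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def column_checksum(a):
    """Column sums of A, each reduced to a centered residue mod M."""
    return [centered_mod(sum(row[k] for row in a)) for k in range(len(a[0]))]


def tile_is_consistent(a, b, c):
    """Check sum_i C[i][j] == (checksum(A) @ B)[j] (mod M) for every column j."""
    cs = column_checksum(a)
    for j in range(len(b[0])):
        expected = sum(cs[k] * b[k][j] for k in range(len(b))) % M
        observed = sum(row[j] for row in c) % M
        if expected != observed:
            return False
    return True


def flip_bits(value, bits, width=32):
    """Flip the given bit positions of a two's-complement accumulator word
    (width is an assumed accumulator size) and return the signed result."""
    word = value & ((1 << width) - 1)
    for bit in bits:
        word ^= 1 << bit
    return word - (1 << width) if word >> (width - 1) else word


# Small signed INT8 tile operands, chosen arbitrarily for illustration.
A = [[12, -7, 127], [-128, 5, 33]]
B = [[3, -4], [100, 27], [-90, 8]]
C = matmul(A, B)

clean_ok = tile_is_consistent(A, B, C)

# Inject a two-bit upset into one accumulator word and re-run the check.
C_faulty = [row[:] for row in C]
C_faulty[0][0] = flip_bits(C[0][0], (3, 20))
fault_detected = not tile_is_consistent(A, B, C_faulty)
```

Under these assumptions, a two-bit flip at positions i < j changes an accumulator word by ±2^i ± 2^j = 2^i(±1 ± 2^(j−i)), so it escapes a modulo-239 check only if 2^(j−i) ≡ ±1 (mod 239). That congruence has no solution for spacings 1 through 31, which is consistent with the detection property stated in the abstract.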