A Systematic PRISMA Survey on Fault-Tolerant DNN Accelerator Architectures for Safety-Critical Systems
Farah Natiq Qassabbashi, Shawkat Sabah Khairullah, Shefa A. Dawwd

Deep Neural Networks (DNNs) are increasingly used in industrial safety-critical autonomous applications such as autonomous vehicles, industrial robotics, and medical instrumentation and control systems. Ensuring reliable and robust operation of DNN-based safety-critical systems is challenging because the hardware accelerators used for inference are structurally complex and susceptible to multiple faults, common-cause faults, data uncertainties, and unpredictable erroneous behavior. Additionally, transient, permanent, and timing faults affect the accelerator's processing elements, memory arrays, and datapaths, propagate through DNN computations, and can potentially cause catastrophic failures at the system level. The objective of this survey is to systematically evaluate state-of-the-art fault-tolerant DNN accelerator architectures, with particular emphasis on their applicability to safety-critical autonomous systems in industry. The survey examines architectural perspectives, fault modeling, platform-level trade-offs, runtime resilience, validation practices, and certification readiness, following a PRISMA methodology with evidence-driven synthesis and unbiased study selection. Database searches across IEEE Xplore, Scopus, and Web of Science identified 200 records, of which 82 studies were included based on predefined inclusion and exclusion criteria emphasizing industrial safety-critical relevance, hardware-level fault modeling, and architectural-level implementation. The results indicate a clear shift from traditional redundancy-based approaches to cross-layer and adaptive approaches that provide better trade-offs between performance, reliability, and hardware overhead.
However, most of the surveyed studies rely on simplified fault models, incomplete validation procedures, and limited consideration of system-level and certification needs, and they often overlook critical failure modes such as Silent Data Corruption (SDC). These limitations leave a significant gap between research-level solutions and industrial deployment requirements. This survey underscores the need for scalable, integrated, and certification-aware design approaches that connect fault modeling, architectural resilience, validation, and safety assurance in order to develop reliable and deployable DNN accelerator systems for next-generation industrial safety-critical autonomous applications.