DOI: 10.3390/electronics15132792 ISSN: 2079-9292

Architecture-Level Risk-Guided Fault-Injection Prioritization for Systolic AI Accelerators: A Fixed Candidate-Pool Evaluation

Larisa Goffman-Vinopal

Fault-injection campaigns are widely used to evaluate silent data corruption (SDC) in AI hardware, but exhaustive campaigns over workloads, dataflows, processing elements, and datapath roles are expensive. This paper presents an architecture-level, risk-guided fault-injection prioritization method for systolic AI accelerators. The method ranks candidate transient functional perturbations before downstream validation, so that a limited validation budget is spent first on the candidates most likely to exceed a relative-output-error threshold. The evaluation uses a fixed candidate fault pool: all ranking policies score the same 21,000 candidate faults across 30 workload/dataflow/array configurations (five GEMM-derived workloads, three array sizes, and two dataflows). Fault magnitudes are sampled once per candidate and are independent of all ranking scores. Candidate faults are modeled as transient architecture-level perturbations in MAC, accumulator, or forwarding paths. The full-risk score combines activity, composite spatial stress, tensor sensitivity, and a path-class weight. Within the architecture-level simulation environment and under the fixed-pool protocol, the method achieves the highest mean top-10% SDC-proxy lift, AUPRC, NDCG@10%, and rank correlation with relative output error among the evaluated principle-based ranking policies. At the calibrated threshold, it achieves a mean top-10% lift of 5.65× [4.91, 6.38], compared with 4.61× for AVF-like exposure and 4.33× for output sensitivity. Paired configuration-level tests, threshold sensitivity, and outcome-model sensitivity analyses characterize the result and show that the score is not universally dominant under every synthetic outcome assumption. The method is intended as a front-end, architecture-level screening tool for validation prioritization, not as a replacement for RTL, gate-level, FPGA, or silicon reliability signoff.
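To make the headline metric concrete, the following is a minimal sketch of how a top-10% lift could be computed for one ranking policy over a fixed candidate pool: the hit rate among the top-scored 10% of candidates divided by the hit rate over the whole pool. The function name `top_fraction_lift`, the rounding of the cutoff, and the toy data are assumptions made for illustration; the paper's exact thresholding, tie-breaking, and per-configuration averaging are not specified in the abstract and are not reproduced here.

```python
def top_fraction_lift(scores, outcomes, fraction=0.10):
    """Hit rate in the top-scored fraction divided by the pool-wide hit rate.

    scores   -- one ranking score per candidate fault (higher = inject sooner)
    outcomes -- 1 if the candidate exceeded the error threshold, else 0
    fraction -- share of the pool covered by the validation budget
    """
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and equal length")

    n = len(scores)
    k = max(1, round(fraction * n))  # budget size; rounding rule is an assumption

    base_rate = sum(outcomes) / n
    if base_rate == 0:
        return float("nan")  # lift is undefined when the pool has no positives

    # Rank candidates by descending score; Python's sort is stable, so ties
    # keep their original pool order (another assumption of this sketch).
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_hits = sum(outcomes[i] for i in order[:k])

    return (top_hits / k) / base_rate


# Toy pool of 20 candidates with 4 threshold-exceeding faults (hypothetical data).
toy_scores = [float(i) for i in range(20)]
toy_outcomes = [0] * 20
for idx in (19, 18, 5, 2):
    toy_outcomes[idx] = 1

print(top_fraction_lift(toy_scores, toy_outcomes))
```

In this toy pool the two highest-scored candidates are both positives, so the top-10% hit rate is 1.0 against a pool-wide rate of 0.2. A policy that ranks at random has an expected lift of about 1, which is the baseline against which values such as 5.65× are read.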
