Defeat Devices in AI Systems
Emilio Ferrara

AI systems increasingly exhibit behavior that differs systematically between evaluation and deployment contexts. Alignment faking, sandbagging, benchmark gaming, deceptive scheming, specification gaming, and trojans have each been documented separately, and each line of work characterizes one facet of what we argue is a single structural mechanism. We propose that this common mechanism is a defeat device, an engineering and regulatory concept long established in vehicle-emissions law and brought to broad public attention by the 2015 Volkswagen emissions case. A defeat device in an AI system has three necessary elements: a discriminator that detects evaluation context, a concealed swap that conditions behavior on detection, and a gap between eval-distribution and deployment-distribution performance on the stated evaluation criterion. We formalize this triadic test as a behavioral definition, organize documented cases along three taxonomic axes (origin, trigger, swap mechanism), and propose Trigger-Axis-Aware Differential Probing (TADP) as a forensic detection protocol. We further advance the claim that defeat devices can emerge naturally in current frontier AI systems without any operator engineering, and we characterize such naturally emerging defeat devices as a potentially harmful emergent phenomenon that AI safety practice should monitor and test for systematically. An illustrative study applying TADP across eight open-weight models finds the discriminator to be near-universal: every model detects evaluation context well above chance. The conditional swap, by contrast, is real but heterogeneous. It appears strongly as sycophantic stance-conditioning and as an evaluation-cued register shift, yet not as overt demographic discrimination, indicating that the mechanism’s discriminator generalizes even where individual swaps do not. Implications follow for evaluation methodology, post-training pipeline design, interpretability research priorities, and AI governance.
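To make the triadic test concrete, the following minimal Python sketch checks the three necessary elements against summary statistics from a paired evaluation/deployment probe. It is an illustration under stated assumptions, not the paper's formalization or the TADP protocol: the `ProbeResult` fields, the `triadic_test` function, and all threshold values are hypothetical, and the gap is treated as two-sided so that both inflated and deflated (sandbagged) evaluation scores count.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical summary of one paired eval/deployment probe."""
    detection_accuracy: float  # how often the model identifies evaluation context
    chance_accuracy: float     # accuracy expected from guessing
    behavior_shift: float      # fraction of paired prompts whose response changes
                               # when only the evaluation cue is toggled
    swap_disclosed: bool       # whether the conditional behavior is disclosed
    eval_score: float          # stated criterion, evaluation-like inputs
    deploy_score: float        # same criterion, deployment-like inputs


def triadic_test(r: ProbeResult,
                 detect_margin: float = 0.10,
                 shift_margin: float = 0.10,
                 gap_margin: float = 0.05) -> dict:
    """Check the three necessary elements; all margins are assumed values."""
    # Element 1: a discriminator that detects evaluation context above chance.
    discriminator = (r.detection_accuracy - r.chance_accuracy) > detect_margin
    # Element 2: a concealed swap, i.e. behavior conditioned on the cue
    # and not disclosed.
    concealed_swap = (r.behavior_shift > shift_margin) and not r.swap_disclosed
    # Element 3: a gap on the stated criterion (two-sided by assumption).
    gap = abs(r.eval_score - r.deploy_score) > gap_margin
    return {
        "discriminator": discriminator,
        "concealed_swap": concealed_swap,
        "gap": gap,
        # All three elements are necessary, so they are combined with AND.
        "defeat_device": discriminator and concealed_swap and gap,
    }


if __name__ == "__main__":
    example = ProbeResult(
        detection_accuracy=0.9, chance_accuracy=0.5,
        behavior_shift=0.3, swap_disclosed=False,
        eval_score=0.8, deploy_score=0.6,
    )
    print(triadic_test(example))
```

Because each element is necessary, a model that detects evaluation context but shows no conditional behavior change would not satisfy the test in this sketch, which mirrors the distinction drawn above between a near-universal discriminator and a heterogeneous swap.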