A Comparative Study of Large Language Models for Industrial Cyber-Physical Security
J. de Curtò, I. de Zarzà, Juan Carlos Cano, Carlos T. CalafateIntrusion detection in industrial cyber-physical systems is constrained by small labelled-attack corpora and by the subtler signal of physical-process attacks compared with classical IT-network intrusions, motivating renewed interest in foundation-model-based detectors; classical detectors are typically trained per dataset and degrade under the distribution shift that is common in operational technology, where attack repertoires evolve faster than retraining cycles. Two foundation-model families are now plausible candidates: open-source Large Language Models (LLMs) and recent tabular foundation models (TabPFN, TabICL) pre-trained for in-context tabular inference. We compare the two families head-to-head, alongside Random Forest and XGBoost classical anchors, across three established industrial security benchmarks (SWaT, HAI, WUSTL-IIoT-2021) under a controlled multi-seed full-holdout protocol with paired McNemar and cross-seed Mann–Whitney tests. The empirical picture is dataset-dependent rather than universal: tabular foundation models establish a strong, previously unreported baseline that is competitive with or superior to classical anchors on every dataset evaluated, while LLMs are complementary detectors with a specific advantage on schemas that carry process-engineering semantics (such as SWaT’s named sensor channels). A per-class analysis on the WUSTL five-class attack taxonomy shows that the two families have structurally different strengths: tabular methods dominate traffic-rich attacks (Denial-of-Service, Reconnaissance), whereas LLMs are competitive on rare attack types (Backdoor, Command Injection). A confidence-gated cascade that escalates only low-confidence tabular decisions to an LLM exceeds either detector alone at a small query budget, and a leave-one-attack-type-out analysis shows that foundation-model detectors generalise to unseen attack families substantially better than the classical anchors. The appropriate detector choice in industrial cyber-physical security is therefore informed by the dataset’s feature schema, the attack-type mix, and the operational cost envelope, rather than by a specific performance metric.