DOI: 10.12688/f1000research.182153.1 ISSN: 2046-1402

NetFormer: A Dual-Stream Interpretable Transformer Autoencoder for Unsupervised Network Intrusion Detection

Mohammed A.S Al-Hitawi, Hiba A. Abu-Alsaad, Osama Mohammed, Omar Altalebi
Background The growing complexity and frequency of cyberattacks demand intrusion detection systems (IDS) that accurately identify malicious activity with very low false-positive rates and minimal latency. Traditional rule-based and classical machine learning methods fail to capture the long-range temporal dependencies inherent in multi-stage attacks, and even recurrent neural networks struggle with vanishing gradients over long sequences. Transformers, with self-attention, can model such dependencies, but their application to unsupervised network anomaly detection with mixed data types remains limited.
Methods We introduce NetFormer, a novel Transformer-based unsupervised anomaly detection framework for network traffic time-series. The model features (1) a dual-stream embedding system that separately handles categorical and numerical features, (2) a reconstruction-based autoencoder trained exclusively on normal traffic to compute anomaly scores, and (3) an interpretability framework that visualizes attention maps to explain detection decisions. The architecture uses multi-head self-attention across multiple layers and is trained using mean squared error reconstruction loss on fixed-length flow windows.
Results Evaluated on the CSE-CIC-IDS2018 benchmark, NetFormer achieves an F1-score of 0.851, precision of 0.842, recall of 0.861, and a false positive rate of only 1.24%, outperforming classical, LSTM-based, and other Transformer baselines. It excels at detecting volumetric attacks (DDoS F1 = 0.913) and also shows strong performance on slow-rate and subtle anomalies. Cross-dataset validation on UNSW-NB15 confirms robust generalization (F1 = 0.839). Attention map analysis demonstrates that the model focuses on attack-relevant time steps and traffic features, providing actionable interpretability.
Conclusions The findings indicate that a reconstruction-based Transformer autoencoder with dedicated dual-stream embeddings effectively captures long-range temporal patterns for unsupervised network intrusion detection. The combination of high detection performance, low false-alarm rate, and built-in interpretability makes NetFormer a viable platform for operational security environments. Future work will address lightweight deployment, adaptive thresholding, and integration with graph-based topologies.
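
The reconstruction-based scoring described under Methods can be illustrated with a minimal sketch. It is not the NetFormer implementation: the abstract does not say how the detection threshold is chosen, so the 99th-percentile rule below is an assumption, and the function names (`reconstruction_score`, `fit_threshold`, `make_mean_reconstructor`) are hypothetical. A simple mean-profile reconstructor stands in for the trained Transformer autoencoder, and the toy windows are made up for illustration.

```python
from statistics import mean, quantiles
from typing import Callable, List

Window = List[List[float]]  # one flow window: time steps x numerical features
Reconstructor = Callable[[Window], Window]


def reconstruction_score(window: Window, reconstruct: Reconstructor) -> float:
    """Anomaly score: mean squared error between a window and its reconstruction."""
    rebuilt = reconstruct(window)
    return mean(
        (x - y) ** 2
        for row, rebuilt_row in zip(window, rebuilt)
        for x, y in zip(row, rebuilt_row)
    )


def fit_threshold(normal_scores: List[float], percentile: int = 99) -> float:
    """Assumed rule: flag scores above a high percentile of normal-traffic scores."""
    cut_points = quantiles(normal_scores, n=100, method="inclusive")
    return cut_points[percentile - 1]


def make_mean_reconstructor(normal_windows: List[Window]) -> Reconstructor:
    """Stand-in for the trained autoencoder: always returns the mean normal window."""
    n_steps = len(normal_windows[0])
    n_features = len(normal_windows[0][0])
    profile = [
        [mean(w[t][f] for w in normal_windows) for f in range(n_features)]
        for t in range(n_steps)
    ]
    return lambda window: profile


# Toy "normal" traffic: 20 windows of 4 time steps and 2 features each.
normal_windows = [
    [[1.0 + 0.01 * i, 2.0 - 0.01 * i] for _ in range(4)] for i in range(20)
]
reconstruct = make_mean_reconstructor(normal_windows)
normal_scores = [reconstruction_score(w, reconstruct) for w in normal_windows]
threshold = fit_threshold(normal_scores)

# A window far from the normal profile reconstructs poorly and scores high.
attack_window = [[50.0, 0.0] for _ in range(4)]
attack_score = reconstruction_score(attack_window, reconstruct)
print("attack flagged:", attack_score > threshold)
```

Because the threshold is fitted only on normal-traffic scores, no attack labels are needed, which matches the unsupervised setting described in the abstract. Replacing the stand-in with a trained model would leave the scoring and thresholding steps unchanged.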
