Forensic-BERT: Explainable Transformer-Based Detection of Concealed Evidence in Cross-Platform Volatile Memory
Yousef Sanjalawe, Salam Al-E’mari, Sharif Naser Makhadmeh
Advanced cyber threats increasingly exploit volatile memory to execute malicious payloads without touching persistent storage, rendering traditional disk-centric forensic tools insufficient for comprehensive digital investigations. This paper presents Forensic-BERT, an AI-driven forensic framework that automatically extracts and classifies potentially relevant artifacts from unstructured memory dumps across heterogeneous operating environments. The framework combines byte-boundary-preserving Hex-to-ASCII conversion, sliding-window Shannon entropy filtering (H > 7.2 bits per byte, 256-byte windows) to isolate high-probability artifact regions, and a binary-aware WordPiece tokenizer extended with 2048 domain-specific tokens covering hexadecimal byte patterns, Windows API names, and Linux system-call sequences. These components feed a transformer-based classifier fine-tuned from bert-base-uncased (110M parameters) on memory-derived text, with sliding-window inference and majority-vote aggregation for large images. A SHAP DeepExplainer module and averaged 12-head attention heatmaps provide transparent, analyst-accessible explanations for classification decisions. We evaluate the framework on a multi-source corpus of 735 labeled memory segments drawn from 197 distinct images across four independent collections (MemLabs, the DARPA Transparent Computing program, Digital Corpora, and live sandbox execution traces from Any.run and Joe Sandbox), spanning Windows XP through Windows 11, Ubuntu Linux 16.04/18.04, and FreeBSD. Source-stratified five-fold cross-validation yields an overall F1-score of 0.92 ± 0.02 and an AUC-ROC of 0.95 ± 0.01 (95% CI).
Forensic-BERT outperforms all six baselines: Volatility with YARA rules (F1 = 0.71), Random Forest (F1 = 0.82), BiLSTM with GloVe embeddings (F1 = 0.85), MRm-DLDet (F1 = 0.87), SPECTRE (F1 = 0.89), and SecBERT (F1 = 0.90). Every pairwise difference is statistically significant under the McNemar test with Bonferroni correction. Explainability quality is independently confirmed by a Spearman rank correlation of ρ = 0.81 between model SHAP token rankings and expert forensic-indicator rankings, and by a System Usability Scale score of 73.2 among certified examiners. The complete pipeline processes 512 MB memory images in 7.5–10.2 s (GPU) or 38–52 s (CPU-only), scaling to 4 GB images with near-linear throughput. These results indicate that, on the corpus evaluated here, combining domain-adapted NLP preprocessing, transformer-based sequence modeling, and quantified explainability can improve the effectiveness and usability of analyst decision support and investigative triage for volatile memory analysis.
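To make two of the preprocessing and inference steps named in the abstract concrete, the following is a minimal Python sketch of sliding-window Shannon entropy filtering and majority-vote aggregation. It is an illustration under stated assumptions, not the authors' implementation: only the 256-byte window and the 7.2 bits-per-byte threshold come from the abstract, while the function names, the non-overlapping default stride, the handling of a trailing partial window, and the tie-breaking rule are hypothetical choices made here.

```python
import math
from collections import Counter

WINDOW = 256      # window size in bytes (stated in the abstract)
THRESHOLD = 7.2   # entropy cutoff in bits per byte (stated in the abstract)


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (range 0.0 to 8.0)."""
    if not chunk:
        return 0.0
    n = len(chunk)
    # H = sum over byte values of p * log2(1 / p), with p = count / n
    return sum((c / n) * math.log2(n / c) for c in Counter(chunk).values())


def high_entropy_windows(data: bytes, window: int = WINDOW,
                         stride: int = WINDOW, threshold: float = THRESHOLD):
    """Yield (offset, entropy) for every full window with entropy > threshold.

    Assumption: the stride is not given in the abstract, so non-overlapping
    windows are used by default. A trailing partial window is skipped.
    """
    for offset in range(0, len(data) - window + 1, stride):
        h = shannon_entropy(data[offset:offset + window])
        if h > threshold:
            yield offset, h


def majority_vote(labels):
    """Return the most frequent per-window label.

    Assumption: ties go to the label seen first, which is how
    Counter.most_common orders elements with equal counts.
    """
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Synthetic image: a constant region, a maximally diverse region, a constant region.
    image = bytes(256) + bytes(range(256)) + bytes(256)
    print(list(high_entropy_windows(image)))
    print(majority_vote(["malicious", "benign", "malicious"]))
```

In this sketch only the middle window, in which every byte value occurs exactly once, passes the filter, and the per-window labels are reduced to a single image-level label. A real pipeline would pass the selected regions to the tokenizer and classifier described above before aggregating their predictions.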