DOI: 10.3390/electronics15132848 ISSN: 2079-9292

Addressing Data Scarcity in Malware Classification via Pixel-Level Synthetic Image Generation

Mounika Krishna Teja Karumudi, Fabio Di Troia

Deep learning-based malware classification using image representations has emerged as an effective paradigm for threat detection. However, training robust neural networks is frequently bottlenecked by data scarcity and severe class imbalance in real-world repositories. This study investigates the viability of using an autoregressive PixelCNN framework to synthesize high-fidelity, class-specific malware images to augment limited training distributions. Using the benchmark Malimg dataset, we systematically evaluate a Convolutional Neural Network (CNN) classifier across varying ratios of synthetic-to-authentic data under strict data scarcity constraints (10 to 80 authentic samples per family). Our experimental results reveal that although PixelCNN successfully replicates intricate, byte-level micro-textures, classifiers trained exclusively on synthetic data suffer catastrophic performance degradation, yielding an accuracy of just 3%. Introducing a minimal authentic data anchor (15% to 20%) restores functional decision boundaries, raising classification accuracy to as much as 72%. Performance then saturates rapidly once the training set reaches a 50/50 synthetic-to-authentic split, achieving up to 82% classification accuracy, which is competitive with the 89% upper bound of a fully authentic baseline. These findings demonstrate a high degree of data efficiency and indicate that generative autoregressive augmentation can halve the authentic data collection burden in cybersecurity workflows, provided a small real-world anchor is preserved.
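The evaluation described above varies the share of synthetic images in each family's training set. A minimal sketch of how such a mix might be assembled is given below. It is not the authors' code: the function name `mix_training_set`, the fixed per-family budget `total`, and the use of string placeholders in place of image arrays are all assumptions made for illustration.

```python
import random


def mix_training_set(authentic, synthetic, synthetic_fraction, total, seed=0):
    """Draw `total` samples for one family with a given synthetic share.

    Hypothetical helper: the abstract does not state how the mixes were
    built, so a fixed per-family budget `total` is assumed here.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")

    # Split the budget between synthetic and authentic samples.
    n_synth = round(total * synthetic_fraction)
    n_auth = total - n_synth
    if n_auth > len(authentic) or n_synth > len(synthetic):
        raise ValueError("not enough samples for the requested mix")

    # Sample without replacement, tag each item with its origin, and shuffle.
    rng = random.Random(seed)
    mixed = [(x, "authentic") for x in rng.sample(authentic, n_auth)]
    mixed += [(x, "synthetic") for x in rng.sample(synthetic, n_synth)]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Placeholder identifiers stand in for real and PixelCNN-generated images.
    real = [f"real_{i}" for i in range(80)]
    fake = [f"fake_{i}" for i in range(80)]
    half = mix_training_set(real, fake, synthetic_fraction=0.5, total=80)
    print(sum(1 for _, origin in half if origin == "synthetic"), "synthetic of", len(half))
```

Setting `synthetic_fraction=1.0` would correspond to the synthetic-only condition, and `0.0` to the fully authentic baseline; intermediate values give the mixed conditions.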
