Automated Cyber Threat Intelligence Extraction from Distributed Honeypots: A Hybrid Machine Learning Approach
Hessa Abdulaziz AlJuhaiman, Qazi Emad-ul-Haq, Kyounggon Kim, Seokhee Lee
The exponential growth of Indicators of Compromise (IoCs) has overwhelmed manual triage processes in Security Operations Centers (SOCs), necessitating automated solutions for large-scale log analysis. This study proposes a hybrid machine learning framework that integrates supervised and unsupervised learning to automate the classification, clustering, and contextual interpretation of Cyber Threat Intelligence (CTI). The primary contribution is a multi-stage feature engineering pipeline that enriches raw Security Information and Event Management (SIEM) logs with cyclical temporal encoding and geographical metadata. In the supervised phase, a comparative evaluation of three gradient boosting classifiers (XGBoost, LightGBM, and CatBoost) shows that all three achieve competitive performance in categorizing known attack techniques and consistently outperform the Random Forest baseline. The results indicate that classifier performance is dataset-dependent, so practitioners are encouraged to select the model best suited to their operational environment. In parallel, the unsupervised phase employs density-based clustering to identify emerging and previously unknown threat patterns by correlating adversarial behaviors with source attribution. Combining the two approaches makes near-real-time operation feasible and improves the scalability of automated threat extraction from distributed honeypot environments.
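
The cyclical temporal encoding mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify which time fields are encoded or how, so the snippet below assumes the common sine/cosine mapping applied to hour of day and day of week. The function names (`encode_cyclical`, `temporal_features`, `feature_distance`) and the feature keys are hypothetical and are not taken from the authors' pipeline.

```python
import math
from datetime import datetime
from typing import Dict, Tuple


def encode_cyclical(value: float, period: float) -> Tuple[float, float]:
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def temporal_features(timestamp: datetime) -> Dict[str, float]:
    """Build illustrative cyclical features from a log event timestamp.

    Assumption: hour of day (period 24) and day of week (period 7) are
    the encoded fields; the source does not list the actual fields.
    """
    hour = timestamp.hour + timestamp.minute / 60.0
    hour_sin, hour_cos = encode_cyclical(hour, 24.0)
    dow_sin, dow_cos = encode_cyclical(float(timestamp.weekday()), 7.0)
    return {
        "hour_sin": hour_sin,
        "hour_cos": hour_cos,
        "dow_sin": dow_sin,
        "dow_cos": dow_cos,
    }


def feature_distance(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Euclidean distance between two encoded points on the unit circle."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


if __name__ == "__main__":
    late = encode_cyclical(23.0, 24.0)   # 23:00
    early = encode_cyclical(1.0, 24.0)   # 01:00
    noon = encode_cyclical(12.0, 24.0)   # 12:00
    # Events just before and just after midnight stay close together.
    print(feature_distance(late, early) < feature_distance(late, noon))
    print(temporal_features(datetime(2024, 1, 1, 6, 0)))
```

The point of this design is that a raw hour value places 23:00 and 01:00 twenty-two units apart, whereas the paired sine and cosine features keep them adjacent. This matters for both tree-based classifiers and distance-based clustering when attack activity spans midnight.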