Systematic Literature Review on Software Security Vulnerability Information Extraction
Sofonias Yitagesu, Zhenchang Xing, Xiaowang Zhang, Zhiyong Feng, Tingting Bi, Linyi Han, Xiaohong Li
Background. Software vulnerabilities are increasing in complexity and scale, posing great security risks to many software systems. Extracting information about software vulnerabilities is a critical area of research that aims to identify vulnerability-related information and create a structured representation of it. This structured data helps software systems better understand vulnerabilities and provides security professionals with timely information to mitigate the impact of rapidly growing vulnerabilities, while guiding future research to develop more secure systems. However, this process depends on how effectively information extraction can turn the manual vulnerability analysis performed by security experts into digital solutions. Despite its importance, the unique nature of vulnerability information and the fast pace at which machine learning-based extraction methods and techniques have evolved make it challenging to assess the current successes, failures, challenges, and opportunities within this research area. This study presents a systematic literature review aimed at clarifying this complex landscape.
Methods. In this study, we conduct a systematic literature review (SLR) to explore existing research on extracting information about software security vulnerabilities. Our search retrieves 829 primary studies on security vulnerability information extraction from seven widely used online digital libraries, focusing on top peer-reviewed journals and conferences published between 2001 and 2024. After applying our inclusion and exclusion criteria and the snowballing technique, we narrow our selection to 87 studies for in-depth analysis and address four main research questions. We collect qualitative and quantitative data from each study, identifying 34 components such as research problems, methods, contributions, evaluation metrics, results, types of extracted vulnerability information, challenges, and limitations. We use meta-analysis, statistical machine learning, and text-mining techniques to identify themes, patterns, and trends across the primary studies and to visualize the findings.
Results. The study provides an overview of the security vulnerability data landscape, identifies key resources, and guides efforts to improve vulnerability information extraction and analysis. The study finds a diverse landscape of learning algorithms used in security vulnerability information extraction, with Bidirectional Encoder Representations from Transformers (BERT), Long Short-Term Memory (LSTM), and Support Vector Machine (SVM) being the most widely used. The study identifies key challenges, including feature engineering complexity, the lack of a gold-standard corpus, preprocessing errors, the generation of accurate training data, imbalanced data, multimodality fusion, and graph sparsity in security knowledge graphs.
Insights for Future Research Directions. The study underscores the need for advanced extraction approaches, robust datasets, automated annotation methods, and advanced machine learning algorithms to improve the extraction of security vulnerability information. This study also suggests using large language models (LLMs) and transformer models to facilitate the automatic extraction of security-related words, terms, concepts, and phrases and to introduce new filtering parameters for user requirements. We provide all our implementations; they can be found at