Machine Learning Prediction of Thermal Properties of PHB/PHBV-Based Materials: A Quantitative Structure–Property Relationship Approach Using an Integrated Polymer Database
Nikolaos P. Sotiropoulos, Leonidas Mindrinos, Jean-David Peltier, Konstantina V. Filippou, Marianna I. Kotzabasaki, Nikolaos Tsigkas, Chrysanthos Maraveas
Bio-based and biodegradable polymers such as short-chain-length (scl) poly(3-hydroxybutyrate) (PHB) and poly(3-hydroxybutyrate-co-3-hydroxyvalerate) (PHBV) are widely adopted in diverse areas such as healthcare, manufacturing, and packaging. However, high production costs and the complexity of tailoring their thermal properties, such as glass transition temperature (Tg), melting temperature (Tm), and crystallization temperature (Tc), hinder further adoption. The current study reported on the development of a raw dataset of PHB and PHBV materials compiled from 572 instances collected from the literature (558 instances) and in-house experiments (14 instances). The dataset encompassed compositional physicochemical parameters, molecular features, and corresponding thermal characteristics. After assessing data quality and filtering for completeness and available features, curated datasets were created for machine learning (ML) analysis. Two ML models, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost), were utilized to predict values of Tg, Tc, and Tm using feature engineering methods that integrated chemistry-based descriptors with polymer-specific and experimental variables. The predictive performance of the models was systematically investigated using different combinations of input features to identify the most informative descriptor sets for each target property. The best-performing models were obtained using 118 data points for Tg and Tm and 201 data points for Tc, achieving R² values of 0.77, 0.76, and 0.82 for Tg, Tc, and Tm, respectively.
Although the models predicted the thermal properties of short-chain-length polyhydroxyalkanoates (scl-PHAs) reliably, the study was limited by the relatively small dataset size for certain targets and by incomplete or missing reporting of experimental conditions in the literature sources, which may introduce variability in the compiled data. The findings suggest that curated polymer datasets and interpretable ML models can support the rational design of sustainable polymers with tailored properties for specific applications.
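For readers unfamiliar with the reported metric, the following is a minimal sketch of how a coefficient of determination (R²) can be computed from measured and predicted values. The function name `r_squared` and the melting-temperature values are hypothetical and purely illustrative; they are not taken from the study's dataset, and the study's own evaluation protocol (for example, its train/test split) is not specified here.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left unexplained by the model.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the measurements around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measured values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical melting temperatures in degrees Celsius (illustrative only).
measured_tm = [170.0, 165.0, 158.0, 150.0, 145.0]
predicted_tm = [168.0, 166.0, 155.0, 153.0, 147.0]

score = r_squared(measured_tm, predicted_tm)
print(f"R^2 = {score:.3f}")
```

An R² of 1 corresponds to perfect agreement, while a model that always predicts the mean of the measured values scores 0; the reported values of 0.76 to 0.82 therefore indicate that the models explain roughly three quarters or more of the variance in each target.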