Semi‐Supervised Gaussian Mixture Regression With Gaussian Mixture Model Pretraining With Unlabeled Data
Hiromasa Kaneko
ABSTRACT
Gaussian mixture regression (GMR) is a useful framework for both forward prediction and direct inverse analysis because it models the joint probability distribution of input variables x and output variables y. However, when full covariance matrices are used, the number of fitting parameters increases rapidly with dimensionality, making estimation unstable when only a limited number of paired (x, y) samples are available. In this study, two semi‐supervised extensions of GMR are proposed to exploit abundant unlabeled x data. The first method, semi‐supervised GMR with x‐only GMM pretraining (ssGMR‐xGMM), builds a Gaussian mixture model (GMM) using only x samples and uses the obtained parameters to initialize the joint GMM for GMR. The second method, semi‐supervised GMR with x‐only GMM pretraining and x‐mean anchoring (ssGMR‐xGMM‐xMA), further fixes the x‐side mean vectors during training to preserve the mixture structure learned from unlabeled x data. The proposed methods were evaluated using numerical simulation data and real molecular and spectral datasets, including boiling point, aqueous solubility, pharmacological activity, environmental toxicity, and tablet spectral datasets. Compared with conventional GMR, both proposed methods improved predictive performance, and ssGMR‐xGMM‐xMA showed the best overall performance. Under the transductive semi‐supervised setting examined in this study, these results indicate that leveraging unlabeled x data can be effective for stabilizing parameter estimation and improving GMR accuracy in data‐scarce settings.
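The pretraining idea behind ssGMR‐xGMM can be illustrated with a minimal sketch. It is not the author's implementation: it assumes scikit-learn's `GaussianMixture` as the GMM, and it uses hypothetical toy data (`make_data`). The y‐side means are initialized from responsibility‐weighted labeled y values, which is an assumption, since the abstract does not say how the y‐side parameters are initialized. The x‐mean anchoring of ssGMR‐xGMM‐xMA is not shown, because it would need a custom EM loop that keeps the x‐side means fixed.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical toy data: two clusters in 2-D x, one y variable."""
    z = rng.integers(0, 2, size=n)
    centers = np.where(z == 0, -2.0, 2.0)[:, None]
    x = rng.normal(loc=centers, scale=0.5, size=(n, 2))
    y = np.where(z == 0, 1.0, -1.0) * x[:, 0] + 0.1 * rng.normal(size=n)
    return x, y[:, None]

x_unlabeled, _ = make_data(500)        # abundant x without y
x_labeled, y_labeled = make_data(20)   # few paired (x, y) samples
K, dx = 2, 2

# Step 1: x-only GMM built from all available x samples.
x_all = np.vstack([x_unlabeled, x_labeled])
gmm_x = GaussianMixture(n_components=K, covariance_type="full",
                        random_state=0).fit(x_all)

# Step 2: initialize the joint (x, y) GMM from the x-only GMM.
# Assumption: y-side means start at responsibility-weighted labeled y means.
resp = gmm_x.predict_proba(x_labeled)                       # (n_labeled, K)
y_means = (resp.T @ y_labeled) / (resp.sum(axis=0)[:, None] + 1e-12)
means_init = np.hstack([gmm_x.means_, y_means])             # (K, dx + dy)
weights_init = gmm_x.weights_ / gmm_x.weights_.sum()

gmm_xy = GaussianMixture(n_components=K, covariance_type="full",
                         weights_init=weights_init, means_init=means_init,
                         random_state=0)
gmm_xy.fit(np.hstack([x_labeled, y_labeled]))

# Step 3: forward GMR prediction, E[y | x] under the joint GMM.
def gmr_predict(gmm, x, dx):
    n, n_comp = x.shape[0], gmm.n_components
    dy = gmm.means_.shape[1] - dx
    log_r = np.empty((n, n_comp))
    cond_mean = np.empty((n, n_comp, dy))
    for k in range(n_comp):
        mu_x, mu_y = gmm.means_[k, :dx], gmm.means_[k, dx:]
        S = gmm.covariances_[k]
        S_xx, S_yx = S[:dx, :dx], S[dx:, :dx]
        # Unnormalized log responsibility of component k given x only.
        log_r[:, k] = np.log(gmm.weights_[k]) + multivariate_normal.logpdf(
            x, mean=mu_x, cov=S_xx)
        # Conditional mean: mu_y + S_yx S_xx^{-1} (x - mu_x).
        cond_mean[:, k, :] = mu_y + (x - mu_x) @ np.linalg.solve(S_xx, S_yx.T)
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkd->nd", r, cond_mean)

x_test, y_test = make_data(200)
y_pred = gmr_predict(gmm_xy, x_test, dx)
print("test RMSE:", float(np.sqrt(np.mean((y_pred - y_test) ** 2))))
```

The sketch passes only the mixture weights and means to the joint model; the x‐side covariance blocks from the x‐only GMM could also be carried over through `precisions_init`, at the cost of choosing initial y‐side and cross‐covariance blocks.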