A Data-Driven Evolutionary Optimization Approach for Complex Chinese Text Analysis via Surrogate Model Management
Jiheng Yuan, Jian-Yu Li
With the rapid growth of Chinese social media data, many language-driven analytical tasks, such as sentiment analysis and malicious account detection, are increasingly formulated as computationally expensive optimization problems, particularly in the context of hyperparameter tuning for deep learning models. Due to the intrinsic characteristics of Chinese text, including implicit word boundaries, strong context dependency, and high linguistic variability, the resulting feature representations are often high-dimensional, sparse, and heterogeneously distributed. From an optimization perspective, these properties induce highly irregular, non-smooth, and multimodal objective landscapes, posing significant challenges to conventional surrogate-assisted data-driven evolutionary algorithms (DDEAs). To address this problem, this paper proposes a Normal Selection-based data-driven evolutionary algorithm (NSEA) for improving surrogate-assisted optimization under complex conditions. Specifically, a Normal distribution-based selection strategy (NSS) is developed to enable probabilistic selection of surrogate models, balancing exploitation of high-performing models and exploration of alternative candidates, thereby alleviating premature convergence in multimodal search spaces. In addition, an exponential weighting ensemble (EWE) method is introduced to aggregate surrogate models based on their relative ranking performance, which enhances the stability and generalization capability of fitness approximation across different regions of the search space. Extensive experiments on benchmark functions demonstrate that the proposed NSEA consistently outperforms several state-of-the-art DDEAs in terms of optimization accuracy and robustness.
Furthermore, a real-world application of cheating official account (COA) detection on Chinese social media is conducted, in which the hyperparameter optimization of a heterogeneous graph transformer (HGT) model is formulated as an expensive optimization problem (EOP). The results further demonstrate the effectiveness and practical applicability of the NSEA in complex data-driven scenarios. Overall, this study provides an effective optimization framework for handling EOPs with complex and multimodal characteristics and offers a feasible computational approach for tasks associated with large-scale Chinese textual data.
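
To make the two mechanisms named above more concrete, the following is a minimal illustrative sketch, not the formulation used in the paper. The abstract does not give the exact formulas, so everything here is an assumption: the function names, the `decay` and `sigma` parameters, the use of a half-normal distribution over error ranks for selection, and the weight form exp(-decay * rank) for the ensemble.

```python
import math
import random


def exponential_rank_weights(errors, decay=1.0):
    """Assumed EWE-style weights: exp(-decay * rank), normalized to sum to 1.

    Rank 0 is the surrogate with the smallest validation error.
    """
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    raw = [0.0] * len(errors)
    for rank, idx in enumerate(order):
        raw[idx] = math.exp(-decay * rank)
    total = sum(raw)
    return [w / total for w in raw]


def normal_rank_selection(errors, n_select, sigma=1.0, rng=None):
    """Assumed NSS-style selection: sample ranks from |N(0, sigma)|.

    Low ranks (good surrogates) are most likely, but worse-ranked
    surrogates can still be drawn, which keeps some exploration.
    """
    rng = rng or random.Random()
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    chosen = []
    for _ in range(n_select):
        # Truncate the sampled magnitude to an integer rank and clip it.
        rank = min(int(abs(rng.gauss(0.0, sigma))), len(order) - 1)
        chosen.append(order[rank])
    return chosen


def ensemble_predict(predictions, weights):
    """Weighted sum of the surrogates' fitness predictions for one point."""
    return sum(w * p for w, p in zip(weights, predictions))
```

Under these assumptions, a larger `sigma` spreads selection toward lower-ranked surrogates (more exploration), while a larger `decay` concentrates the ensemble weight on the best-ranked surrogate (more exploitation).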