LLM-Assisted Semantic Pruning for Genetic Programming-Based Alpha Factor Discovery
Hang Chen, Rui Qi

Genetic programming (GP) has been widely used in quantitative finance for discovering formulaic alpha factors that can predict asset returns. However, GP often produces overgrown expressions that are difficult to interpret and expensive to evaluate. This paper proposes a large language model (LLM)-assisted pruning framework that reviews expression trees generated by GP, with the LLM acting as a semantic reviewer that flags redundant or financially implausible branches based on structural complexity and contextual reasoning. The proposed method is formalized as a closed-loop Trigger–Evaluate–Decide–Execute (TEDE) process. We present mathematical formulations, algorithmic design, and examples showing how redundant nested functions can be simplified while monitoring predictive performance. Experiments with high-frequency cryptocurrency market data, using DeepSeek-V4-Flash as the semantic engine, show lower expression complexity and higher rubric-based interpretability scores for the pruned symbolic factors. Under the reported test setup, the LLM-pruned configuration has higher Information Ratio (IR) values than the listed baselines and more compact expression trees than the GP baselines.
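
As a rough illustration of the TEDE loop summarized above, the following Python sketch runs one Trigger, Evaluate, Decide, Execute pass over a toy expression tree. It is not the authors' implementation. The tuple encoding of trees, the node-count complexity measure, the `max_size` and `tolerance` parameters, and every function name are hypothetical. A rule-based `stub_reviewer` stands in for the LLM, and `score` is a placeholder for a predictive metric such as IR.

```python
from typing import Callable, Union

# A tree is either a leaf name (str) or a tuple (op, child, ...).
Expr = Union[str, tuple]

# Assumption: these operators satisfy f(f(x)) == f(x), so nesting is redundant.
IDEMPOTENT_OPS = {"abs", "sign"}


def size(expr: Expr) -> int:
    """Node count, used here as a simple complexity measure."""
    if isinstance(expr, str):
        return 1
    return 1 + sum(size(child) for child in expr[1:])


def stub_reviewer(expr: Expr) -> Expr:
    """Rule-based stand-in for the LLM reviewer: collapse f(f(x)) to f(x)."""
    if isinstance(expr, str):
        return expr
    op, *children = expr
    children = [stub_reviewer(child) for child in children]  # bottom-up
    if op in IDEMPOTENT_OPS and len(children) == 1:
        child = children[0]
        if isinstance(child, tuple) and child[0] == op:
            return child  # drop the redundant outer application
    return (op, *children)


def tede_step(
    expr: Expr,
    score: Callable[[Expr], float],
    reviewer: Callable[[Expr], Expr] = stub_reviewer,
    max_size: int = 5,
    tolerance: float = 0.0,
) -> Expr:
    """One Trigger-Evaluate-Decide-Execute pass over a single tree."""
    # Trigger: only review trees that exceed the complexity budget.
    if size(expr) <= max_size:
        return expr
    # Evaluate: ask the reviewer for a pruned candidate.
    candidate = reviewer(expr)
    # Decide: accept only if the tree shrinks and the score is preserved.
    shrinks = size(candidate) < size(expr)
    keeps_score = score(candidate) >= score(expr) - tolerance
    # Execute: swap in the candidate, or keep the original tree.
    return candidate if shrinks and keeps_score else expr


# Usage: abs(abs(abs(close))) * sign(sign(volume)) with a constant dummy score.
overgrown = ("mul", ("abs", ("abs", ("abs", "close"))), ("sign", ("sign", "volume")))
pruned = tede_step(overgrown, score=lambda e: 1.0)
```

Keeping the Decide step separate from the reviewer means a proposed simplification is applied only when the monitored score does not fall by more than `tolerance`, which mirrors the idea of pruning while monitoring predictive performance.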