Evaluating AI-Generated Molecules for Drug Discovery: From Generic Metrics to Translational Readiness

doi:10.3390/ijms27135916

ISSN: 1422-0067

Xiaomeng Liu, Huanxiang Liu

Artificial intelligence-driven molecular generation has become an increasingly used computational approach for proposing candidate chemical structures in early-stage drug discovery, yet the practical value of the molecules produced is often difficult to judge. Many studies still rely mainly on model-level metrics such as validity, uniqueness, novelty, and diversity. These metrics describe whether a generator produces parsable, non-redundant structures that extend beyond a reference set, but they do not show whether the molecules are chemically credible, biologically relevant, or experimentally actionable. AI-generated molecules are best treated as testable hypotheses requiring staged, complementary evidence rather than judgments based on generic generative statistics. We discuss the interpretive limits of common metrics, examine complementary levels of evaluation including medicinal chemistry feasibility, target relevance and prediction reliability, structure-based plausibility, and translational readiness, and identify recurring failure modes such as false novelty, reward exploitation, predictor bias, docking overinterpretation, and selective reporting. We propose a six-stage, failure-aware evaluation framework spanning molecular correctness, medicinal chemistry feasibility, novelty and diversity in context, target relevance and prediction reliability, structure-based plausibility, and translational readiness. This framework does not replace experimental validation; instead, it helps align computational claims with the strength of supporting evidence and promotes more transparent and reproducible evaluation of AI-generated molecules in drug discovery.
