DOI: 10.1200/jco.2026.44.19_suppl.11 ISSN: 0732-183X

An integrative multi-omics machine learning framework for precision metastasis prediction and clinical staging in non-small cell lung cancer.

Jinbin Wang, Ling Yao, Keqin Gao, Zhen Lv, Qianya Wei, Xiping Xing, Ling Jin, Jianjun Wu, Dongjing Ma

11

Background: Traditional TNM staging inadequately captures the biological aggressiveness of NSCLC. While cell cycle dysregulation is a cancer hallmark, its role in driving invasiveness remains under-characterized. We developed a Lasso-Logistic machine learning (ML) framework to integrate cell cycle transcriptomics for enhanced metastasis and staging prediction.

Methods: We integrated multi-omics data from TCGA, GEO (n=3), and CPTAC, along with five scRNA-seq datasets. A 14-gene signature was identified through Lasso-Logistic regression and used to calculate a CCRS. scRNA-seq pseudotime trajectory inference was used to assess the biological interpretability of the signature. The model was validated in vitro using four cell lines, ex vivo through RT-qPCR on cDNA microarrays with 15 paired tissues, and in an independent clinical cohort.

Results: The ML framework identified a 14-gene signature (notably CCNB1, CDK1, CCNA2) with superior discriminative power. In the discovery meta-cohort, the model achieved an AUC of 0.879 for metastasis prediction, and it maintained a C-index of 0.740 in the TCGA cohort. scRNA-seq analysis confirmed that the CCRS genes were significantly upregulated along the EMT axis (P < 0.001), identifying a specific "invasive-proliferative" cellular state. Ex vivo validation via RT-qPCR revealed significant transcriptional heterogeneity, with key drivers CCNA2 and CCNB1 exhibiting >10-fold upregulation in tumor versus adjacent normal tissues. In the independent clinical cohort, the model achieved 75% accuracy in distinguishing pathological stages, outperforming individual gene markers.

Conclusions: This study presents a rigorously validated machine learning framework that translates complex cell cycle transcriptomics into a clinically applicable tool. By bridging the gap between computational "big data" and bedside diagnostics, this framework provides a scalable solution for identifying high-risk NSCLC patients, thereby potentially facilitating the intensification of personalized treatment.
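As an illustration of how a gene-signature score of this kind is typically computed and evaluated, the sketch below forms a score as a weighted sum of expression values and estimates the AUC from pairwise rankings. It is a minimal sketch, not the authors' implementation: the coefficients, intercept, and expression values are invented for the example (the abstract does not report fitted weights), only three of the 14 genes are shown, and the Lasso fitting step itself is omitted.

```python
import math

# Hypothetical weights for illustration only; the abstract does not report
# the fitted coefficients. A Lasso penalty would set the weights of
# non-selected genes to exactly zero, leaving a sparse signature like this.
COEFFICIENTS = {"CCNB1": 0.8, "CDK1": 0.5, "CCNA2": 0.6}
INTERCEPT = -1.0


def ccrs(expression):
    """Linear predictor: intercept plus the weighted sum of gene expression."""
    return INTERCEPT + sum(w * expression[g] for g, w in COEFFICIENTS.items())


def probability(score):
    """Logistic link: map a score to a predicted probability."""
    return 1.0 / (1.0 + math.exp(-score))


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up expression profiles; label 1 = metastatic, 0 = non-metastatic.
samples = [
    ({"CCNB1": 2.0, "CDK1": 1.0, "CCNA2": 1.5}, 1),
    ({"CCNB1": 0.5, "CDK1": 0.2, "CCNA2": 0.3}, 0),
    ({"CCNB1": 1.0, "CDK1": 1.0, "CCNA2": 0.5}, 1),
    ({"CCNB1": 1.2, "CDK1": 0.8, "CCNA2": 0.9}, 0),
]

scores = [ccrs(expr) for expr, _ in samples]
labels = [label for _, label in samples]
toy_auc = auc(scores, labels)
print(f"AUC on toy data: {toy_auc:.2f}")
```

Because the AUC depends only on how cases are ranked, the same value results whether the raw score or the logistic probability is used.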

Performance metrics of the multi-omics machine learning framework.

| Validation Level | Source/Cohort (n) | Biological/Clinical Target | Performance Metric | Statistical Result |
|---|---|---|---|---|
| In silico (Training) | GEO Meta-cohort | Metastasis Prediction | AUC | 0.879 |
| In silico (Test) | TCGA-LUAD/LUSC | Metastasis Prediction | C-Index | 0.740 |
| Proteomics | CPTAC (Proteome) | Clinical Stage Correlation | Spearman's r | Positive (P < 0.05) |
| Single-cell | scRNA-seq (n=5) | EMT Pseudotime Trajectory | Wald Test | P < 0.001 |
| Experimental | NSCLC Cell Lines | Proliferation & Invasion | mRNA Fold-change | Significant (vs Normal) |
| Clinical Ex vivo | cDNA Microarray (n=12) | Real-world Staging | Accuracy | 75.0% |
| Key Driver 1 | Clinical Tissue (n=15) | CCNA2 Expression | Tumor vs Normal | > 10-fold (P < 0.01) |
| Key Driver 2 | Clinical Tissue (n=15) | CCNB1 Expression | Tumor vs Normal | > 10-fold (P < 0.01) |
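
For readers unfamiliar with how tumor-versus-normal fold changes are commonly derived from RT-qPCR data, the sketch below applies the widely used 2^(-ΔΔCt) calculation. This is an assumption for illustration: the abstract does not state which quantification method was used, and the Ct values and the reference-gene normalization shown here are invented.

```python
def fold_change(ct_target_tumor, ct_ref_tumor, ct_target_normal, ct_ref_normal):
    """Relative expression (tumor vs normal) by the 2^(-ddCt) calculation.

    Each target Ct is first normalized to a reference gene measured in the
    same sample, then the tumor value is compared with the normal value.
    Assumes roughly 100% amplification efficiency for both assays.
    """
    delta_tumor = ct_target_tumor - ct_ref_tumor
    delta_normal = ct_target_normal - ct_ref_normal
    delta_delta = delta_tumor - delta_normal
    return 2.0 ** (-delta_delta)


# Made-up Ct values: the target amplifies 4 cycles earlier in tumor tissue
# after normalization, which corresponds to a 2^4 = 16-fold increase.
fc = fold_change(
    ct_target_tumor=22.0,
    ct_ref_tumor=18.0,
    ct_target_normal=26.0,
    ct_ref_normal=18.0,
)
print(f"Fold change (tumor vs normal): {fc:.1f}")
```

Under this calculation, a greater than 10-fold increase corresponds to a normalized Ct difference of more than about 3.3 cycles between tumor and normal tissue.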
