Cross-Cohort Validation and Transfer Learning of Mesenchymal Stem/Stromal Cell-derived Extracellular Vesicle Transcriptomic Signatures Using ArrayExpress and Gene Expression Omnibus Datasets
Gainizhamal Oralbay
ABSTRACT
Background:
Machine-learning studies in transcriptomics often report high internal performance but fail to generalize across independent cohorts because of overfitting, batch effects, and dataset-specific biological variation. Mesenchymal stem/stromal cell-derived extracellular vesicles (MSC-EVs) are promising cell-free mediators of immunomodulation and tissue repair, but reproducible transcriptomic signatures of MSC-EV response remain insufficiently validated across public datasets.
Methods:
Public MSC/EV-related transcriptomic datasets were screened from Gene Expression Omnibus (GEO) and the ArrayExpress collection in BioStudies. GEO GSE237991 was selected as the discovery cohort and ArrayExpress E-MTAB-13966 as an independent validation cohort, based on MSC/EV relevance, human origin, RNA-seq compatibility, and availability of processed expression matrices. Gene identifiers were harmonized, overlapping genes were retained, and expression matrices were log2-transformed and z-score normalized gene-wise. Batch effects were assessed by principal component analysis before and after ComBat correction. Features derived from differentially expressed genes (DEGs) were selected in the GEO cohort and used to train Random Forest, linear Support Vector Machine (SVM), and XGBoost classifiers. Models trained on GEO were externally validated on ArrayExpress. Few-shot transfer learning was then performed by adapting GEO-trained models using a small labeled subset of ArrayExpress samples. Biomarker stability was evaluated by integrating DEG strength, model feature importance, validation-cohort effect size, and cross-cohort direction concordance.
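As a minimal sketch of the harmonization and normalization step described above, the following Python example keeps only the genes shared by two cohorts, applies a log2 transform, and z-scores each gene across samples. It is illustrative only and not the study's pipeline: the function names, the pseudocount of 1, the choice to z-score each cohort separately, and the toy expression values are all assumptions.

```python
import math
from statistics import mean, pstdev

def log2_zscore(values, pseudocount=1.0):
    """Apply log2(x + pseudocount), then z-score one gene across samples.

    The pseudocount is an assumption made here to avoid log2(0).
    """
    logged = [math.log2(v + pseudocount) for v in values]
    mu = mean(logged)
    sd = pstdev(logged)  # population standard deviation
    if sd == 0:
        # A gene with no variance carries no signal; map it to zeros.
        return [0.0 for _ in logged]
    return [(x - mu) / sd for x in logged]

def harmonize_and_normalize(discovery, validation):
    """Retain overlapping genes and normalize each cohort gene-wise.

    Both inputs map a gene identifier to a list of per-sample values.
    Normalizing each cohort separately is an assumption of this sketch.
    """
    common = sorted(set(discovery) & set(validation))
    disc = {g: log2_zscore(discovery[g]) for g in common}
    val = {g: log2_zscore(validation[g]) for g in common}
    return common, disc, val

# Toy values for illustration only (not data from either cohort).
discovery = {
    "NAMPT": [3.0, 7.0, 15.0],
    "SIRT2": [1.0, 1.0, 1.0],
    "GENE_A": [0.0, 1.0, 3.0],   # hypothetical gene, discovery only
}
validation = {
    "NAMPT": [1.0, 3.0],
    "SIRT2": [0.0, 3.0],
    "GENE_B": [5.0, 6.0],        # hypothetical gene, validation only
}

common, disc, val = harmonize_and_normalize(discovery, validation)
print(common)
print(disc["NAMPT"])
print(val["NAMPT"])
```

Restricting both matrices to the shared gene set before any feature selection ensures that every feature chosen in the discovery cohort can also be evaluated in the validation cohort.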
Results:
Harmonization retained 17,049 common genes across 18 samples. GEO-based feature selection identified 29 DEG-derived machine learning (ML) features. In internal GEO leave-one-out cross-validation, Random Forest achieved the highest area under the receiver operating characteristic curve (ROC-AUC; 0.944). In independent ArrayExpress validation, however, the linear SVM generalized best, with an accuracy of 0.833 and a ROC-AUC of 0.889, whereas Random Forest and XGBoost showed weaker external performance. Few-shot transfer learning improved mean holdout ROC-AUC across models, reaching 1.000 for Random Forest and SVM and 0.778 for XGBoost in an exploratory adaptation analysis. Biomarker stability analysis identified a robust 8-gene panel: NFKBIZ, RAB9B, NAMPT, TNIP1, GRK4, C1QTNF1, MMP10, and SIRT2. Functional enrichment of the broader DEG/ML feature set highlighted inflammatory and immunomodulatory pathways, including lipopolysaccharide response, interleukin-17 signaling, NF-κB signaling, rheumatoid arthritis, chemokine receptor binding, and matrix metalloproteinase pathways. The robust biomarker panel was enriched for NAD and nicotinamide metabolism, driven mainly by NAMPT and SIRT2.
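The accuracy and ROC-AUC values reported above can be read more easily with the metric definitions in hand. The sketch below computes ROC-AUC in its pairwise (Mann-Whitney) form, as the fraction of positive/negative sample pairs that a classifier score ranks correctly, alongside accuracy at a fixed threshold. The function names, the threshold of 0, and the toy labels and scores are assumptions for illustration and are not taken from the study.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.0):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s > threshold else 0 for s in scores]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)

# Toy decision scores for illustration only (not study data).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, -0.2]

toy_auc = roc_auc(toy_labels, toy_scores)
toy_acc = accuracy(toy_labels, toy_scores)
print(toy_auc, toy_acc)
```

Because the pairwise form divides by the number of positive/negative pairs, ROC-AUC on a small cohort can only take values that are multiples of 1 / (n_pos × n_neg), so a single re-ranked sample shifts the estimate by a visible step.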
Conclusion:
Cross-cohort validation revealed that the model with the best internal discovery performance was not the most externally generalizable, underscoring the need for independent validation in omics ML. Linear SVM showed the strongest direct cross-cohort performance, while few-shot transfer learning improved target-cohort adaptation. The identified robust biomarker panel may represent reproducible MSC-EV-associated inflammatory and metabolic-stress response biology, although larger independent cohorts and experimental validation are required.