DOI: 10.4103/wkrj.wkrj_41_26 ISSN: 3117-9789

Machine Learning Identification of Proteomic Signatures in MSC-Derived EVs Using PRIDE Data

Yermekbayeva Kalzhan

ABSTRACT

Background:

Mesenchymal stem cell-derived extracellular vesicles (MSC-EVs) are promising cell-free therapeutic candidates in regenerative medicine because their activity is mediated by bioactive cargo, including proteins. Compared with transcriptomics alone, proteomics provides a closer representation of functional molecular effectors. This study aimed to identify candidate proteomic signatures of MSC-EVs using public PRIDE data and an exploratory machine learning (ML) framework.

Methods:

Label-free quantitative proteomics data were obtained from the PRIDE dataset PXD020948, which contains extracellular vesicles (EVs) derived from adipose tissue, bone marrow, and umbilical cord mesenchymal stem cells. Protein intensity data were extracted from the proteinGroups file, cleaned by removing contaminants and reverse identifications, and processed using missing-value handling, log2 transformation, and z-score scaling. Differential protein expression analysis was performed using a t-test/Welch-type approach with Benjamini–Hochberg correction. Random Forest, Support Vector Machine (SVM), and Extreme Gradient Boosting were applied for exploratory classification and feature prioritization. Feature selection was performed using Random Forest importance, L1-regularized logistic regression, and recursive feature elimination. STRING, Gene Ontology, KEGG, and Reactome analyses were used for functional interpretation. In accordance with the optional high-impact step in the assignment, a limited GEO transcriptomic comparison was performed by gene-symbol overlap only and was not treated as external validation.
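
The transformation and multiple-testing steps named above can be illustrated with a minimal sketch. It is not the study's code: the function names are hypothetical, the choice to treat zero or absent intensities as missing is an assumption, and the abstract does not state whether scaling was applied per protein or per sample. Only the Python standard library is used.

```python
import math


def log2_transform(intensities):
    """log2-transform positive intensities.

    Assumption: zero or absent intensities are treated as missing (None).
    """
    return [math.log2(x) if x is not None and x > 0 else None
            for x in intensities]


def zscore(values):
    """Scale observed values to mean 0 and sample SD 1; None passes through.

    Assumes at least two observed values.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    var = sum((v - mean) ** 2 for v in observed) / (len(observed) - 1)
    sd = math.sqrt(var)
    if sd == 0:
        # Constant feature: every observed value maps to 0.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mean) / sd for v in values]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, `benjamini_hochberg([0.01, 0.04, 0.03, 0.005])` gives adjusted values of approximately 0.02, 0.04, 0.04, and 0.02; these inputs are made up for illustration and are not taken from the dataset.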

Results:

After preprocessing, 1014 proteins across 9 samples were retained for analysis. Differential expression analysis identified 101 proteins meeting the predefined exploratory thresholds between adipose- and bone marrow-derived EVs (adjusted P < 0.05 and |log2 FC| > 1). The ML models showed perfect apparent/internal separation, including SVM area under the curve = 1.00; however, these metrics were interpreted as nonconfirmatory because of the high-dimensional small-sample design. Feature prioritization identified a core candidate signature consisting of QSOX1, SOD1, and SULF1, together with an extended 15-protein panel including COL3A1, COL4A1, COL6A1, COL15A1, ABI3BP, CHI3L1, CYR61/CCN1, ENPP2, and EPB41L3. Functional enrichment showed significant protein–protein interaction enrichment (P = 1.4 × 10^−7) and associations with extracellular matrix organization, collagen remodeling, tissue development, and PDGF-related signaling. The optional GEO comparison identified COL6A1 as the only overlapping transcriptomic/proteomic candidate, which was interpreted as limited supportive convergence rather than validation.
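
The exploratory thresholds stated above (adjusted P < 0.05 and |log2 FC| > 1) amount to a simple filter over per-protein results. The sketch below shows one way to express it; the function names, the tuple layout, and the placeholder protein labels and values are all hypothetical and are not taken from the study.

```python
def passes_thresholds(adj_p, log2fc, alpha=0.05, min_abs_log2fc=1.0):
    """True when both cutoffs are met, using strict inequalities."""
    return adj_p < alpha and abs(log2fc) > min_abs_log2fc


def select_differential(results, alpha=0.05, min_abs_log2fc=1.0):
    """Keep names of proteins meeting both thresholds.

    `results` is an iterable of (name, adjusted_p, log2_fold_change) tuples.
    """
    return [name for name, adj_p, log2fc in results
            if passes_thresholds(adj_p, log2fc, alpha, min_abs_log2fc)]


# Made-up illustration values, not study data.
example = [
    ("PROT_A", 0.001, 2.3),    # passes both cutoffs
    ("PROT_B", 0.001, 0.4),    # significant but small effect
    ("PROT_C", 0.200, -3.0),   # large effect but not significant
    ("PROT_D", 0.010, -1.5),   # passes both cutoffs (lower in first group)
]
selected = select_differential(example)
```

With these placeholder inputs, `selected` contains `PROT_A` and `PROT_D`. Requiring both cutoffs keeps proteins that are statistically supported and also show at least a twofold change in either direction.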

Conclusions:

This study provides a proteomics-based, ML-supported workflow for identifying candidate MSC-EV protein signatures from public PRIDE data. The identified proteins are hypothesis-generating candidates rather than validated clinical biomarkers. The optional GEO/multiomics step supported COL6A1 as a limited cross-layer convergence signal, but this result should not be interpreted as confirmatory evidence because only one overlapping marker was detected and no joint cross-platform model was fitted.
