Machine Learning Algorithm for the Detection of Tumor Microsatellite Instability Based on Multiomics Biomarkers
Kyle C. Strickland, Zachary D. Wallen, Sarabjot Pabla, Heidi C. Ko, Rebecca A. Previs, Michelle F. Green, Stephanie Hastings, Alicia Dillard, Pratheesh Sathyan, Kamal S. Saini, Taylor J. Jensen, Brian J. Caveney, Marcia Eisenberg, Shakti Ramkissoon, Eric A. Severson
PURPOSE
Accurate classification of microsatellite instability (MSI) in advanced cancers is critical for identifying patients who may benefit from immune checkpoint inhibitors. However, variability in MSI detection workflows can lead to missed MSI-high cases, indicating a need for complementary screening approaches. Using next-generation sequencing (NGS) data from colorectal tumors, we developed a machine learning (ML) model to predict MSI status from immune-related gene expression profiles together with pathogenic single-nucleotide variants (SNVs) and copy-number variants (CNVs).
MATERIALS AND METHODS
We analyzed NGS data from 2,756 patients with colorectal cancer (CRC), including DNA panel results for SNVs and CNVs, RNA sequencing of immune-related genes, and tumor mutation burden (TMB). ML algorithms were trained on 70% of the CRC cohort using TMB and features selected by the Boruta algorithm. Trained models were tested on the remainder of the CRC cohort and on The Cancer Genome Atlas (TCGA) colorectal (COAD) and rectal (READ) adenocarcinoma data sets. To assess translatability to other cancer types, the models were also tested on uterine and gastric cancer cases.
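The workflow described above (a 70/30 split, Boruta-style feature selection, then a tree classifier) can be sketched roughly as follows. This is a minimal illustration, not the study's pipeline: the data are synthetic, the helper shadow_feature_select is a hypothetical and simplified stand-in for the Boruta algorithm (it omits Boruta's iterative statistical test), and the use of scikit-learn, the tree depth, and all other parameter values are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): 400 tumors, 20 numeric features.
# Column 0 plays the role of TMB; the others mimic expression/variant features.
n_samples, n_features = 400, 20
X = rng.normal(size=(n_samples, n_features))
# The label depends only on columns 0-2, so the remaining columns are noise.
noise = rng.normal(scale=0.5, size=n_samples)
y = (X[:, 0] + X[:, 1] - X[:, 2] + noise > 1.0).astype(int)

# 70% training / 30% held-out test, stratified by the (synthetic) MSI label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)


def shadow_feature_select(X, y, n_rounds=10, seed=0):
    """Simplified Boruta-style selection (assumption, not the real Boruta).

    Each round appends a shuffled "shadow" copy of every feature, fits a
    random forest, and counts a hit for each real feature whose importance
    exceeds the best shadow importance. Features that score a hit in more
    than half of the rounds are kept.
    """
    local_rng = np.random.default_rng(seed)
    n_real = X.shape[1]
    hits = np.zeros(n_real, dtype=int)
    for r in range(n_rounds):
        # Permute each column independently to break its link with y.
        shadows = np.column_stack(
            [local_rng.permutation(X[:, j]) for j in range(n_real)]
        )
        forest = RandomForestClassifier(n_estimators=100, random_state=seed + r)
        forest.fit(np.hstack([X, shadows]), y)
        importances = forest.feature_importances_
        hits += importances[:n_real] > importances[n_real:].max()
    return np.flatnonzero(hits > n_rounds / 2)


# Select features on the training split only, to avoid leaking test data.
selected = shadow_feature_select(X_train, y_train)
# Always include the TMB stand-in (column 0) alongside the selected features.
cols = np.union1d([0], selected)

# scikit-learn's DecisionTreeClassifier is a CART-style tree learner.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train[:, cols], y_train)
test_accuracy = tree.score(X_test[:, cols], y_test)
print(f"features kept: {cols.tolist()}, held-out accuracy: {test_accuracy:.2f}")
```

Running feature selection on the training split alone keeps the held-out 30% untouched until the final evaluation, which is the usual reason for splitting before selecting.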
RESULTS
Feature selection identified 107 features for model training, including SNVs and CNVs. The classification and regression tree (CART) model, which had the highest mean accuracy, precision, and recall, showed strong performance across the CRC, TCGA COAD/READ, uterine, and gastric cancer cohorts, ranging from 78% sensitivity in uterine cancer to 99%-100% specificity and negative predictive value in CRC. Of the 53 indeterminate CRC and uterine cases, 15% were classified as likely MSI-high. Of these, 75% had mismatch repair immunohistochemistry results available, with 83% showing MLH1 and PMS2 loss.
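For readers less familiar with the metrics reported above, the following sketch shows how sensitivity, specificity, and negative predictive value are derived from a confusion matrix. The function name diagnostic_metrics and the counts are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic metrics from confusion-matrix counts.

    tp: MSI-high tumors predicted MSI-high
    fp: microsatellite-stable tumors predicted MSI-high
    tn: microsatellite-stable tumors predicted stable
    fn: MSI-high tumors predicted stable
    """
    return {
        "sensitivity": tp / (tp + fn),  # recall among true MSI-high cases
        "specificity": tn / (tn + fp),  # recall among true stable cases
        "ppv": tp / (tp + fp),          # precision of MSI-high calls
        "npv": tn / (tn + fn),          # reliability of stable calls
    }


# Illustrative counts only (not study data).
metrics = diagnostic_metrics(tp=40, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A high negative predictive value means that a tumor the model calls microsatellite-stable is rarely MSI-high, which is the property that matters most for a complementary screening step.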
CONCLUSION
Our ML approach accurately predicted MSI status in colorectal and uterine cancers using multiomics data derived from NGS, without relying on direct microsatellite sequencing. The ability to identify MSI-high tumors among indeterminate cases demonstrates the potential to improve diagnostic precision and ensure timely access to immunotherapy for patients with MSI-high disease.