DOI: 10.3390/biomedicines14071438 ISSN: 2227-9059

Unsupervised Deep Representation Learning and Probabilistic Clustering for the Systems-Level Discovery of Germline Mutation Signatures in Pediatric Cancers

Fahimeh Palizban, Michael E. March, Xiang Wang, James Snyder, Fengxiang Wang, Frank Mentch, Yeshwanth Mahesh, Alexandria Thomas, Deborah J. Watson, Huiqi Qu, John Connolly, Amir Hossein Saeidian, Hassan Vahidnezhad, Joseph Glessner, Hakon Hakonarson

Background/Aims: While pathogenic germline variants play a critical role in pediatric cancer susceptibility, traditional clinical genetics primarily focuses on single-gene interpretations. Transitioning to a systems-level analysis of inherited variation can uncover shared biological vulnerabilities, informing genetic counseling, surveillance, and targeted therapeutics. This study aims to implement an unsupervised machine learning framework to identify and characterize Germline Mutation Signatures (GMS) across diverse pediatric malignancies, elucidating latent genomic patterns that reveal shared oncogenic mechanisms.
Methods: We analyzed germline whole-exome and whole-genome sequencing (WES/WGS) data from a retrospective cohort of 420 pediatric cancer patients and matched non-cancer controls. Variants were deeply annotated to capture multi-dimensional features, including predicted pathogenicity, splice-site disruption, regulatory impact, population frequency, and sequence context. To enable robust modeling, we integrated an augmented feature set encompassing evolutionary constraint, loss-of-function intolerance, and compositionally normalized substitution spectra. These high-dimensional annotations were processed using a deep autoencoder for non-linear representation learning, followed by Gaussian Mixture Modeling (GMM) of the latent space.
Results: The framework delineated 13 signatures (GMS1–GMS13), yielding an optimal Davies–Bouldin index of 1.051. These signatures map to fundamental biological processes, including DNA repair deficiencies, transcription-coupled damage, replication stress, and aberrant RNA regulation. Crucially, these GMSs transcend traditional tissue-of-origin classifications, manifesting across multiple distinct cancer types. This observation indicates convergent germline etiologies and suggests potential shared susceptibilities to pathway-directed therapies.
Conclusions: The discovery of these cross-cancer signatures provides a scalable, biologically interpretable framework for decoding inherited pediatric cancer risk. While the therapeutic mapping networks identified are currently exploratory and serve as a hypothesis-generating foundation, this deep learning-driven paradigm establishes a robust basis for stratified precision medicine. Pending prospective clinical validation, this approach holds significant translational potential to move beyond single-gene paradigms toward unified, systems-level precision oncology strategies.
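
For readers unfamiliar with the cluster-quality metric reported in the Results, the following is a minimal sketch of the standard Davies–Bouldin index, computed on hard cluster assignments such as those obtained by taking the most probable GMM component for each point. It is an illustration only, not the authors' pipeline: the function name `davies_bouldin`, the toy two-dimensional "latent" coordinates, and the label vectors are all hypothetical, and the abstract does not state how the index was computed or which candidate cluster counts were compared. Lower values indicate more compact, better separated clusters.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index for hard cluster labels (lower is better).

    points: sequence of equal-length coordinate tuples (e.g. latent vectors).
    labels: one cluster label per point.
    Illustrative helper, not taken from the paper.
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid of each cluster: per-dimension mean of its points.
    centroids = {
        label: tuple(sum(dim) / len(pts) for dim in zip(*pts))
        for label, pts in clusters.items()
    }
    # Scatter of each cluster: mean Euclidean distance to its centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in pts) / len(pts)
        for label, pts in clusters.items()
    }

    # For each cluster, take its worst (largest) similarity ratio to any
    # other cluster, then average over clusters. Assumes distinct centroids.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Hypothetical toy data: two tight groups far apart in a 2-D latent space.
latent = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_labels = [0, 0, 1, 1]  # groups nearby points together
bad_labels = [0, 1, 0, 1]   # mixes the two groups

print(davies_bouldin(latent, good_labels))  # compact, well-separated clusters
print(davies_bouldin(latent, bad_labels))   # diffuse, overlapping clusters
```

On this toy data the well-separated assignment scores far lower than the mixed one, which is why the index can be used to compare candidate numbers of mixture components. With scikit-learn available, `sklearn.metrics.davies_bouldin_score` computes the same quantity.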
