MSeqDR PMD-VR: An Expert-Curated Virtual Registry of 11,000 Mitochondrial Disease Cases Established Through Literature Mining and Generative AI Augmentation
Lishuang Shen, Marie T. Lott, Elizabeth M. McCormick, Colleen C. Muraresku, Kierstin Keller, Douglas C. Wallace, Zarazuela Zolkipli-Cunningham, Shamima Rahman, Marni J. Falk, Xiaowu Gai

Background/Objectives: Patient registries are essential for rare disease research, yet the extensive genetic and phenotypic heterogeneity of primary mitochondrial diseases (PMDs) makes traditional registry development slow and resource-intensive. We established the MSeqDR PMD virtual registry (PMD-VR) to address this gap through systematic literature mining and semi-automated data harmonization.

Methods: The PMD-VR captures, standardizes, and harmonizes published case-level PMD data using a semi-automated curation pipeline. A data transformation framework maps heterogeneous raw data terms to standardized common data elements (CDEs). A generative AI (GenAI) platform leveraging large language models (LLMs), augmented by Human Phenotype Ontology (HPO) and external biomedical knowledge sources, accelerates data transformation and generates simulated clinical reports.

Results: Currently, PMD-VR contains approximately 11,000 de-identified literature-derived cases, including over 2300 Leigh syndrome spectrum (LSS), 278 MELAS, and 300 CPEO cases. The pipeline mapped 872 heterogeneous terms to 102 standardized CDEs. Pathogenicity assessments were captured for variants in over 7900 cases, including 3800 with mtDNA pathogenic or likely pathogenic variants. Modes of inheritance were inferred for 5212 cases. PMD-VR has supported ClinGen Mitochondrial Diseases Gene Curation Expert Panel (Mito-GCEP) efforts, providing phenotyped evidence for 440 curated LSS cases across 113 PMD genes.

Conclusions: PMD-VR is among the largest single PMD registries, offering a scalable, web-accessible platform for generating analysis-ready cohorts from the published literature. It represents a rich resource enabling comprehensive PMD characterization with unprecedented breadth of genetic and phenotypic knowledge.
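
Illustrative sketch: The mapping of heterogeneous raw terms to standardized CDEs described in the Methods can be pictured as a lookup over normalized field names. The snippet below is a minimal sketch under stated assumptions, not the PMD-VR implementation: the table `RAW_TO_CDE`, the CDE names, and the functions `normalize` and `harmonize` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical lookup table: normalized raw term -> standardized CDE name.
# These entries are illustrative and are not taken from PMD-VR.
RAW_TO_CDE = {
    "age at onset": "age_of_onset",
    "onset age": "age_of_onset",
    "sex": "sex",
    "gender": "sex",
    "mtdna variant": "variant_mtdna",
}


def normalize(term):
    """Lowercase, trim, and collapse runs of whitespace or underscores."""
    return re.sub(r"[\s_]+", " ", term.strip().lower())


def harmonize(record):
    """Map one raw case record onto CDE names.

    Returns the harmonized record and the list of raw keys that had no
    mapping, so that unmapped terms can be routed to a curator.
    """
    harmonized = {}
    unmapped = []
    for raw_key, value in record.items():
        cde = RAW_TO_CDE.get(normalize(raw_key))
        if cde is None:
            unmapped.append(raw_key)
        else:
            harmonized[cde] = value
    return harmonized, unmapped


# Example: two raw fields map to CDEs, one is left for manual review.
mapped, todo = harmonize({"Onset_Age": 3, " Gender ": "F", "Lactate": "high"})
```

Returning the unmapped keys separately reflects the semi-automated nature of the pipeline: terms without a mapping are flagged for review rather than silently dropped.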