Machine Learning-Directed Discovery and Statistical Validation of Post-COVID-19 Condition Sequelae Using Military Health System Data
Jed Shakarji, Apryl Susi, Zella Berill, Remle Scott, Dominic Nathan, Cade M. Nylund
Background: Post-COVID-19 conditions (PCCs) present a significant public health challenge because they span a wide range of new or persistent symptoms that vary across patients. The complex, multi-systemic nature of PCCs makes them difficult to differentiate from medical conditions unrelated to COVID-19. While the Military Health System Data Repository (MDR) provides a robust supply of population-level encounter data, its high-dimensional structure poses challenges for knowledge discovery and outcomes research.
Objectives: The primary aim of this study was to identify novel manifestations of PCCs among active-duty service members and to model the probabilistic relationships between PCC-related diagnoses. We propose a machine learning workflow as an effective tool for discovering candidate PCCs in large datasets and statistically validating them.
Methods: We conducted a retrospective cohort study using MDR records from July 2018 to June 2023. From an initial pool of 311,367 eligible Active-Duty Tricare beneficiaries, we isolated 101,789 COVID-19 infections and matched them 1:1 with uninfected controls (N = 203,578 total) based on age, sex, and propensity for COVID-19. Encounter data were mapped to 392 clinical categories using the Healthcare Cost and Utilization Project (HCUP) Clinical Classifications Software Refined (CCSR). Candidate PCC categories were isolated using a cross-validated lasso regression model optimized with a Tree-structured Parzen Estimator algorithm. A consensus Bayesian network structure was fitted to model potential probabilistic dependencies between the identified PCCs and prior COVID-19 diagnosis. Finally, conditional Cox proportional hazards models were used to statistically validate the selected novel conditions in larger cohorts, drawn from the same initial eligible pool by matching cases 1:2 with controls.
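The 1:1 matching step described in the Methods can be illustrated with a minimal sketch. The abstract does not state which matching algorithm was used, so the greedy nearest-neighbor approach, the caliper of 0.05, the record keys (`id`, `age`, `sex`, `ps`), and the toy records below are all assumptions made purely for illustration, not the authors' implementation.

```python
from collections import defaultdict


def match_one_to_one(cases, controls, caliper=0.05):
    """Greedy 1:1 matching without replacement (illustrative assumption).

    Matches exactly on (age, sex) and then picks the unused control whose
    propensity score ("ps") is closest to the case, if within the caliper.
    Returns a list of (case_id, control_id) pairs.
    """
    # Group available controls by the exact-match stratum.
    pool = defaultdict(list)
    for control in controls:
        pool[(control["age"], control["sex"])].append(control)

    pairs = []
    for case in sorted(cases, key=lambda record: record["id"]):
        candidates = pool[(case["age"], case["sex"])]
        if not candidates:
            continue  # no control left in this stratum
        best = min(candidates, key=lambda c: abs(c["ps"] - case["ps"]))
        if abs(best["ps"] - case["ps"]) <= caliper:
            pairs.append((case["id"], best["id"]))
            candidates.remove(best)  # each control is used at most once
    return pairs


# Hypothetical toy records, not study data.
cases = [
    {"id": "case1", "age": 24, "sex": "F", "ps": 0.31},
    {"id": "case2", "age": 24, "sex": "F", "ps": 0.35},
    {"id": "case3", "age": 40, "sex": "M", "ps": 0.60},
]
controls = [
    {"id": "ctl1", "age": 24, "sex": "F", "ps": 0.30},
    {"id": "ctl2", "age": 24, "sex": "F", "ps": 0.36},
    {"id": "ctl3", "age": 40, "sex": "M", "ps": 0.90},
]

pairs = match_one_to_one(cases, controls)
print(pairs)
```

In this toy example the third case stays unmatched because its only same-stratum control lies outside the caliper. Greedy matching is order-dependent; optimal matching would be a reasonable alternative at the cost of more computation.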
Results: Feature selection reduced the diagnosis set by 97.96%, isolating 8 clinical categories from the initial 392. The model confirmed known PCCs, such as respiratory symptoms and malaise, and identified two potentially novel candidate PCCs: tinnitus and personality disorders. Survival analysis validated the selection of tinnitus, showing a significant association with COVID-19 (HR: 1.17, 95% CI: 1.12–1.22). No significant association was found between COVID-19 infection and personality disorders (HR: 1.11, 95% CI: 0.97–1.26).
Conclusions: This study demonstrates an effective analytical pathway for addressing the limitations of analyzing complex, high-dimensional healthcare billing data. The methodology successfully generated testable hypotheses, identifying tinnitus as a relevant sequela, and is generalizable to future research involving unknown health outcomes related to prior infection.
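The figures reported in the Results can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the feature reduction percentage and derives an approximate z-statistic and two-sided p-value from each hazard ratio and its 95% confidence interval. It assumes the intervals are Wald-type and symmetric on the log scale, which the abstract does not state; the helper name `wald_z_from_ci` is hypothetical.

```python
import math


def wald_z_from_ci(hr, lower, upper, z_crit=1.959964):
    """Approximate z and two-sided p from a hazard ratio and its 95% CI.

    Assumes a Wald interval on the log scale: log(HR) +/- z_crit * SE.
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p


# 8 of 392 CCSR categories retained after feature selection.
reduction_pct = round((1 - 8 / 392) * 100, 2)

z_tinnitus, p_tinnitus = wald_z_from_ci(1.17, 1.12, 1.22)
z_personality, p_personality = wald_z_from_ci(1.11, 0.97, 1.26)

print(f"Feature reduction: {reduction_pct}%")
print(f"Tinnitus: z = {z_tinnitus:.2f}, p = {p_tinnitus:.2g}")
print(f"Personality disorders: z = {z_personality:.2f}, p = {p_personality:.2g}")
```

Under these assumptions the tinnitus interval excludes 1 by a wide margin, while the personality disorders interval spans 1, consistent with the reported conclusions. Because the published estimates are rounded, the derived values are only approximate.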