Outcome and Exposure Polygenic Risk Scores Can Help Reduce Information Bias and Selection Bias in Regression Estimates From Biobank Data
Maxwell Salvatore, Ritoban Kundu, Jiacong Du, Christopher R. Friese, Alison M. Mondul, David Hanauer, Haidong Lu, Celeste Leigh Pearce, Bhramar Mukherjee
ABSTRACT
Electronic health records (EHRs) are valuable sources of data but are susceptible to biases from missing data and sample selection, often due to clinically informative visiting processes and non‐probability sampling. This research explores whether genetic data, typically measured on nearly all participants in EHR‐linked biobanks, can be used to mitigate these biases. Simulations were performed under conditions of missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) within random and biased sampling frameworks. We evaluated polygenic risk score (PRS)‐informed imputation, PRS‐uninformed imputation, and complete case analysis across these scenarios in terms of bias, coverage, and root mean square error (RMSE) of the regression coefficient estimates. A real‐world example using data from the Michigan Genomics Initiative (MGI, n = 68,063) compared the effectiveness of these methods against national benchmark estimates. PRS‐informed imputation generally reduced bias and RMSE and improved coverage, particularly under MAR conditions in random samples. In analyses of biased samples of n = 10,000 with MAR exposure‐only missingness, weighted, PRS‐informed imputation analyses showed substantially lower percent bias (0.6%) and closer‐to‐nominal coverage (89.1%) than weighted, complete case analyses (9.4% and 74.3%, respectively). In the MGI example, estimates from PRS‐informed approaches aligned more closely with national benchmarks than those from unweighted complete case analysis. Leveraging genetic data together with sample weighting can help reduce bias in outcome‐exposure association estimates derived from biobank data. When PRS are available, researchers should consider using them for imputation, along with survey methods for sample weighting, when estimating outcome‐exposure association coefficients in a target population of interest, recognizing that benefits may vary by outcome and data structure.
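As a rough illustration of the comparison described above, the following Python sketch simulates a single dataset with MAR exposure-only missingness and contrasts complete case analysis with imputation that uses the outcome and an exposure PRS as predictors. It is a toy example under stated assumptions, not the authors' simulation design: the function names (`ols_slope`, `simulate`), the effect sizes, the missingness model, and the use of one stochastic regression imputation (rather than multiple imputation) are all hypothetical choices made for illustration, and sample weighting is omitted.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from an ordinary least squares fit of y on x with an intercept."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def simulate(n=20000, beta=0.5, seed=0):
    """Return (complete-case slope, PRS-informed imputed slope) for one dataset.

    All parameter values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Exposure PRS, exposure, and outcome (assumed linear, Gaussian model).
    prs = rng.normal(size=n)
    x = 0.5 * prs + rng.normal(size=n)
    y = beta * x + rng.normal(size=n)

    # MAR: the exposure is more likely to be missing when the (fully
    # observed) outcome is high.
    p_missing = 1.0 / (1.0 + np.exp(-1.5 * y))
    missing = rng.random(n) < p_missing
    obs = ~missing

    # Complete case analysis: drop rows with a missing exposure.
    beta_cc = ols_slope(x[obs], y[obs])

    # PRS-informed imputation: model the exposure given outcome and PRS
    # among complete cases, then draw imputed values with residual noise.
    design = np.column_stack([np.ones(n), y, prs])
    coef, *_ = np.linalg.lstsq(design[obs], x[obs], rcond=None)
    resid_sd = np.std(x[obs] - design[obs] @ coef, ddof=3)
    x_imp = x.copy()
    x_imp[missing] = design[missing] @ coef + rng.normal(
        scale=resid_sd, size=int(missing.sum())
    )
    beta_imp = ols_slope(x_imp, y)

    return beta_cc, beta_imp


if __name__ == "__main__":
    beta_cc, beta_imp = simulate()
    print(f"true beta = 0.5, complete case = {beta_cc:.3f}, imputed = {beta_imp:.3f}")
```

In this toy setup the missingness depends on the outcome, so discarding incomplete rows selects on the outcome and distorts the complete case slope, whereas the imputation model conditions on the outcome and should recover the slope more closely. A fuller analysis would repeat the simulation many times, use multiple imputation with pooled variance estimates, and evaluate bias, coverage, and RMSE as the abstract describes.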