Artificial Intelligence-based Interpretation of Race and Sex Disparities among U.S. Internal Medicine Faculty: A Comparative Analysis
Stephanie Quon, Katherine Zheng
Abstract
Background:
Large language model (LLM)-based generative artificial intelligence (AI) tools are increasingly used for rapid synthesis and interpretation of academic datasets, including workforce equity metrics. Whether these tools can responsibly interpret race and sex disparities in academic internal medicine without obscuring nuance or reinforcing bias remains unclear.
Objective:
To compare how three widely accessible LLM-based platforms interpret race- and sex-based disparities in internal medicine faculty representation, rank distribution, and leadership patterns, benchmarked against a published human-led analysis.
Methods:
We compared LLM-generated interpretations with the longitudinal retrospective analysis by Xu
Results:
All platforms identified rank-based “pipeline” patterns, showing more diversity at junior levels and less at senior ranks, with senior leadership concentrated among historically advantaged groups. However, the platforms' methods varied in rigor and caution. Reporting on statistical testing and effect sizes was inconsistent, and intersectional analysis was limited. Some outputs made claims that the available variables could not support (e.g., leadership inferences when chair variables were missing), highlighting the risks of overreach in equity-sensitive contexts.
Conclusions:
Generative AI can identify and summarize high-level inequity patterns in internal medicine faculty data, but current platforms lack the interpretive depth, transparency, and input standardization needed for direct benchmarking in equity-focused workforce research, underscoring the need for human oversight and equity-centered analytic frameworks.