DOI: 10.4103/ajim.ajim_12_26 ISSN: 2666-1802

Artificial Intelligence-based Interpretation of Race and Sex Disparities among U.S. Internal Medicine Faculty: A Comparative Analysis

Stephanie Quon, Katherine Zheng

Abstract

Background:

Large language model (LLM)-based generative artificial intelligence (AI) tools are increasingly used for rapid synthesis and interpretation of academic datasets, including workforce equity metrics. Whether these tools can responsibly interpret race and sex disparities in academic internal medicine without obscuring nuance or reinforcing bias remains unclear.

Objective:

To compare how three widely accessible LLM-based platforms interpret race- and sex-based disparities in internal medicine faculty representation, rank distribution, and leadership patterns, benchmarked against a published human-led analysis.

Methods:

We compared LLM-generated interpretations with the longitudinal retrospective analysis by Xu et al., which examined full-time U.S. internal medicine faculty with self-reported sex and race/ethnicity, stratified by academic rank and department chair status. Three LLMs received a single standardized prompt requesting descriptive analyses of representation, advancement, and leadership, including statistical reporting and limitations. Although the same source material (Association of American Medical Colleges Faculty Roster Table 19 workbook) was used, platforms differed in what they could ingest and analyze (spreadsheet vs. extracted subset vs. tabular image), and outputs were therefore evaluated as “first-pass” interpretations based on each platform’s input capabilities. Given this variability, outputs were assessed as generated, using a structured comparative framework that evaluated directional concordance and methodological restraint.

Results:

All platforms identified rank-based “pipeline” patterns, with more diversity at junior ranks, less at senior ranks, and senior leadership concentrated among historically advantaged groups. However, the platforms' analytic approaches varied in rigor and caution. Reporting of statistical testing and effect sizes was inconsistent, and intersectional analysis was limited. Some outputs made claims that the available variables did not support (e.g., leadership inferences when chair variables were missing), highlighting the risk of overreach in equity-sensitive contexts.

Conclusions:

Generative AI can identify and summarize high-level inequity patterns in internal medicine faculty data, but current platforms lack the interpretive depth, transparency, and input standardization needed for direct benchmarking in equity-focused workforce research, underscoring the need for human oversight and equity-centered analytic frameworks.
