From API to Action: A Multi-Model Comparison of OpenAI, Anthropic, Google, and Meta LLMs for Clinical Trial Data Extraction
Richard J. Young, Jorge Fonseca, Brach Poston
(1) Background: Clinical trial data extraction from registries such as ClinicalTrials.gov remains labor-intensive and error-prone, often missing critical details hidden in unstructured protocol descriptions. Large Language Models (LLMs) offer potential to automate this process, yet systematic multi-model comparisons on real clinical trial data remain scarce.
(2) Methods: Four LLMs (OpenAI o4-mini-high, Anthropic Claude-Sonnet-4, Google Gemini 2.5-Pro, and Meta Llama-4-Maverick) extracted brain stimulation parameters from 67 transcranial direct current stimulation (tDCS) trials in Parkinson’s disease via a structured JSON schema. Pairwise inter-model agreement was quantified with Cohen’s Kappa and percentage agreement across binary, categorical, and multi-component task tiers.
(3) Results: Under exact-string matching, agreement was near-perfect for binary classifications (non-invasive classification: 100%; brain stimulation presence: 99.3%, κ = 0.50) and substantial for categorical extractions (primary stimulation type: 96.4%, κ = 0.70), but fell to 48.6% (κ = 0.43) for complex anatomical targets. Numeric parameters revealed model-specific strengths: o4-mini-high and Claude-Sonnet-4 achieved perfect duration agreement (r = 1.000, n = 19) while Llama-4-Maverick diverged substantially (r < 0.12). Validation against an expert gold standard (100% inter-annotator agreement on a 20-trial overlap) confirmed high extraction accuracy across all features (mean 93.7–98.9%). Crucially, the low agreement on anatomical targets proved to be an artifact of exact-string scoring: under the same semantic matching used to measure accuracy, inter-model agreement rose to 97.0%, coinciding with the 95.5% expert accuracy. Inter-model agreement therefore tracks accuracy once both are measured on a common basis.
(4) Conclusions: Exact-string inter-model agreement decreases with task complexity, but this decline largely reflects interchangeable free-text wording rather than reduced accuracy. Evaluated semantically, agreement and expert accuracy are both high and closely aligned. The residual risk is not low accuracy but the rare error shared by all models, which inter-model agreement cannot detect and which overall accuracy can itself mask when one class dominates. These findings inform hybrid human–AI systematic review pipelines in which targeted expert oversight focuses on shared-error and minority-class detection.
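The two agreement statistics named in the Methods can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names, the model names, and the label vectors are all hypothetical, and the data are invented solely to show how a dominant class can produce high percentage agreement alongside a much lower Cohen's Kappa.

```python
from collections import Counter
from itertools import combinations


def percent_agreement(a, b):
    """Share of items on which two raters give the identical label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement; p_e is the agreement expected by chance
    from each rater's marginal label frequencies.
    """
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label: Kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary labels for 67 trials (NOT data from the study).
labels = {
    "model_a": ["yes"] * 65 + ["no", "no"],
    "model_b": ["yes"] * 65 + ["no", "yes"],
    "model_c": ["yes"] * 67,
}

# Pairwise comparison over every pair of models.
for m1, m2 in combinations(labels, 2):
    pa = percent_agreement(labels[m1], labels[m2])
    kappa = cohen_kappa(labels[m1], labels[m2])
    print(f"{m1} vs {m2}: agreement={pa:.3f}, kappa={kappa:.3f}")
```

In this invented example, model_a and model_b disagree on a single trial out of 67, yet Kappa is well below the raw agreement because nearly all labels fall in one class and chance agreement is therefore already very high.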