DOI: 10.1136/bmjment-2026-302626 ISSN: 2755-9734

Leveraging simulation to provide a practical framework for estimating the novel scope of risk of large language models in healthcare

Mark Kalinich, James Luccarelli, John Santa Maria, Frank Moss, John Torous

Background

Large language models (LLMs) are rapidly entering clinical and consumer use, yet their probabilistic outputs have produced a variety of unsafe responses to users. Difficulties in quantifying and mitigating risks posed by LLMs threaten to stall regulatory evaluation and clinical deployment of LLM-based software as a medical device (LLM-SaMD). Practical approaches are needed to extend existing medical-device regulations to LLM-SaMDs.

Objective

To demonstrate how simulation can extend existing medical-device risk management frameworks for addressing LLM-SaMD-specific risks.

Methods

We implement a simulation-based methodology for estimating LLM-SaMD risk. Fourteen open-source models were evaluated on three safety-classification tasks: suicidal-ideation, therapy-request and therapy-like interaction detection. Synthetic datasets were generated by Gemini 2.5 Pro and evaluated by psychiatrists. Model false-negative rates informed estimates of P₁, the likelihood that a hazard progresses to a hazardous situation, and P₂, the likelihood that that situation results in harm.
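
As a minimal illustration of the quantities named above, the sketch below computes a false-negative rate from classifier counts and combines two probabilities into an overall probability of harm. The abstract does not specify how false-negative rates map onto the two probabilities, so the product rule (a convention common in medical-device risk analysis), the function names and the counts are all assumptions made for illustration, not the authors' implementation.

```python
def false_negative_rate(false_negatives: int, true_positives: int) -> float:
    """Fraction of truly positive cases that the classifier missed."""
    positives = false_negatives + true_positives
    if positives == 0:
        raise ValueError("at least one positive case is required")
    return false_negatives / positives


def probability_of_harm(p1: float, p2: float) -> float:
    """Combine P1 (hazard -> hazardous situation) and P2 (situation -> harm).

    Assumption: the two stages are treated as sequential, so the overall
    probability is their product.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p1 * p2


# Hypothetical counts: 3 missed cases out of 100 true positives.
fnr = false_negative_rate(false_negatives=3, true_positives=97)
print(f"false-negative rate: {fnr:.3f}")
```

In such a sketch, a lower false-negative rate on a safety-classification task would feed into a lower estimate for the corresponding stage of the pathway to harm.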

Findings

LLM success at generating synthetic datasets varied by task, with strong performance for neutral and non-therapeutic content but frequent errors in suicidal-ideation and therapy-like interactions. Performance generally improved with model size. Estimated P₁ values ranged from 1.1×10⁻⁸ to 1.6×10⁻⁴ and P₂ from 4.9×10⁻⁵ to 5.1×10⁻³, spanning four orders of magnitude.
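
The width of each reported range can be checked with a base-10 logarithm of the ratio of its endpoints. The short sketch below uses only the endpoint values quoted above; the helper name is hypothetical. On this arithmetic, the range of the first probability covers roughly four orders of magnitude and that of the second roughly two, so the "four orders" figure is read here, as an assumption, as describing the first probability.

```python
import math


def orders_of_magnitude(low: float, high: float) -> float:
    """Number of base-10 orders of magnitude between two positive values."""
    if low <= 0 or high <= 0:
        raise ValueError("both values must be positive")
    return math.log10(high / low)


# Endpoints as reported in the Findings.
p1_span = orders_of_magnitude(1.1e-8, 1.6e-4)
p2_span = orders_of_magnitude(4.9e-5, 5.1e-3)

print(f"P1 range spans about {p1_span:.1f} orders of magnitude")
print(f"P2 range spans about {p2_span:.1f} orders of magnitude")
```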

Conclusions

By linking model failure modes to structured pathways to harm, simulation can extend existing medical-device risk frameworks to help address the probabilistic and context-dependent risks of LLM-SaMDs.

Clinical implications

Simulation-based risk estimation offers a practical way to characterise the risk landscape for specific LLM-SaMD, patient population and clinical context combinations.
