Examiner training and calibration for simulated clinical examinations: A scoping review
Harish Thampy, Niamh Callanan, Arslan Ahmed, Sophia Taromsari
Abstract
Introduction
Examiner training and calibration are widely recommended to improve scoring consistency and defensibility in simulation‐based observed clinical competency assessments (SOCCAs), but the empirical evidence has not been comprehensively explored. This scoping review maps and describes studies evaluating general examiner training and calibration interventions in SOCCAs.
Methods
A scoping review was conducted of peer‐reviewed studies that reported empirical evaluation or descriptive outcomes of examiner training and/or calibration in any SOCCA format (including OSCEs and OSLERs) across health profession disciplines and training stages. Data were charted on intervention characteristics, study design, outcome measures and reported effects. Given study heterogeneity, findings were synthesised descriptively.
Results
Twenty studies met inclusion criteria. Seven described general examiner training, although intervention reporting and evaluation were often insufficient to judge effectiveness. Two broad calibration approaches were identified: (i) general calibration aimed at aligning examiners' internal performance standards and (ii) case‐orientation calibration tailored to the specific stations that examiners subsequently assessed. Most calibration interventions used frame‐of‐reference approaches, commonly involving video‐recorded benchmark performances with facilitated discussion. Effects on scoring outcomes were inconsistent: some studies reported modest improvements in scoring reliability or accuracy, whereas others showed minimal change or increased examiner stringency. Only one study looked beyond scoring metrics to examine examiner behaviours.
Discussion
Despite widespread endorsement, the evidence base for examiner training and calibration in SOCCAs remains limited and inconsistent. Where benefits were observed, they appeared most evident for borderline/uncertain performances, suggesting calibration may be most useful near decision thresholds. Additionally, this review highlights an unresolved tension in the aims of calibration: whether it should promote score convergence, shared reasoning or both. Future work should specify intended mechanisms, address contextual influences on judgement and extend evaluation to impact on assessor behaviour. In the absence of stronger evidence, routine implementation risks being driven by expectation and convention rather than a sufficiently robust empirical rationale.