Trends in the use of adult-specific preference-weighted health-related quality of life instruments in clinical trials over the past 50 years: a protocol for a meta-research study using deep learning-based natural language processing and large languag
Sarun Srikhom, Nancy Devlin, Nhung Nghiem, Sandra Nolte, Vu Vo, An Tran-Duy
Background
Health technology assessment bodies increasingly emphasise the importance of preference-weighted health-related quality of life (HRQoL) evidence. However, such measures are often absent in clinical trial publications. It is not yet clear how frequently clinical trials have incorporated these measures over the past five decades, how the use of preference-weighted HRQoL instruments has evolved over time, and how trends differ across disease areas, countries and global regions. This study aims to (1) assess changes over time in the proportions of clinical trials using each preference-weighted HRQoL instrument in adults, and (2) model secular trends in the adoption of these instruments across disease areas, countries and regions. The study will provide a comprehensive, systematic assessment of the use of preference-weighted HRQoL instruments in clinical trials since 1976 and develop a scalable approach for large-scale evidence synthesis.
Methods
We will identify clinical trials involving humans published in English since 1976 through systematic searches of MEDLINE, Embase, Cochrane Library and Web of Science. We will focus on generic preference-weighted HRQoL instruments for adults, including EQ-5D-3L, EQ-5D-5L, Short Form 6 Dimensions, 12-Item Short Form Health Survey (SF-12), Health Utility Index 2, Health Utility Index 3, Assessment of Quality of Life (AQoL) series (AQoL-4D, AQoL-6D, AQoL-7D, AQoL-8D), Quality of Well-Being Scale (QWB), QWB Self-Administered (QWB-SA), 15D and Patient-Reported Outcomes Measurement Information System (PROMIS) with the Preference Scoring System (PROPr). Screening and data extraction will be automated using either a natural language processing (NLP) pipeline or large language models (LLMs). To determine the more accurate approach, we will benchmark NLP and LLM performance against a manually curated reference dataset of 5000 randomly sampled articles, each reviewed independently by three reviewers. Model performance will be evaluated using classification metrics including accuracy, recall and F1-score. Annual counts and proportions of trials using each instrument will be calculated, stratified by disease area, country and region. Trends will be modelled using basis-splines (B-splines) with 2 or 3 degrees of freedom and Bayesian spline regression to estimate secular changes in both absolute numbers and proportions of instrument use over time.
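As an illustration only, the benchmarking metrics and the annual proportions described above could be computed along the following lines. This is a minimal sketch, not the study's pipeline: the function names `classification_metrics` and `annual_proportions`, the binary labelling (True meaning that an article reports a given instrument) and the record layout of (publication year, label) pairs are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def classification_metrics(reference, predicted):
    """Compare predicted labels with reference labels for one instrument.

    Both arguments are equal-length sequences of booleans, where True is
    assumed to mean "the article reports this instrument".
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    # Tally the four cells of the confusion matrix.
    cells = Counter(zip(reference, predicted))
    tp = cells[(True, True)]
    fp = cells[(False, True)]
    fn = cells[(True, False)]
    tn = cells[(False, False)]
    total = tp + fp + fn + tn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def annual_proportions(records):
    """Return {year: (count, proportion)} of trials using an instrument.

    `records` is an iterable of (publication_year, uses_instrument) pairs;
    this layout is hypothetical and chosen only for the example.
    """
    totals = defaultdict(int)
    users = defaultdict(int)
    for year, uses_instrument in records:
        totals[year] += 1
        if uses_instrument:
            users[year] += 1
    return {
        year: (users[year], users[year] / totals[year])
        for year in sorted(totals)
    }
```

In this sketch, the labels produced by the NLP pipeline and by the LLMs would each be passed to `classification_metrics` together with the reviewers' reference labels, and the better-performing approach would then supply the records for `annual_proportions`. Stratifying by disease area, country or region would amount to grouping the records before calling the second function.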
Ethics and dissemination
This study uses only published literature and does not involve human participants or individual-level data. All results will be reported in aggregate form, with no identifiable information. Formal ethics approval is therefore not required. Findings will be disseminated via peer-reviewed publications and conference presentations, and aggregated data and analysis code will be made publicly available to support transparency and reproducibility.