DOI: 10.4103/picr.picr_275_23 ISSN: 2229-3485

Evaluating large language models for selection of statistical test for research: A pilot study

Himel Mondal, Shaikat Mondal, Prabhat Mittal

Abstract

Background:

In contemporary research, selecting the appropriate statistical test is a critical and often challenging step. The emergence of large language models (LLMs) has offered a promising avenue for automating this process, potentially enhancing the efficiency and accuracy of statistical test selection.

Aim:

This study aimed to assess the capability of freely available LLMs (OpenAI's ChatGPT 3.5, Google Bard, Microsoft Bing Chat, and Perplexity) in recommending suitable statistical tests for research, comparing their recommendations with those made by human experts.

Materials and Methods:

A total of 27 case vignettes representing common research designs were prepared, each with a question asking for the suitable statistical test. The cases were formulated from previously published literature and reviewed by a human expert for accuracy of information. Each LLM was asked the question for every case vignette, and the process was repeated with paraphrased versions of the cases. The LLM recommendations were evaluated against those of human experts for concordance (an exact match with the answer key) and acceptance (not an exact match with the answer key, but still considered suitable).
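For readers who wish to replicate this kind of scoring, the sketch below (not the authors' code; the data structure, field names, and example entries are hypothetical) shows one way to compute concordance and acceptance percentages from expert-judged LLM responses.

```python
# Illustrative sketch: scoring LLM recommendations against an expert answer key.
# All field names and example records are hypothetical.

from dataclasses import dataclass

@dataclass
class Response:
    suggested_test: str      # statistical test recommended by the LLM
    key_test: str            # test in the expert answer key
    expert_accepts: bool     # expert judges the suggestion suitable even if not an exact match

def score(responses: list[Response]) -> tuple[float, float]:
    """Return (concordance %, acceptance %) over all case vignettes."""
    n = len(responses)
    concordant = sum(r.suggested_test.lower() == r.key_test.lower() for r in responses)
    accepted = sum(
        r.suggested_test.lower() == r.key_test.lower() or r.expert_accepts
        for r in responses
    )
    return 100 * concordant / n, 100 * accepted / n

# Hypothetical entries for 3 of the 27 vignettes
demo = [
    Response("Unpaired t-test", "Unpaired t-test", True),
    Response("Mann-Whitney U test", "Unpaired t-test", True),   # not exact, but acceptable
    Response("Chi-square test", "Chi-square test", True),
]
concordance, acceptance = score(demo)
print(f"Concordance: {concordance:.1f}%, Acceptance: {acceptance:.1f}%")
```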

Results:

Among the 27 case vignettes, the statistical tests suggested by ChatGPT 3.5 had 85.19% concordance and 100% acceptance; Google Bard had 77.78% concordance and 96.3% acceptance; Microsoft Bing Chat had 96.3% concordance and 100% acceptance; and Perplexity had 85.19% concordance and 100% acceptance. The intraclass correlation coefficient (average measures) among the responses of the LLMs was 0.728 (95% confidence interval [CI]: 0.51–0.86), P < 0.0001. The test–retest reliability was r = 0.71 (95% CI: 0.44–0.86), P < 0.0001 for ChatGPT; r = −0.22 (95% CI: −0.56 to 0.18), P = 0.26 for Bard; r = −0.06 (95% CI: −0.44 to 0.33), P = 0.73 for Bing; and r = 0.52 (95% CI: 0.16–0.75), P = 0.0059 for Perplexity.
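Statistics of the kind reported above (an average-measures intraclass correlation across raters and Pearson test–retest correlations) can be computed as in the following sketch. This is not the authors' analysis script; the coded ratings are invented for illustration, and it assumes the recommendations have been coded as numeric categories.

```python
# Illustrative sketch with hypothetical data: inter-LLM agreement (ICC, average
# measures) and test-retest reliability (Pearson r) for coded recommendations.

import pandas as pd
import pingouin as pg          # provides intraclass_corr
from scipy.stats import pearsonr

# Hypothetical coded ratings: rows = case vignettes, columns = LLMs
ratings = pd.DataFrame({
    "vignette":   range(1, 7),
    "ChatGPT":    [1, 2, 3, 1, 4, 2],
    "Bard":       [1, 2, 3, 2, 4, 2],
    "Bing":       [1, 2, 3, 1, 4, 2],
    "Perplexity": [1, 2, 2, 1, 4, 2],
})

# Inter-rater agreement: intraclass correlation coefficient
long = ratings.melt(id_vars="vignette", var_name="llm", value_name="code")
icc = pg.intraclass_corr(data=long, targets="vignette", raters="llm", ratings="code")
print(icc[["Type", "ICC", "CI95%"]])   # ICC2k/ICC3k rows are the average-measures estimates

# Test-retest reliability for one LLM: correlate codes from the original and
# the paraphrased vignettes (the retest column here is hypothetical)
retest = [1, 2, 3, 2, 4, 2]
r, p = pearsonr(ratings["ChatGPT"], retest)
print(f"r = {r:.2f}, P = {p:.4f}")
```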

Conclusion:

The LLMs, namely ChatGPT, Google Bard, Microsoft Bing Chat, and Perplexity, all showed >75% concordance in suggesting statistical tests for the research case vignettes, with acceptance of >95% for all. The LLMs had a moderate level of agreement among themselves. While not a complete replacement for human expertise, these models can serve as effective decision support systems, especially in scenarios where rapid test selection is essential.
