Quality evaluation of AI-generated diabetes-related health education texts from different generative models
Xueping Jiao, Xingyu Liu, Fanghong Yan, Shuhan Yang, Yueting Wang, Chenxia Wang, Yunfang Wang, Yuhuan Xie, Yufang Guo, Yuxia Ma, Yanan Zhang
Background
As artificial intelligence technology becomes more widely applied, generative artificial intelligence has become an important tool for obtaining health information, owing to its convenience and flexibility in health education and health promotion. However, the readability and accuracy of such AI-generated materials still need to be evaluated.
Objective
To comprehensively evaluate and compare the quality and readability of health education texts about diabetes generated by different generative artificial intelligence (AI) models.
Methods
We posed a fixed list of ten questions, without modification, to seven generative AI models and exported their responses into predefined forms. Five experts were invited to evaluate the texts on five criteria. Text readability was assessed with a readability index calculated from a readability formula. Kendall’s coefficient of concordance was used to assess inter-rater reliability. A linear mixed model was used to compare the health education texts generated by the different AI models across the five dimensions and readability.
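As an illustration of the inter-rater reliability step, the following is a minimal sketch of Kendall's coefficient of concordance, W = 12S / (m²(n³ − n)), where m is the number of raters, n the number of rated items, and S the sum of squared deviations of the items' rank totals from their mean. The function names and the example scores are hypothetical and are not taken from the study; the sketch assigns tied scores their mean rank but omits the tie correction term, which the study's own analysis may have applied.

```python
from typing import List, Sequence


def average_ranks(scores: Sequence[float]) -> List[float]:
    """Rank scores in ascending order from 1; tied values share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with the one at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(ratings: Sequence[Sequence[float]]) -> float:
    """ratings[r][k] is the score rater r gave to item k.

    Returns Kendall's W without tie correction (an assumption of this sketch).
    """
    m = len(ratings)      # number of raters
    n = len(ratings[0])   # number of rated items
    rank_rows = [average_ranks(row) for row in ratings]
    totals = [sum(row[k] for row in rank_rows) for k in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical scores: 3 raters scoring 4 texts (not data from the study).
example_scores = [
    [4.0, 3.0, 5.0, 2.0],
    [4.5, 3.5, 5.0, 1.0],
    [3.0, 2.0, 4.0, 1.0],
]
w_example = kendalls_w(example_scores)
```

W ranges from 0 (no agreement among raters) to 1 (identical rankings); in the hypothetical example all three raters order the texts identically, so W equals 1.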
Results
Kimi-K1.5 and Doubao attained the highest overall scores in scientific accuracy, whereas iFlytek Spark-V3.5 scored lower than the other models. In terms of practical value and logical clarity, Kimi-K1.5 received the highest scores, while iFlytek Spark-V3.5 scored the lowest. In the dimension of reference basis, Kimi-K1.5 and ERNIE Bot-3.5 received relatively high scores, while iFlytek Spark-V3.5 and Doubao scored lower. In the readability assessment, a higher R-value indicates poorer readability. The health education texts generated by Doubao had the highest R-value, while those generated by iFlytek Spark-V3.5 had the lowest.
Conclusions
In the overall evaluation of diabetes-related health education texts generated by different generative AI models, Kimi-K1.5 performed better across multiple assessment parameters. Notably, among all the models tested, iFlytek Spark-V3.5 showed the best readability.