Comparative Study. Journal of orthopaedic surgery and research. 2024 Sep 18;19(1):574. doi: 10.1186/s13018-024-04996-2

Comparative performance analysis of large language models: ChatGPT-3.5, ChatGPT-4 and Google Gemini in glucocorticoid-induced osteoporosis

Linjian Tong¹, Chaoyang Zhang², Rui Liu¹, Jia Yang¹, Zhiming Sun³

Author affiliations

¹ Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, 300070, China.

² Department of Orthopedics, Tianjin Medical University Baodi Hospital, Tianjin, 301800, China.

³ Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, 300070, China. szhm0618@163.com.

DOI: 10.1186/s13018-024-04996-2 PMID: 39289734

Abstract

Background: Large language models (LLMs) can help physicians improve the quality and effectiveness of health care by making medical information management, patient care, medical research, and clinical decision-making more efficient.

Methods: We collected 34 frequently asked questions about glucocorticoid-induced osteoporosis (GIOP), covering the disease's clinical manifestations, pathogenesis, diagnosis, treatment, prevention, and risk factors. We also generated 25 questions based on the 2022 American College of Rheumatology Guideline for the Prevention and Treatment of Glucocorticoid-Induced Osteoporosis (2022 ACR-GIOP Guideline). Each question was posed to each LLM (ChatGPT-3.5, ChatGPT-4, and Google Gemini), and three senior orthopedic surgeons independently rated each response on a scale of 1 to 4 points. A total score (TS) > 9 indicated a 'good' response, 6 ≤ TS ≤ 9 a 'moderate' response, and TS < 6 a 'poor' response.
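The grading rule in the Methods can be illustrated with a short sketch. The abstract does not say how the total score is computed; the code below assumes TS is the plain sum of the three surgeons' ratings (range 3 to 12), which is consistent with the stated thresholds. The function names `total_score` and `classify` are hypothetical and are not taken from the study.

```python
from typing import Sequence


def total_score(ratings: Sequence[int]) -> int:
    """Return the total score (TS) for one response.

    Assumption: TS is the sum of three independent ratings,
    each an integer from 1 to 4. The abstract does not state this.
    """
    if len(ratings) != 3:
        raise ValueError("expected exactly three ratings")
    if any(r not in (1, 2, 3, 4) for r in ratings):
        raise ValueError("each rating must be an integer from 1 to 4")
    return sum(ratings)


def classify(ts: int) -> str:
    """Map a total score to the category defined in the Methods."""
    if ts > 9:
        return "good"
    if ts >= 6:  # 6 <= TS <= 9
        return "moderate"
    return "poor"  # TS < 6


if __name__ == "__main__":
    # Illustrative ratings only; these are not data from the study.
    for ratings in [(4, 4, 3), (3, 3, 3), (2, 2, 1)]:
        ts = total_score(ratings)
        print(ratings, ts, classify(ts))
```

Under this assumption a response rated 3 by all three surgeons (TS = 9) is still 'moderate', because the 'good' category requires TS strictly greater than 9.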

Results: In response to the general questions about GIOP and the questions based on the 2022 ACR-GIOP Guideline, Google Gemini provided more concise answers than the other LLMs. For questions on pathogenesis, ChatGPT-4 had significantly higher total scores (TSs) than ChatGPT-3.5. For questions based on the 2022 ACR-GIOP Guideline, the TSs of ChatGPT-4 were significantly higher than those of Google Gemini. For ChatGPT-3.5 and ChatGPT-4, TSs after self-correction were significantly higher than TSs before self-correction, whereas the TSs of Google Gemini's self-corrected responses did not differ significantly from those of its original responses.

Conclusions: Our study showed that Google Gemini provides more concise and intuitive responses than ChatGPT-3.5 and ChatGPT-4. ChatGPT-4 performed significantly better than ChatGPT-3.5 and Google Gemini in answering general questions about GIOP and questions based on the 2022 ACR-GIOP Guideline. ChatGPT-3.5 and ChatGPT-4 self-corrected better than Google Gemini.

Keywords: AI; ChatGPT; Glucocorticoid-induced osteoporosis; Google Gemini; Large language models.

Related content

Journal: Journal of orthopaedic surgery and research

Abbreviation: J ORTHOP SURG RES

ISSN: 1749-799X

IF/Quartile: 2.8/Q2

Large language models as assistance for glaucoma surgical cases: a ChatGPT vs. Google Gemini comparison

Evaluating the Potential of Large Language Models for Vestibular Rehabilitation Education: A Comparison of ChatGPT, Google Gemini, and Clinicians

Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3

Comment on: "Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3"

Reply to 'Comment on: Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3'

ChatGPT, Bard, and Bing Chat Are Large Language Processing Models That Answered Orthopaedic In-Training Examination Questions With Similar Accuracy to First-Year Orthopaedic Surgery Residents

Erratum for: The Era of ChatGPT and Large Language Models: Can We Advance Patient-centered Communications Appropriately and Safely?

The Era of ChatGPT and Large Language Models: Can We Advance Patient-centered Communications Appropriately and Safely?
