Background: Publicly available artificial intelligence (AI) vision language models (VLMs) are improving continually. The addition of vision capabilities to these models could enhance radiology workflows, and evaluating their performance in radiological image interpretation is essential before they can be integrated into practice.
Aim: This study aims to evaluate the proficiency and consistency of publicly available VLMs, Anthropic's Claude and OpenAI's GPT, in basic image interpretation tasks across multiple iterations.
Method: Subsets of the publicly available ROCOv2 and MURAv1.1 datasets were used to evaluate six VLMs. A system prompt and an image were input into each model three times. The outputs were compared with the dataset captions to evaluate each model's accuracy in recognising modality and anatomy and in detecting fractures on radiographs. The consistency of outputs across the three iterations was also analysed.
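The abstract does not include the evaluation code; as an illustration only, a minimal sketch of the protocol described above (assuming a generic, caller-supplied query_model wrapper around a VLM API, a hypothetical SYSTEM_PROMPT, simple keyword matching against caption-derived labels, and consistency defined as all runs giving the same verdict) could look like this:

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical wording; the study's actual system prompt is not given in the abstract.
SYSTEM_PROMPT = (
    "You are a radiology assistant. State the imaging modality, the anatomy "
    "shown, and whether a fracture is present."
)

def evaluate(
    cases: Iterable[dict],                      # each: {"image": bytes, "labels": {"modality": ..., ...}}
    query_model: Callable[[str, bytes], str],   # wraps one VLM API call (assumed interface)
    n_runs: int = 3,                            # each image is submitted three times
) -> dict:
    """Score modality/anatomy/fracture answers against caption-derived labels."""
    correct = Counter()
    consistent = Counter()
    total = 0
    for case in cases:
        total += 1
        outputs = [query_model(SYSTEM_PROMPT, case["image"]) for _ in range(n_runs)]
        for task, label in case["labels"].items():
            hits = [label.lower() in out.lower() for out in outputs]  # crude keyword match
            correct[task] += sum(hits) / n_runs          # mean per-iteration accuracy
            consistent[task] += len(set(hits)) == 1      # all runs agree (one possible definition)
    return {
        task: {"accuracy": correct[task] / total, "consistency": consistent[task] / total}
        for task in correct
    }
```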
Results: On the ROCOv2 dataset, modality recognition accuracy was high, with some models achieving 100%. Anatomical recognition accuracy ranged from 61% to 85% across all models tested. On the MURAv1.1 dataset, Claude-3.5-Sonnet achieved the highest anatomical recognition accuracy at 57%, while GPT-4o achieved the highest fracture detection accuracy at 62%. Claude-3.5-Sonnet was the most consistent model, with 83% and 92% consistency in anatomical recognition and fracture detection, respectively.
Conclusion: Given Claude and GPT's current accuracy and reliability, the integration of these models into clinical settings is not yet feasible. This study highlights the need for ongoing development and establishment of standardised testing techniques to ensure these models achieve reliable performance.
Keywords: AI; Claude; GPT; healthcare; large language models; vision language models.
© 2025 Royal Australian and New Zealand College of Radiologists.