GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model

Abstract

Background and study aims

Current general-purpose artificial intelligence (AI) large language models (LLMs) demonstrate limited efficacy in clinical medicine, often constrained to question-answering, documentation, and literature summarization roles. We developed GastroGPT, a proof-of-concept specialty-specific, multi-task, clinical LLM, and evaluated its performance against leading general-purpose LLMs across key gastroenterology tasks and diverse case scenarios.

Methods

In this structured analysis, GastroGPT was compared with three state-of-the-art general-purpose LLMs (LLM-A: GPT-4, LLM-B: Bard, LLM-C: Claude). Models were assessed on seven clinical tasks and overall performance across 10 simulated gastroenterology cases varying in complexity, frequency, and patient demographics. Standardized prompts facilitated structured comparisons. A blinded expert panel rated model outputs per task on a 10-point Likert scale, judging clinical utility. Comprehensive statistical analyses were conducted.

Results

A total of 2,240 expert ratings were obtained. GastroGPT achieved a significantly higher mean overall score (8.1 ± 1.8) than GPT-4 (5.2 ± 3.0), Bard (5.7 ± 3.3), and Claude (7.0 ± 2.7) (all P < 0.001). It outperformed the comparators in six of the seven tasks (P < 0.05); the exception was follow-up planning. GastroGPT showed more consistent scores (variance 34.95) than the general-purpose models (97.4–260.35) (P < 0.001). Its performance remained consistent across case complexities and frequencies, whereas that of the comparators did not (P < 0.001). Multivariate analysis showed that model type significantly predicted performance (P < 0.001).

Conclusions

This study pioneered the development of a specialty-specific, clinically oriented AI model and its comparison with general-purpose LLMs. GastroGPT demonstrated superior utility overall and on key gastroenterology tasks, highlighting the potential for tailored, task-focused AI models in medicine.

Keywords

Endoscopy Upper GI Tract - Reflux disease - Endoscopy Small Bowel - Inflammatory bowel disease - Neoplasia - Non-variceal bleeding - Pancreatobiliary (ERCP/PTCD)

Publication History

Received: 03 January 2025

Accepted after revision: 07 May 2025

Accepted Manuscript online: 16 June 2025

Article published online: 06 August 2025

© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited (https://creativecommons.org/licenses/by/4.0/).

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

Bibliographical Record
Cem Simsek, Mete Ucdal, Enrique de-Madaria, Alanna Ebigbo, Petr Vanek, Omar Elshaarawy, Theodor Alexandru Voiosu, Giulio Antonelli, Román Turró, Javier P Gisbert, Olga P. Nyssen, Cesare Hassan, Helmut Messmann, Rajiv Jalan. GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model. Endosc Int Open 2025; 13: a26372163.
DOI: 10.1055/a-2637-2163
