The present study aimed to investigate the accuracy of answers provided by ChatGPT 3.5 in response to 12 frequently asked questions on glenohumeral osteoarthritis. In addition, we evaluated whether the language and wording used influenced the accuracy of the answers. The questions were formulated in English and German, using both common and medical terms for glenohumeral osteoarthritis, creating four distinct groups for evaluation.
The rise of the Internet has dramatically transformed the accessibility of healthcare information, making medical knowledge more public. Many patients use the Internet—whether through conventional websites, online forums, or social media platforms—to seek medical information, including symptoms, diagnoses, and treatment options. This has not only changed the way patients access information but has also played a crucial role in empowering them. Consequently, the traditional patient–doctor dynamic has evolved, fostering a more collaborative approach to healthcare decision-making. In recent years, chatbots like ChatGPT have emerged and have become part of this online environment.
ChatGPT is an interactive AI chatbot developed by OpenAI and based on the GPT 3.5 language model (ChatGPT, OpenAI, 2021: https://openai.com/). The free version of ChatGPT uses GPT 3.5, which was trained on extensive text data from a variety of online sources up to September 2021 using advanced deep learning techniques. Recently, ChatGPT has gained significant attention in the medical community, as it has demonstrated strong contextual understanding and the ability to engage in coherent conversations, producing human-like responses [17, 18]. Past studies have shown that ChatGPT can accurately answer multiple-choice questions from the field of medicine, including questions from exams such as the United States Medical Licensing Examination (USMLE) and the German state medical examination [19,20,21]. However, the accuracy of ChatGPT's answers to medical questions appears to diminish as the complexity and taxonomy level of the questions increase [22].
ChatGPT is also increasingly being explored as a tool for patient education and support, providing accessible information on medical conditions and treatment options. While not a replacement for professional medical advice, ChatGPT could enhance patient engagement, offer quick responses to common questions, and support personalized learning experiences, making healthcare information more accessible and easier to understand. Studies on the use of ChatGPT for patient questions have been published in the field of orthopedics, including on periprosthetic joint infections [23, 24]. However, investigations of the accuracy of ChatGPT's answers to patients' questions on osteoarthritis of the shoulder are still lacking. Although less common than osteoarthritis of the knee or hip, glenohumeral osteoarthritis places a substantial burden on patients, often resulting in pain, restricted mobility, and diminished quality of life due to impaired shoulder function.
Overall, the accuracy of the answers provided by ChatGPT 3.5 was good for all groups. Except for two questions, relating to the progression of glenohumeral osteoarthritis and to dietary recommendations, ChatGPT 3.5 demonstrated good accuracy across all variations in language and terminology. However, it must be acknowledged that there were modest differences in accuracy between the groups. Questions in English using the term "glenohumeral osteoarthritis" received the highest average score of 3.9 on the 5-point Likert scale (0–4), indicating relatively high accuracy of the generated responses. In contrast, questions using the term "shoulder arthrosis" had the lowest average score of 3.2, suggesting comparatively lower accuracy. This highlights the potential impact of terminology on the performance of ChatGPT 3.5 in providing accurate information. Overall, answers to questions posed in English and German showed similarly good accuracy. Further research should explore the reasons behind the modest discrepancies in accuracy among the different language and terminology groups.
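For readers who wish to reproduce this kind of analysis, the group-level averaging described above can be sketched in a few lines of Python. The rater scores below are hypothetical placeholders chosen for illustration only, not the study's data; the group labels are likewise assumptions:

```python
from statistics import mean

# Hypothetical 0-4 Likert scores, one per question (12 questions per group).
# These values are illustrative placeholders, NOT the study's actual ratings.
scores = {
    "English, 'glenohumeral osteoarthritis'": [4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4],
    "English, 'shoulder arthrosis'":          [3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3],
}

# Average score per group, rounded to one decimal place as reported in the text.
group_means = {group: round(mean(vals), 1) for group, vals in scores.items()}

for group, avg in group_means.items():
    print(f"{group}: mean Likert score = {avg}")
```

In practice, each question would carry ratings from several independent reviewers, and the mean would be taken over all question-rater pairs within a group.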
In line with the present study, Gordon et al. showed that ChatGPT 3.5 can respond accurately and consistently to patients' imaging-related questions and can therefore enhance patient communication [16]. The same study demonstrated that prompts reduced response variability and yielded more targeted information [16]. Mika et al. reported that ChatGPT can answer questions on hip arthroplasty in an evidence-based and understandable way [25]. A study by Ayers et al. showed that chatbot responses to patient questions posted on social media had significantly higher quality and empathy than responses given by physicians [26]. Despite the numerous advantages of using ChatGPT in medicine, and especially in patient care and education, certain disadvantages and ethical considerations must be taken into account. It is crucial to recognize that the effectiveness of AI-driven conversational systems depends heavily on the data they are trained on. Critically, ChatGPT 3.5 relies only on data produced before January 2022. Furthermore, although ChatGPT 3.5 shows acceptable results for information processing and generation, it may not demonstrate the same degree of originality, creativity, and critical thinking typically demanded in the medical field [3]. It is important to understand the potential of AI-driven conversational systems like ChatGPT 3.5 in healthcare, especially for providing accurate and evidence-based information to patients.
Limitations

The present study has certain limitations. Firstly, the questions were formulated by simulating an individual patient experience. It must be noted that the results of a Google search depend on various factors such as location, search history, and other individual settings, which can lead to different search results.
Secondly, the study focuses on the accuracy of the generated answers alone, without considering other aspects such as clarity, comprehensiveness, or the provision of additional relevant information. Assessing these factors could provide a more holistic understanding of the utility of ChatGPT 3.5 in addressing patient queries on glenohumeral osteoarthritis. All surgeons acting as consultants for this study were German native speakers, and thus the German responses might have been scrutinized more critically. Finally, only four surgeons from the same clinic assessed the accuracy of the responses, which could introduce bias due to possible in-house particularities.