A diverse group of eye clinicians evaluated ChatGPT-4 responses to frequently asked questions regarding AMD across four quality domains: coherency, factuality, comprehensiveness and safety. Coherency was rated highest, followed by safety, factuality and comprehensiveness. Whilst the Likert scores were generally at an agreeable level for the group overall, evaluator ratings were variable, with over half of the questions scoring below 4 in the factuality and comprehensiveness domains. Free-text comments identified areas of deficit, and a substantial number of questions scored below an “agree” level across key quality domains, particularly with respect to the factuality, specificity and applicability of the information, and its contextualisation.
Performance of ChatGPT-4 in responding to frequently asked questions in AMD

Ferro Desideri et al. [15] compared three LLMs in answering general medical advice questions (15 questions) and questions related to intravitreal injections (13 questions) for AMD, with three retina specialists assessing the accuracy and sufficiency (comprehensiveness) of the responses. Specific to ChatGPT performance, the authors found that 12/15 responses to general medical advice questions were deemed accurate and sufficient, with the remaining three responses rated partially accurate and sufficient. For the questions related to intravitreal injections, 10/13 responses were deemed accurate and sufficient, and three were partially accurate and sufficient. These results suggest an optimistic view of LLM responses. However, their study did not report which characteristics of the responses led to them being deemed only partially accurate. Furthermore, although the authors reported a high level of sufficiency of responses (analogous to comprehensiveness in our present work), our results demonstrated lower ratings in this quality domain. The methodological approach also differed, as our study used a Likert scale, which provides more granularity than their three-level descriptive rating. As stated in the Methods, a 5-point scale permits the expression of both more “extreme” views and more tempered opinions (and thus greater granularity) relative to a 3-point scale, whilst remaining more efficient and potentially offering better test-retest reliability than larger scales, such as a 10-point scale [20].
Cheong et al. [16] evaluated the responses of several chatbots, including ChatGPT-4, to questions related to the macula and retina. Three fellowship-trained retinal specialists evaluated the chatbot responses using a 3-point Likert scale (0–2), and the scores were summed across graders to reflect a consensus approach to evaluation. They found that 83.3% of ChatGPT-4’s responses to the AMD questions were “good” (their highest rating), with none of the responses deemed “poor” (their lowest rating). ChatGPT-4 (and 3.5) outperformed the other chatbots in the study, and the authors concluded that these models are potentially capable of answering questions related to retinal diseases, such as AMD. Differences between our present study and the work of Cheong et al. [16] included the scope of questions and the method of grading. Their list of AMD questions was mostly thematically related to treatment and associated advice, such as vitamins and processes related to intravitreal injections, with some questions being highly specific, such as one related to verteporfin (Visudyne, Bausch and Lomb, Ontario, Canada) and ranibizumab (Lucentis, Novartis AG, Basel, Switzerland); notably, the chatbot used trade names rather than the generic terms. While their consensus approach was useful for obtaining an overall impression of quality, it did not facilitate analysis of the variability across graders.
Muntean et al. [17] conducted a study comparing ChatGPT-4, PaLM2 and three ophthalmologists’ responses to specific scenario questions, each incorporating a background vignette (such as stating that the asker of the question is a patient with AMD) that may be relevant to formulating the response. Using these permutations, the authors analysed the results of 133 questions along six axes of quality, some of which overlapped with our quality domains. Using two ophthalmologist reviewers, the authors reported very positive results for ChatGPT-4 responses, with 88–100% of responses obtaining a perfect score of 5 (on a 5-point Likert scale), higher than in our results. Key differences between their methodology and the present study could explain the differences in results. One difference was the comprehensiveness of the system and user prompts input by Muntean et al. [17], which included several important caveats, two of which were to ask the chatbot to explain why a question may not make sense instead of answering a confusing or incorrect question, and to not share false information if the chatbot does not know the answer. There were many instances in the present study where the information was not accurate or relevant to the question, which could be addressed by the inclusion of these prompts. Prefacing and contextualising the question could assist in providing more relevant and safer advice in the responses. Despite the optimism across most of the quality domains, Muntean et al. [17] also highlighted deficits in the responses in terms of their reflection of clinical and scientific consensus (i.e. contemporaneous and correct medical knowledge) and their omission of important information, similar to the criticisms raised in our results.
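As an illustration only, and not a reproduction of the prompts used by Muntean et al. [17] or in the present study, caveats of this kind can be supplied as a system message when a model is queried programmatically; the model name, prompt wording and example question in the sketch below are assumptions introduced purely for illustration.

```python
# Minimal sketch (assumed prompt wording and model name, for illustration only):
# supplying safety caveats as a system message via the OpenAI chat API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

system_prompt = (
    "You are answering questions from a patient with age-related macular degeneration. "
    "If a question is confusing or based on an incorrect premise, explain why rather than answering it. "
    "If you do not know the answer, say so rather than sharing false information."
)

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Do I need genetic testing for AMD?"},  # hypothetical question
    ],
)
print(response.choices[0].message.content)
```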
Overall, previous literature related to chatbot usage in AMD has been mostly positive, especially regarding the accuracy and comprehensiveness of responses. However, our study was comparatively less positive, possibly due to a greater diversity of graders, a wider range of questions and the use of a 5-point Likert scale across more domains of quality. Whilst it was unsurprising that coherency was the top-rated domain, its importance is arguably lower than that of safety and factuality, as the latter domains reflect the potential risks to the community of unsupervised chatbot use.
Variability in evaluations amongst evaluators and by professional group

The variability across the diverse team of raters in the present study suggests that the perceived accuracy or utility of chatbots may differ depending on the clinical setting and the patient base. For example, general optometric practices are more likely to see patients at risk of AMD or with earlier stages of AMD. Conversely, specialist ophthalmology clinics are more likely to see patients with more advanced stages of AMD and those requiring treatments, such as intravitreal injections. Other specific services, such as low vision clinics and collaborative care settings, may also shape the patient base and the information expected from a chatbot [22, 23].
The optometrist group returned lower ratings than the ophthalmologist group. One explanation may be the more conservative attitude of the optometrist group, which comprised clinicians working in a primarily academic setting. Criticisms related to the comprehensiveness of chatbot responses may reflect a professional habit of covering more information and content, given the longer attendance times typical of this professional group. The academic clinical setting may also have fostered a more critical attitude in the optometrist group, who sought more precise language in chatbot outputs.
Another explanation is the potential heterogeneity amongst all raters, and differing views on what level of precision in chatbot statements is acceptable. Although there are guidelines for the care of patients with AMD [24,25,26], differences at the professional level may also introduce biases into the interpretation of chatbot outputs. Despite authoritative guidelines, it is also known that consensus on statements regarding AMD within and between professions may be difficult to achieve, owing to the wide heterogeneity of clinical practices and patient presentations [27].
Separating quality domains in evaluating chatbot responses

That coherency was rated highest was expected, given the nature of LLM chatbot technology [28]; this domain of chatbot response quality tends to be highly rated in the literature across many fields. One notable issue was the lack of citations in some of the responses [29].
Regarding safety, a feature of many of the responses was a recommendation to seek expert advice from an eye care professional. This was particularly important for the treatment-themed questions. However, several questions were rated poorly for safety for other reasons, most notably poor advice recommending unnecessary tests or interventions. An example that was repeatedly criticised was genetic testing, which, at the time of the study, was not a routine clinical test for AMD [30].
Many questions also received suboptimal ratings for factuality. An issue raised by Muntean et al. [17] was the role of system prompts in ensuring an appropriate answer, and the responses generated under our approach further highlighted flaws in information saliency. Several of the chatbot responses may have been strictly true but were far removed from routine clinical practice, and the lack of prioritisation of important information meant that the facts were not accurately represented.
The problem of information saliency was also reflected in low comprehensiveness scores. The chatbot responses would sometimes include niche information, such as low vision aids and telescopes. Muntean et al. [17] attempted to pre-empt this limitation by adding a patient scenario to preface the question; however, again, a layperson using LLM technology may not have the expertise to add this information to optimise the response. A further limitation of pre-trained LLMs is potentially outdated information, as emerging technologies and treatments cannot be included in the responses.
Limitations

We have previously described the limitations of the subjective rating approach to evaluating LLM responses [18]. Combining multi-point Likert or other granular scales with a larger number of graders may help to overcome skewed subjective data. Although 5-point Likert scales are more granular than trinary scales, there is still the potential for ceiling or floor effects [31]; this was seen here, with many of the questions having a score of 4 or greater. Studies of this nature also lack a ground truth, instead relying on validity determined by experts. Reference standards are available when comparing across different LLMs or against expert human-generated outputs, but these also have issues with subjectivity.
Our list of questions was curated from several authoritative sources and was, in large part, simplified for the purposes of brevity. As described above, how questions are input into chatbots may influence response generation. Our goal was to keep the questions simple and broad; future studies with more granular questions could provide further insights.
Finally, a fuller understanding of clinical implementation would require end-user input, such as from patients at risk of, or living with, AMD. Alongside further stakeholder consultation, there are well-documented ethical challenges occurring in parallel to the clinical issues of accuracy, including concerns around privacy and security, intellectual property, transparency and accountability, bias and explainability, amongst others [32]. These are further considerations for clinicians prior to widespread deployment.