In this study, we used cases from Radiology’s “Diagnosis Please,” a radiological image diagnosis quiz for radiologists. When we provided additional information that may influence prior probabilities, specifically informing Claude 3.5 Sonnet that the cases came from a quiz, the diagnostic performance of the LLM improved significantly (overall accuracy increased from 64.9% to 70.2%). Conversely, when we provided incorrect additional information, setting the context as a primary care clinic, diagnostic performance decreased significantly (overall accuracy decreased from 64.9% to 59.9%).
Accurate performance evaluation and understanding of LLM characteristics are crucial for exploring their potential applications. Case collections like “Diagnosis Please,” which provide detailed clinical histories, radiological images, and confirmed final diagnoses that can be logically deduced from the given information, are well suited for evaluating LLM performance (e.g., comparisons between vendors or between versions from the same vendor) and for assessing similarities and differences with human radiologists. Several authors have previously conducted such studies [1–6, 11]. However, while previous studies assigned the role of a radiologist to LLMs, they did not inform the LLM that the cases came from quizzes. We believe this created a non-negligible gap that should be addressed.
Human radiologists adjust their diagnostic approach based on institutional and regional characteristics. This practice is consistent with Bayes’ theorem, which captures the probabilistic nature of clinical reasoning and implies that accurately recognizing the conditions that shape prior probability is essential for improving diagnostic accuracy [8, 9]. Given this context, we hypothesized that providing LLMs with additional information that may influence the prior probabilities of diseases in the target patient group/cohort would improve their diagnostic performance. Furthermore, we posited that providing accurate versus inaccurate additional information would lead to variations in diagnostic performance on the same set of cases.
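To make this dependence on the prior explicit, consider the standard diagnostic form of Bayes’ theorem (a textbook formulation, not a result of this study):

$$
P(D \mid F) = \frac{P(F \mid D)\,P(D)}{P(F \mid D)\,P(D) + P(F \mid \neg D)\,P(\neg D)}
$$

where $D$ denotes the presence of a disease and $F$ the observed clinical and imaging findings. Because the posterior $P(D \mid F)$ scales with the prior $P(D)$, the same findings should support a rare diagnosis more strongly in a quiz collection that over-represents rare diseases than in a primary care clinic where common diseases dominate.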
The results supported our hypothesis. In Condition 2, where the prompt aligned the assumed prior probabilities with the actual situation by emphasizing rare diseases, performance improved. In contrast, in Condition 3, where the prompt deviated from this assumption by emphasizing common diseases, performance declined. These results suggest that prompts that may influence prior probabilities affect LLM outputs in a way analogous to their effects on human clinical reasoning, although the reasoning process of LLMs remains a black box. This finding implies that LLMs may incorporate Bayesian-like principles in their diagnostic approach. Thus, in addition to optimizing the reasoning steps within the LLM by improving how information is presented, as shown by Sonoda et al. [11], diagnostic performance can also be enhanced externally, by conveying the nature of the given information through prompt engineering.
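For illustration only, such condition-specific context can be prepended to an otherwise identical diagnostic prompt. The sketch below uses the official Anthropic Python SDK; the system strings and model identifier are our own paraphrases and assumptions, not the exact prompts used in this study.

```python
import anthropic  # official Anthropic Python SDK

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical context prefixes paraphrasing the three study conditions.
CONTEXTS = {
    "condition_1": "You are a radiologist.",  # baseline: role only
    "condition_2": ("You are a radiologist solving a published imaging quiz "
                    "in which rare diseases are over-represented."),  # accurate prior
    "condition_3": ("You are a radiologist at a primary care clinic "
                    "where common diseases predominate."),  # misleading prior
}

def diagnose(case_text: str, condition: str) -> str:
    """Submit the same case under a given context and return the model's answer."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative model identifier
        max_tokens=1024,
        system=CONTEXTS[condition],
        messages=[{"role": "user",
                   "content": f"Provide the most likely diagnosis.\n\n{case_text}"}],
    )
    return response.content[0].text
```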
It is difficult to fully account for the present results because the exact internal mechanisms of LLMs are not disclosed by vendors. We believe an analogy with “data poisoning attacks” may help explain why the false primary care setting reduced the diagnostic performance of the LLM in this study. A data poisoning attack is a technique that induces an LLM to generate erroneous information or biased responses by injecting harmful information into its training data, and LLMs are known to present false facts when exposed to such attacks [17]. Although this study did not directly intervene in or manipulate training data, supplying a false premise through prompt engineering is conceptually similar: the prompt can place the LLM in a situation that conflicts with its previously learned data, which may have reduced diagnostic performance. This suggests that, to improve diagnostic accuracy, situations resembling data poisoning attacks should be avoided when LLMs perform diagnostic tasks. In other words, better diagnosis requires collecting as much information about prior probability as possible and applying it through prompt engineering. Notably, in the field of machine learning, it is known that the performance of a classifier improves when the correct prior probabilities are supplied [18].
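The classifier result in [18] can be illustrated with a standard prior-shift correction: posterior estimates produced under the training prevalence are re-weighted by the ratio of the deployment priors to the training priors. The sketch below uses made-up prevalence values purely for illustration.

```python
import numpy as np

def reweight_by_prior(probs: np.ndarray,
                      train_priors: np.ndarray,
                      deploy_priors: np.ndarray) -> np.ndarray:
    """Adjust a classifier's posterior estimates to a new set of class priors.

    probs: (n_samples, n_classes) posteriors produced under train_priors.
    """
    # Bayes' rule: divide out the training prior, multiply in the deployment prior.
    adjusted = probs * (deploy_priors / train_priors)
    # Renormalize so each row sums to 1 again.
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Example: a finding yields 60/40 for disease vs. no disease under a balanced
# 50/50 training prior, but the true clinic prevalence is only 5%.
probs = np.array([[0.60, 0.40]])
print(reweight_by_prior(probs, np.array([0.5, 0.5]), np.array([0.05, 0.95])))
# -> approx. [[0.073, 0.927]]: the correct, lower prior reverses the call.
```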
The results of this study suggest several directions for future research. Just as informing the LLM of the quiz nature of the cases improved its diagnostic performance, providing context about real clinical situations may enhance LLM performance in actual clinical settings. For instance, supplying the LLM with prevalence data specific to a region or institution could optimize its diagnostic output for that particular setting, making it more valuable in practice, as sketched below. This underscores the growing importance of developing databases for individual regions and institutions; proper ethical review processes will be essential to enable the input of clinical data into LLMs. In addition, we performed verification only with the Claude 3.5 Sonnet model because it showed the best results in a previous report [4]; testing the present issue with other LLMs, such as GPT-4o and Gemini 1.5 Pro, would help determine whether the present results are specific to Claude 3.5 Sonnet or generalize to other LLMs.
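As a minimal illustration of this direction, locally maintained prevalence data could be serialized into the prompt context. The table and figures below are hypothetical placeholders, not epidemiological data.

```python
# Hypothetical local prevalence table (cases per 100,000 per year).
LOCAL_PREVALENCE = {
    "pulmonary tuberculosis": 12.0,
    "lung adenocarcinoma": 35.0,
    "sarcoidosis": 4.5,
}

def prevalence_context(table: dict[str, float], setting: str) -> str:
    """Format an institution-specific prior-probability preamble for a prompt."""
    lines = [f"- {disease}: {rate:.1f} per 100,000/year"
             for disease, rate in sorted(table.items(), key=lambda kv: -kv[1])]
    return (f"You are a radiologist at {setting}. "
            "Weight your differential diagnosis by these local prevalence rates:\n"
            + "\n".join(lines))

print(prevalence_context(LOCAL_PREVALENCE, "a regional referral hospital"))
```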
This study has several limitations. The analysis was based on a limited number of cases, precluding subgroup analysis by disease categories. Because LLMs do not always return the same response to the same prompt, retests may yield different results. As the answer criteria set by the “Diagnosis Please” creators are not publicly available, our judgments of correct and incorrect answers may differ from the actual standards. Additionally, since all cases have been published as papers, there is a possibility that they were used in training the LLM.
We demonstrated that, in radiological image diagnosis quizzes, providing prior information about the quiz nature of the cases significantly improved the diagnostic performance of Claude 3.5 Sonnet, whereas giving incorrect context, such as a primary care setting, significantly decreased it. As for human physicians, the concept of prior probability, as formalized by Bayes’ theorem, appears to be crucial for the LLM. This implies that constructing optimized databases for specific regions and institutions and providing them to LLMs could enhance their diagnostic performance, potentially allowing LLMs to contribute more substantially to real clinical practice.