In this study, we evaluated the diagnostic performance of AI models and human readers in detecting skull fractures on brain CT images. The key distinguishing features of this study are as follows: (1) post-operative and pediatric patients were not excluded, and the training dataset drew on a larger pool of cases, facilitating AI training under more realistic conditions; (2) two post-processing rules were employed to reduce false positives; and (3) diagnostic performance and interpretation time were compared between human readers and AI.
Diagnostic Performance of AI Model
In this study, two strategies were employed to combat false positives. The first strategy involved training Model 2 using false positives from Model 1. The second strategy utilized two post-processing rules.
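The first strategy is a form of hard-negative mining. The following is a minimal sketch of the general idea; the function and attribute names (e.g., model1.predict) are hypothetical and do not reflect our actual implementation.

```python
import numpy as np

def collect_hard_negatives(model1, negative_volumes, threshold=0.5):
    """Gather slices from fracture-free scans on which Model 1 raises a
    false alarm; these 'hard negatives' (paired with empty ground-truth
    masks) are then added to the training data for Model 2."""
    hard_negatives = []
    for volume in negative_volumes:          # each volume: (slices, 512, 512)
        # model1.predict is a hypothetical interface returning per-voxel probabilities
        pred_mask = model1.predict(volume) > threshold
        for idx in range(volume.shape[0]):
            if pred_mask[idx].any():         # Model 1 predicted a fracture on this slice
                empty_mask = np.zeros_like(volume[idx], dtype=bool)
                hard_negatives.append((volume[idx], empty_mask))
    return hard_negatives
```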
As depicted in Table 4, Model 1 demonstrated superior sensitivity (recall) and DICE score compared to Model 2, while Model 2 displayed marginally higher specificity. It was unexpected that Model 2 performed less impressively despite being trained with false positives from Model 1. These hard cases are more challenging than ordinary negative cases, and incorporating them may have disturbed the decision criteria the model had already learned, resulting in less consistent predictions.
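For reference, the metrics discussed throughout this section follow their standard definitions (TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives; for DICE we assume the usual voxel-wise overlap between predicted and ground-truth masks):

$$\text{Sensitivity (recall)} = \frac{TP}{TP+FN}, \qquad \text{Specificity} = \frac{TN}{TN+FP},$$
$$\text{DICE} = \frac{2\,TP}{2\,TP+FP+FN}, \qquad F_1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}, \quad \text{where } \text{Precision} = \frac{TP}{TP+FP}.$$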
For instance, cases that have undergone surgery can be challenging to review, as surgical defects often closely resemble skull fractures, especially on axial views. Initially, the AI might misclassify these as fractures; after being trained to discern otherwise, its judgements become less certain. Consequently, the model experienced a decrease in sensitivity (recall) and DICE score, yet an improvement in specificity.
The model was trained and validated on a diverse dataset, including cases that are typically challenging for radiologists, such as subtle fractures and those obscured by artifacts. The architecture employed is a 2D U-Net, where the input images are flat 512 × 512 pixel slices. Clinically, accurately determining whether an image indeed shows a fracture often necessitates examining several consecutive slices and possibly reconstructing coronal and sagittal views. Although we incorporated adjacent slices above and below the original axial view into a three-channel input, the model fundamentally remains 2D. The 2D approach might still be prone to errors in complex cases.
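As a minimal sketch of this three-channel input scheme (assuming a volume already resampled to 512 × 512 axial slices; variable names are illustrative and not taken from our code), the adjacent slices can be stacked as follows:

```python
import numpy as np

def make_three_channel_input(volume: np.ndarray, idx: int) -> np.ndarray:
    """Stack the slice above, the target slice, and the slice below into a
    (512, 512, 3) array for the 2D U-Net; edge slices are padded by repetition."""
    above = volume[max(idx - 1, 0)]
    below = volume[min(idx + 1, volume.shape[0] - 1)]
    return np.stack([above, volume[idx], below], axis=-1)
```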
Therefore, exploring the use of a 3D model in the future could potentially yield better results by allowing a more comprehensive analysis of the spatial relationships and complexities inherent in skull fractures. However, adopting such a model is not without drawbacks. Firstly, 3D models typically lack pretrained weights, which is a considerable limitation. Secondly, because fractures are small, it is doubtful whether a 3D model's broader search range would be effective; such extensive scanning might not invariably result in improved performance.
The Impact of Post-processing Rules on Performance
Rule A excluded clusters smaller than 78 voxels, while Rule B excluded lesions appearing in fewer than four consecutive slices. Both rules are designed to reduce false alarms. After the post-processing rules were applied, both models showed improved specificity at the cost of sensitivity (recall) and DICE score. The combination with the highest F1 score was Model 1 plus Rule B, which achieved high sensitivity (recall), specificity, and DICE score.
Adding Rule A alone yielded only a slight increase in specificity, indicating that Rule A can reduce false alarms, presumably by removing tiny non-fracture lesions, without strongly affecting sensitivity (recall) or DICE score. Conversely, adding Rule B alone notably boosted specificity, implying that Rule B effectively excludes non-fracture lesions that are large enough in volume but do not extend across enough slices, such as arachnoid granulations, at the cost of potentially missing some fractures. In certain cases, fracture lines are very thin yet span more than four slices; when Rule A and Rule B are applied simultaneously, even such legitimate fractures can be filtered out. The differing impact of Rules A and B suggests that the shape of a fracture lesion may affect AI performance.
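A minimal sketch of how Rules A and B could be applied to a binary prediction mask is shown below, using connected-component labelling from scipy; the thresholds come from the rules described above, but the implementation details are illustrative rather than the study's exact code.

```python
import numpy as np
from scipy import ndimage

def apply_rules(pred_mask: np.ndarray, min_voxels: int = 78, min_slices: int = 4) -> np.ndarray:
    """Drop predicted clusters smaller than min_voxels (Rule A) or spanning
    fewer than min_slices consecutive axial slices (Rule B).
    pred_mask is a boolean (slices, height, width) volume."""
    labeled, n_clusters = ndimage.label(pred_mask)
    kept = np.zeros_like(pred_mask, dtype=bool)
    for cluster_id in range(1, n_clusters + 1):
        cluster = labeled == cluster_id
        n_slices = np.unique(np.nonzero(cluster)[0]).size   # axial extent of the cluster
        if cluster.sum() >= min_voxels and n_slices >= min_slices:
            kept |= cluster
    return kept
```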
Although the DICE score is not perfect, the primary objective of this AI model is to alert radiologists to potentially overlooked fractures. This rationale justifies converting the segmentation output into a binary classification, even when the DICE score is low. The goal is achieved as long as the fracture's location is identified, enabling radiologists to thoroughly examine the relevant areas.
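A simple way to perform this conversion, shown purely as an illustrative assumption rather than the study's exact criterion, is to call an examination positive whenever any predicted lesion survives post-processing:

```python
def case_is_positive(postprocessed_mask) -> bool:
    """Flag the case as fracture-positive if any predicted voxel remains
    after Rules A and B have been applied."""
    return bool(postprocessed_mask.any())
```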
False Positives (FP) and False Negatives (FN) Incorrectly Identified by AI
Based on Fig. 6, we noticed that the AI model may produce false-negative cases, some of which even display indirect signs such as hemosinus, soft-tissue swelling, or orbital protuberance. As a result, it is crucial to assess patients with indirect signs thoroughly and to consider their clinical history before reaching a definitive diagnosis.
Fig. 6 Representative images of false-negative predictions of the model. The figure illustrates numerous instances of false negatives that escaped detection by the AI system. Some have indirect signs, such as swelling of soft tissues, sinus opacification, and asymmetry in the positioning of the orbits. Right temporal bone fracture (A). Nasal septum and inner wall of right mastoid sinus fracture (B). Lateral wall of the left maxillary sinus (C, F). Fracture of frontal sinus (D). Fracture of ethmoid sinus (E, L). Superior orbital wall fracture (G). Parietal/frontal bone fracture (H1/H2). Right zygomatic arch fracture (I). Right lateral orbital wall fracture (J, K)
Fig. 7 shows several cases that the AI misclassified as fracture-positive. Most of the incorrect segmentations point to sutures, post-operative changes, or emissary veins, a finding consistent with previous studies [21, 22]. Although the model's internal reasoning remains a black box, most of the cases in Fig. 7 lack symmetry; we therefore speculate that symmetry may be an important cue for the model.
Fig. 7 Representative images of false-positive predictions of the model. The red lines in the top row represent the predicted fractures, while the bottom row shows the original images without AI segmentation. Prominent right coronal suture (1a, 1b). Left coronal suture and emissary vein (2a, 2b). Burr holes (3a, 3b). Suboccipital craniectomy (4a, 4b). Frontal suture (5a, 5b). Prominent lambdoid suture (6a, 6b). Parietal emissary vein (7a, 7b; 8a, 8b)
Observer Study 1—Diagnostic Performance Without AI Assistance
In comparison with the human participants, the AI model demonstrated superior performance, with the highest sensitivity (recall) (87.06%) and specificity (98.60%), surpassing the best human results. The observers labeled 671 cases in a single session, whereas in routine clinical practice the volume of brain CT scans to be reviewed is considerably smaller. Such an intensive review session could lower focus and increase errors, and may therefore underestimate human performance.
Compared to AI, human specificity remains within an acceptable range, although the sensitivity (recall) is considerably lower. Radiologists with greater experience significantly surpassed others in terms of sensitivity (recall). The neuroradiologist was the most efficient in interpretation, achieving a comparable specificity to AI in half the time of her counterparts.
Observer Study 2—Diagnostic Performance with AI Assistance
AI assistance significantly improved diagnostic performance and reduced diagnostic duration, although most human readers still did not surpass the AI. The greatest improvement was in the sensitivity (recall) of less experienced readers, which nearly doubled. The neuroradiologist retained the highest sensitivity (recall) among human readers. The specificity of human readers also increased, and the duration of diagnosis was reduced to one half or even one third of the original time.
AI has the capability to expedite the diagnostic process by automatically identifying lesions and visualizing them. By bolstering diagnostic accuracy and significantly reducing the diagnostic time, the integration of AI assistance in the diagnostic process holds great promise for improving patient care [23].
Comparison with Previous Studies
We have compiled a table (Table 6) comparing various details of studies on skull fractures, including our own study, which is listed in the bottom row.
Table 6 Comparison with previous studies
Our AI model demonstrated a robust ability to distinguish between fracture and non-fracture cases. This is comparable to other studies, and our model achieves a commendable balance of sensitivity (91%) and specificity (87%), which supports its potential utility in clinical settings.
Additionally, unlike many studies that exclude pediatric patients and those with prior surgeries, our study maintains a natural clinical condition by including these groups. This inclusion enhances the generalizability and applicability of our AI tool across a broader patient spectrum, reflecting real-world scenarios more accurately.
Our dataset is notable not only for its inclusivity but also for its collection period and size. Data were collected over a 12-year period, which is relatively long compared with most other studies, whose collection periods range from 2 to 20 years. This extensive collection period allowed us to amass a considerable number of cases, enhancing the reliability of our findings.
Comparatively, most studies predominantly employ a segmentation approach. This method is favored for its precision in delineating intricate structures such as fractures on skull CT scans. The gold standard for validating AI performance in these studies is mostly the radiological CT report, although one study used autopsy findings (Fig. 2).
Limitations
Our study faces several limitations that may impact the interpretation and application of its findings.
Firstly, the dataset was confined to a single medical center, which may limit the generalizability of the results across different populations and settings.
Secondly, the composition of the test dataset, specifically the ratio of positive to negative cases, may not accurately reflect real-world clinical scenarios, potentially skewing the AI model’s performance metrics.
Thirdly, the validity of the results could be influenced by a learning effect, as participants evaluated the same dataset twice within a 1-month period, which might have affected their diagnostic accuracy during the second review.
Fourthly, our AI model demonstrated difficulties in distinguishing between actual fractures and artifacts from previous craniectomies. This challenge is exacerbated by the use of 2D images from single slices, which can also pose difficulties for human experts without access to sequential cuts for a more thorough assessment. We suggest that employing a 3D model in future research could potentially mitigate these issues, as it would offer a more comprehensive view of the cranial structure and better differentiate between true fractures and post-surgical changes.
Fifthly, our exploration of different proportions in the training dataset revealed that training Model 1 exclusively with positive cases inadvertently increased the AI's propensity to predict positives, leading to a higher rate of false positives. To counteract this, we introduced a substantial number of hard negative cases (the false positives produced by Model 1) into the training process for Model 2. Although this strategy has previously succeeded in enhancing model accuracy and reducing false positives, it did not yield the expected outcome in this instance; we hypothesize that complications related to post-craniectomy scenarios may have influenced this result. Future studies might benefit from also incorporating typical cases, which could help refine the model's accuracy further by providing a more balanced and realistic training environment.