In the present study, the ML models developed by Grant et al. on a Canadian population [16] were tested on an external cohort of Italian syncope patients to assess their effectiveness in risk prediction. In addition, the performance of these models [16] was compared with that of novel DL algorithms.
We found that the classic ML model GB achieved the highest AUC values. Moreover, the traditional ML models outperformed the recently developed DL models used in this study for syncope risk prediction.
The results of our study indicate that the prediction capability of the previously established models GB and LR [16] decreased when validated on the current cohort of syncope patients, as evidenced by lower AUC values (0.78 and 0.75, respectively) compared to the original results (0.91 and 0.90, respectively). Several points should be considered in interpreting these findings, as the Italian external validation cohort in the current study significantly differed from the original derivation cohort [16].
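For readers less familiar with the metric, the AUC values compared above can be read as the probability that a model assigns a higher risk score to a randomly chosen patient with an adverse event than to a randomly chosen patient without one. The following is a minimal sketch of that pairwise definition in plain Python. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; they are not study data and not the code used in this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (event, non-event) pairs ranked correctly.

    labels: 1 for an adverse event, 0 otherwise.
    scores: predicted risk, higher meaning higher risk.
    Ties between an event and a non-event count as half a correct pair.
    """
    event_scores = [s for y, s in zip(labels, scores) if y == 1]
    non_event_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not event_scores or not non_event_scores:
        raise ValueError("AUC needs at least one event and one non-event")

    correct = 0.0
    for e in event_scores:
        for n in non_event_scores:
            if e > n:
                correct += 1.0
            elif e == n:
                correct += 0.5
    return correct / (len(event_scores) * len(non_event_scores))


# Toy example (hypothetical values, not patient data).
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.50, 0.60]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.3f}")  # 14 of 16 pairs ranked correctly -> 0.875
```

Under this reading, a drop from 0.91 to 0.78 means that the model ranks a noticeably smaller share of event and non-event pairs in the correct order in the new cohort.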
Firstly, the Italian validation cohort comprised individuals presenting with syncope at six different EDs, all characterized as non-low risk based on the assessment of the physician in charge [26]. According to guidelines and position papers [2,3,4,5,6], non-low-risk syncope patients should undergo further, detailed evaluation for appropriate management, which might include clinical observation with continuous monitoring in the ED [3, 27, 28], hospital admission to an adequate ward with medium-term monitoring facilities, or fast-track referral to a Syncope Unit for expedited outpatient assessment [29]. Given the complexity of clinical decision making in this context, we reasoned that selecting a non-low-risk population would provide the most suitable cohort to test the effectiveness and practical utility of the Grant et al. risk stratification algorithms [16]. Not surprisingly, adverse outcomes were more frequent in our non-low-risk population than in the Canadian derivation cohort (13% versus 3.6%) [16]. This discrepancy can also be partially explained by differences between healthcare systems and local syncope management protocols in the two countries, such as the common practice of immediate ED discharge for low-risk syncope patients in Italian hospitals. Lastly, the Italian validation cohort was older than the original Canadian derivation cohort, with a median age of 71 years compared to a mean of 54 years in the Canadian training and test sets [16].
Taken together, these differences indicate substantial variations in the case mix between the two cohorts, likely explaining the observed discrepancies in predictive performance of the original models in the current investigation.
Novel risk stratification tools should take into account differences in healthcare systems where they are implemented. Over-diagnosis in resource-limited settings may be counterproductive, as these systems might lack the capacity to manage a high volume of false positives. This could lead to unnecessary investigations, increased strain on limited resources, and potentially overlooking other critical conditions. The implementation of AI could enhance efficiency and diagnostic accuracy without further burdening already stressed healthcare systems.
The CSRS was recently proposed as a new decision rule for syncope risk stratification, showing high sensitivity and accuracy within a Canadian clinical setting. We recently validated CSRS on an Italian syncope cohort using classical statistics and found its predictive accuracy to be similar to that of clinical judgement [29]. In addition, clinical judgement would have led to the ED discharge of fewer patients who went on to suffer 30-day adverse events than CSRS.
In the present study, we applied the CSRS predictors to novel algorithms, namely a DL model, TabPFN, and a large language model-based approach, TabLLM, to assess their predictive capability on a multicentre Italian clinical dataset. Notably, both TabPFN and TabLLM were previously developed and used to address classification problems, particularly the extraction of medical information from tabular data [30].
The results of the current study suggest that both TabPFN and TabLLM underperformed compared to the classic ML-based models, such as GB, and to LR [16]. A possible explanation is that the tabular attributes of our dataset largely comprised binary data, which are known to carry less information than numerical data. Supporting this hypothesis, age emerged as a significant predictor of adverse outcomes according to the GB model. Notably, age was recorded as a numerical variable in the validation dataset and therefore likely conveyed more information than simple binary (yes/no) variables. These aspects, combined with the limited size of our study population, might have impaired the ability of both TabPFN and TabLLM to accurately extract medical information for proper syncope risk stratification [30, 31].
Compared to other risk stratification tools, the algorithms presented in this study achieved a numerically lower AUC. In their respective validation studies, the OESIL score reported an AUC of 0.89, the EGYS score 0.80, and the SFSR 0.92 [11, 31,32,33,34]. However, several important differences should be noted. First, the mean age in the present study population was higher, indicating a frailer cohort potentially subject to multiple concurrent factors contributing to adverse events. Second, the algorithms evaluated in this study utilized a greater number of predictors compared to the EGYS, SFSR, and OESIL scores, which may reduce their generalizability. Finally, the predictors used in our model included specific ECG criteria that differed from those used in the other scores, which could also have influenced the algorithm’s overall performance.
Limitations
We acknowledge a few limitations of the current study. Although based on a multicenter dataset, our validation cohort included a relatively small number of patients, which might have affected the performance of the Grant et al. models [16]. However, as highlighted in a recent publication [17], homogeneous and focused datasets, such as the one used in our study, are likely to improve data accuracy and yield reliable AI-based prediction even with limited sample sizes. In the present study, the specific use of a non-low-risk syncope cohort for validation likely led to substantial differences in the case mix between the Canadian derivation population and our Italian validation population, potentially impacting the predictive effectiveness of the Grant et al. models [16]. This further underscores that a syncope risk model developed from one specific population may not be easily generalizable to other syncope cohorts without a loss of predictive capability.
The role of troponin as a predictor in this investigation deserves additional commentary. Troponin values were frequently missing in our validation dataset. Given that troponin is a highly specific marker for myocardial ischemia, it is reasonable to hypothesize that ED physicians ordered this test only when an ischemic cause of syncope was strongly suspected, consistent with guideline recommendations [2, 3] and a Choosing Wisely approach [32, 33]. We tested three different methods for automatic troponin data imputation (SimpleImputer, IterativeImputer and KNNImputer) and found no significant differences in the GB model’s performance. Thus, given the high number of missing values, troponin’s utility as a reliable predictor of adverse outcomes in this context is limited. In contrast, other variables such as QRS axis and age, which were included in the GB model but not in the LR model, may have contributed to the superior predictive effectiveness observed for the GB model in our study. Indeed, these variables are usually available for every syncope patient. Moreover, an abnormal QRS axis is a strong indicator of cardiac disease, aiding in the identification of patients with underlying structural heart disease [34].
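As an illustration of how such an imputation comparison can be set up, the sketch below runs the three scikit-learn imputers named above in front of a gradient boosting classifier and reports the AUC for each. It rests on several assumptions that are not taken from the study: the data are synthetic, roughly 60% of troponin values are removed completely at random (whereas in practice missingness probably depended on clinical suspicion), the model is refitted with 5-fold cross-validation rather than applied as a pre-trained model, and the helper `compare_imputers` and all variable names are hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to import IterativeImputer)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption: NOT the study dataset).
rng = np.random.default_rng(0)
n = 300
age = rng.normal(70, 12, n)
abnormal_axis = rng.integers(0, 2, n).astype(float)
troponin = rng.lognormal(2.5, 0.8, n)
logit = 0.05 * (age - 70) + 0.8 * abnormal_axis + 0.02 * (troponin - 12) - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Remove about 60% of troponin values completely at random (assumption).
troponin_observed = troponin.copy()
troponin_observed[rng.random(n) < 0.6] = np.nan
X = np.column_stack([age, abnormal_axis, troponin_observed])

imputers = {
    "SimpleImputer": SimpleImputer(strategy="median"),
    "IterativeImputer": IterativeImputer(random_state=0),
    "KNNImputer": KNNImputer(n_neighbors=5),
}


def compare_imputers(X, y, imputers, cv=5):
    """Return the mean cross-validated AUC of a GB model for each imputer."""
    results = {}
    for name, imputer in imputers.items():
        # The imputer sits inside the pipeline, so it is fitted on the
        # training folds only and never sees the held-out fold.
        model = make_pipeline(imputer, GradientBoostingClassifier(random_state=0))
        scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
        results[name] = float(scores.mean())
    return results


auc_by_imputer = compare_imputers(X, y, imputers)
for name, auc in auc_by_imputer.items():
    print(f"{name}: mean cross-validated AUC = {auc:.3f}")
```

Keeping the imputer inside the pipeline avoids leaking information from held-out patients into the imputed values. Similar AUC values across the three imputers would be consistent with the imputed variable contributing little to the prediction.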
It is important to point out that SHAP and LIME methodologies could have been used to make the models more interpretable. However, since the aim of the current study was to validate previously established models, addressing this issue was beyond the scope of the present investigation. For the same reason, potential novel predictors were not added.