Evaluating the impact of the Radiomics Quality Score: a systematic review and meta-analysis

This systematic review and meta-analysis provides a summary insight into the adherence of radiomics studies to the RQS. Criticism of the RQS was found in 60/130 (46.2%) of review papers, and more than half of review papers utilised other evaluation tools and checklists. Readers incorrectly applied or summed RQS criteria in almost 40% of the 98 reviews investigated. The mean RQS of quality assessments from 117 review papers was 9.4 (26.1% of the maximum score) and temporal analysis indicates the RQS of radiomics studies has increased with time, along with improvements in 10/16 criteria. Radiomics studies investigating US exhibited the highest mean RQS, followed by MRI, CT, and lastly PET.

Progress on implementing phantom studies, test-retest studies, external validation, prospective studies, cost-effectiveness analysis, and open science in radiomics studies is minimal or insignificant. Additionally, investigation of biological correlates appears to be significantly decreasing over time. Researchers may consider these components of the RQS difficult to implement, non-applicable, or not relevant to radiomics analysis. Phantom, test–retest, and prospective studies are resource-intensive, and are often impractical given that most radiomics papers are exploratory and developmental. Furthermore, since most studies are developmental, cost-effectiveness analysis, which would occur in the later stages of clinical translation, is not applicable to most researchers. Open science does not appear to be prioritised by researchers; however, most studies utilise image biomarker standardisation initiative (IBSI) compliant radiomics extraction software [22]. Notably, the external validation and prospective study criteria form one-third of the maximum score (12/36), which developmental studies will not receive, echoing the criticism that some criteria are "too penalising". Lastly, there is a potential trend of biological correlates being viewed as less relevant to radiomics analysis by researchers. Therefore, priorities in the radiomics community may be evolving.

Tomaszewski and Gillies [39] emphasised the importance of investigating the biological underpinnings of predictive radiomics features, and interest remains in exploring the relationship between imaging characteristics and tumour genetic expression [40, 41]; however, recent community guidelines and initiatives do not appear to place the same emphasis on this aspect of clinical translation [27, 42,43,44]. In particular, the METRICS tool represents an important community initiative on radiomics study evaluation, yet it does not include biological correlates [27]. Whereas the points for the RQS criteria are arbitrary, the METRICS tool weights items based on expert opinion. Interestingly, the category in METRICS with the lowest weight is open science, which reflects our observations, although its inclusion still reflects its importance to the radiomics community. Lastly, METRICS also addresses applicability to deep learning studies through a conditional item for end-to-end deep learning pipelines, providing flexibility to authors who may wish to investigate hand-crafted radiomics features, features automatically learned by neural networks, or both. Other resources also continue to evolve; for example, the CheckList for Artificial Intelligence in Medical imaging (CLAIM) received an update in 2024, and phase two of the IBSI addressed standardisation of imaging filters [42, 45].

Naturally, our results can be compared to those of Spadarella et al, who included 44 systematic reviews in their analysis [24]. Like their study, we found no significant difference in mean RQS between oncology and non-oncology-focused reviews. In contrast, reviews of neuro-imaging applications of radiomics had a significantly lower mean RQS when compared to other imaging areas. This significance was not observed when repeating this analysis on data extracted from only the reviews included by Spadarella et al [24], so we extended the subgroup analysis to each criterion for insight. Neurology scored notably low for multivariable analysis and comparison to a "gold standard"; therefore, further focus may be required in these areas. Additionally, neuro-oncology does not have a clear "gold standard" to compare to, such as TNM staging, which may partially explain poor performance due to "applicability". In comparison, breast imaging studies scored significantly higher and were more likely to conduct reproducibility analyses, feature selection, biological correlation, comparison to a "gold standard", and open science.

A significant difference in quality assessments was observed between radiomics studies investigating different imaging modalities. Namely, we compared the most investigated modalities with sufficient sample sizes: US, PET, CT, and MRI. Notably, studies extracting features from US tended to exhibit a higher mean RQS, which may be because US is often readily available, non-ionising, and has short scan times. PET radiomics studies exhibited the lowest mean RQS; in contrast to US, PET is resource-intensive, ionising, and has long scan times. As such, these attributes may impact dataset sizes and the implementation of reproducibility studies, resulting in a lower RQS.

A reproducibility study of the RQS by D'Antonoli et al found poor-to-moderate ICC agreement of quality scores amongst observers, independent of their experience and initial training [26]. Kappa values of the criteria ranged from −0.21 to 0.75. Agreement as measured by the ICC and kappa appears lower than that reported in the present analysis, potentially highlighting an inherent bias in self-reported agreement within institutions. These results reflect the "reproducibility" criticism due to purportedly unclear criterion definitions. Indeed, a lack of inter-reader agreement was noted in the earliest use of the RQS [25]. Since some measures of agreement were based on criteria that are rarely implemented in radiomics studies, we reported kappa values alongside observed adherence. As expected, agreement is generally high for these criteria. Furthermore, kappa values are unstable with skewed marginal distributions (prevalence) of ratings [46, 47].
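The instability of kappa under skewed prevalence can be illustrated with a minimal sketch; the ratings below are invented for illustration only. Two readers agreeing on 96 of 100 studies yield very different kappa values depending on how often the criterion is met:

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two binary raters (lists of 0/1 ratings)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n              # marginal prevalences
    pe = pa * pb + (1 - pa) * (1 - pb)           # chance agreement
    return (po - pe) / (1 - pe)

# Skewed prevalence: criterion met in only ~3% of studies
# (e.g. phantom or prospective studies); 96/100 raw agreement.
skewed_a = [1] * 1 + [1] * 2 + [0] * 2 + [0] * 95
skewed_b = [1] * 1 + [0] * 2 + [1] * 2 + [0] * 95

# Balanced prevalence (~50%) with the same 96/100 raw agreement.
bal_a = [1] * 48 + [1] * 2 + [0] * 2 + [0] * 48
bal_b = [1] * 48 + [0] * 2 + [1] * 2 + [0] * 48

print(cohen_kappa(skewed_a, skewed_b))  # ~0.31 despite 96% agreement
print(cohen_kappa(bal_a, bal_b))        # ~0.92 at the same 96% agreement
```

Identical raw agreement thus maps to very different kappa values once the marginal distribution is skewed, which is why kappa is best reported alongside observed adherence for rarely implemented criteria.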

There are some limitations to this meta-analysis. Firstly, we did not account for overlap in quality assessments of radiomics studies by readers in separate reviews. Secondly, we relied on reviews to accurately report the publication year, imaging modality, and country of radiomics studies. Thirdly, systematic reviews excluded for omitting the RQS may have covered studies of lower quality, introducing selection bias. Additionally, a keyword search of "radiomics" since 2022 retrieved over 9400 studies indexed in Scopus, a majority of which would not have been assessed in the reviews we included. Fourthly, we attempted to correct errors in criteria application in reviews; however, the intent of the original score cannot be known with certainty. Nevertheless, we believe a consistent application of each criterion across reviews was required for the meta-analysis; an approach which subsequently revealed frequent, incorrect application of the RQS criteria. Importantly, all systematic reviews and reported quality assessments will include noise, and it has been demonstrated that applying the RQS to the same radiomics study is highly variable [26]. To mitigate this, we extracted a very large sample of quality assessments to robustly identify trends in the radiomics literature. Lastly, to better ensure accurate reporting in the future, our group has developed an RQS calculator (https://uwa-medical-physics-research-group.github.io/RQS-calculator/).
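The summation logic behind such a calculator can be sketched as follows. The criterion names and point values below are illustrative placeholders, not the official 16-item rubric; only the 36-point maximum and the percentage convention (score divided by 36, floored at zero) follow the RQS as used in this analysis:

```python
MAX_RQS = 36  # maximum total score across all RQS criteria

def rqs_total(item_scores):
    """Sum per-criterion points and report the percentage of the
    36-point maximum; negative totals are floored at 0%."""
    total = sum(item_scores.values())
    pct = max(total, 0) / MAX_RQS * 100
    return total, round(pct, 1)

# Illustrative scoring of a hypothetical developmental study.
example = {
    "image_protocol": 1,        # placeholder criterion names/points
    "multiple_segmentation": 1,
    "feature_reduction": 3,
    "validation": 2,
    "gold_standard": 2,
    "open_science": 1,
}
print(rqs_total(example))  # (10, 27.8) — near the observed mean of 9.4 (26.1%)
```

Consistently applying one summation like this across reviews, rather than trusting each review's reported totals, is what exposed the frequent summation errors noted above.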

We have demonstrated that radiomics studies are increasingly adhering to the criteria of the RQS. However, the progress observed in the majority of studies to date has not demonstrated a sufficiently high level of evidence for clinical translation. The RQS has demonstrable shortcomings, and radiomics has rapidly evolved since its inception, spurring the emergence of new appraisal tools and community-led advances. Importantly, if the field of radiomics can identify a small subset of features that are generalisable, robust, and predictive, and then rigorously validate them, clinical translation will be achievable.
