After removal of duplicates, a total of 86, 157, and 110 records were identified for predictive, prognostic, and serial testing, respectively. Of these, 55, 151, and 104 studies were excluded after abstract screening and full-text assessment because they did not fulfill the eligibility criteria. Reasons for exclusion included non-English language, a different intervention or population, no reported cost-effectiveness outcomes, and unavailability of the full text. Ultimately, 43 papers were included in total: 31 papers for predictive testing [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54], 6 for prognostic testing [55,56,57,58,59,60], and 6 for serial testing [61,62,63,64,65,66]. The flow diagram of the study selection process is shown in Online Resource 3 (see ESM).
Table 2 presents the summary characteristics of the included studies for the three biomarker applications. Testing strategies differed between the biomarker applications, and a wide variety of strategies was also observed within each application. Note that the included studies for serial testing were generally older than those for the other biomarker applications, as five of the six studies were published before 2005. In general, these older studies tended to report less comprehensive information on the methodology and the data sources. A more detailed overview of the included studies, including the extracted data per study, is provided in Online Resource 4 (see ESM).
Table 2 Summary characteristics of included studies for the three biomarker applications

3.2 Literature Findings

3.2.1 Model Assumptions and Uncertainty

Of the 43 included studies, 38 (88%) relied on distinct sources for the input data used for the test and treatment parameters (28 predictive testing, 6 prognostic testing, and 4 serial testing). These studies required linkage of test results to treatment evidence, as described further below. A further three (7%) studies, all on predictive testing, informed their cost-effectiveness analysis with sources describing the combined effect of testing and subsequent treatments ('end-to-end' evidence). These three studies did not need to link test results to treatment parameters in the model. One study incorporated data from four RCTs on immunotherapy with biomarker-stratified trial designs [26]. The other two studies used real-world data (RWD) to inform their analyses (i.e. one national registry and one prospective observational cohort) [39, 51]. In the remaining two (5%) studies (both serial testing), it was unclear what evidence was used as input for each parameter [61, 62].
3.2.1.1 Input for Biomarker Test

Studies linking sources for test parameters to different sources for subsequent treatment effects used a variety of test-related input parameters to inform their models. Most studies (78%) included test performance, although how test performance was expressed differed across the three biomarker applications, as discussed below. Besides test performance, 15/43 (35%) studies also included other test-related parameters, such as the success rates of tests or biopsies, turnaround time, or lead time for disease progression.
For predictive testing, test performance expressed as sensitivity and/or specificity was explicitly included in 21/28 (75%) studies. Of these 21 studies, ten studies derived the evidence for these parameters from retrospective evidence, four from prospective evidence and the remaining seven studies had a mixture of evidence, relied exclusively on expert opinion, or did not clearly report the source. Studies that did not include test performance-related parameters informed their cost-effectiveness analysis with the prevalence of mutations or the positivity rate of tests.
For prognostic testing, test performance was expressed as the difference in recurrence risk between prognostic subgroups, and was included in all six (100%) studies. Four of the six (66%) studies used a hazard ratio for recurrence risk between subgroups, one (17%) used a continuous-scale relative risk that depended on the prognostic score, and one (17%) used time-to-recurrence distributions for different prognostic subgroups. Five (83%) studies based this parameter on prospective evidence, and one (17%) on retrospective evidence.
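The hazard-ratio approach can be sketched as follows. This is an illustrative reconstruction under a proportional-hazards assumption with an exponential baseline, not the parameterization of any specific included study; the rate and hazard ratio are hypothetical.

```python
import math

def recurrence_free_survival(t, base_rate, hazard_ratio=1.0):
    """Recurrence-free survival at time t under proportional hazards,
    assuming an exponential baseline: S(t) = exp(-base_rate * HR * t)."""
    return math.exp(-base_rate * hazard_ratio * t)

# Low-risk subgroup (baseline) vs. high-risk subgroup (hypothetical HR = 2.0)
s_low = recurrence_free_survival(5, 0.05)        # 5-year recurrence-free survival, low risk
s_high = recurrence_free_survival(5, 0.05, 2.0)  # same horizon, high-risk subgroup
```

In a model, these subgroup-specific curves then determine how many patients in each prognostic stratum experience recurrence within the time horizon.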
For serial testing, test performance was expressed as sensitivity and/or specificity, and was explicitly included in 4/6 (66%) studies. Two of these studies based this parameter on prospective evidence, one on expert opinion, and one did not clearly report the evidence source. The remaining two (33%) studies did not clearly report whether and how test performance was incorporated in their cost-effectiveness analysis.
Some testing strategies included multiple tests, performed in parallel, in sequence, or both. For predictive testing, 23/28 (82%) studies included multiple tests. Of these 23 studies, 13 (57%) included multiple tests performed in parallel and 19 (83%) performed tests in sequence; 12 (52%) of these studies explicitly reported on the relationship between these multiple tests, mostly assuming that mutations were mutually exclusive. In addition, two studies incorporated a correlation between PD-L1 status and other biomarker status in their models. In prognostic testing, only 2/6 (33%) studies included multiple tests. These tests were performed in parallel, and no relationship between test results was discussed. In serial testing, all six (100%) studies included multiple tests. All of these studies included sequential testing, and four (66%) also included strategies with multiple tests performed in parallel. The relationship between these tests was not reported in five (83%) of these studies. The study that did report on the relationship between tests pooled the sensitivity and specificity of all tests at one time point to estimate the combined testing performance.
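To illustrate why the combination rule matters, the sketch below combines the sensitivity and specificity of two tests under an assumed conditional independence of their errors (an assumption for illustration, not something the included studies reported). The two functions correspond to a parallel "either test positive" rule and a sequential confirmatory rule; all performance values are hypothetical.

```python
def parallel_any_positive(se1, sp1, se2, sp2):
    """Strategy positive if either test is positive: sensitivity rises, specificity falls."""
    sensitivity = 1 - (1 - se1) * (1 - se2)  # missed only if both tests miss
    specificity = sp1 * sp2                  # correct negative requires both tests negative
    return sensitivity, specificity

def sequential_confirmatory(se1, sp1, se2, sp2):
    """Second test confirms a positive first test: specificity rises, sensitivity falls."""
    sensitivity = se1 * se2                  # detected only if both tests detect
    specificity = 1 - (1 - sp1) * (1 - sp2)  # false positive requires both tests to err
    return sensitivity, specificity

se_par, sp_par = parallel_any_positive(0.90, 0.95, 0.80, 0.90)
se_seq, sp_seq = sequential_confirmatory(0.90, 0.95, 0.80, 0.90)
```

If test errors are in fact correlated (e.g. both tests fail on the same low-tumor-content samples), these formulas overstate the gain from combining tests, which is precisely why the interdependency assumption should be reported.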
3.2.1.2 Assumptions About the Adherence to the Test Result

Studies that used different evidence sources for test and treatment parameters were required to make assumptions to link a test result to treatment effectiveness. One underlying assumption concerns the extent to which the test result is followed in subsequent treatment decisions. Of the 38 studies using different evidence sources, most assumed in the base-case analysis that clinicians perfectly adhered to the test results when making subsequent treatment decisions (26/28 [93%] for predictive testing, 3/6 [50%] for prognostic testing, 4/4 [100%] for serial testing). However, this assumption was often not explicitly mentioned.
3.2.1.3 Assumptions About the Different Treatment Effects for Different Biomarker Subgroups

A second underlying assumption for linking test and treatment parameters from different sources concerns the treatment effectiveness for subgroups with different test results. For predictive and serial testing, this relates to the difference in treatment effectiveness between patients with true- and false-positive (and true- and false-negative) test results. For predictive testing, the impact of false-positive and false-negative test results was incorporated in 11/28 (39%) studies, while in serial testing this was explicitly addressed in 2/4 (50%) studies. For prognostic testing, a differentiation in treatment effects between prognostic subgroups (low and high risk) would indicate that, besides a prognostic effect, the biomarker also has some predictive effect. Two of six (33%) studies assumed different treatment effects in different prognostic subgroups in the base-case analysis. In addition, one other study stated that they did not assume a different treatment effect in their model, as existing evidence had demonstrated that there was no difference between the subgroups [56].
3.2.1.4 Exploring the Uncertainty

Almost all included studies (42/43), both those using different evidence sources and those using end-to-end sources, conducted sensitivity analyses, including scenario analyses, probabilistic analyses, and one- or two-way sensitivity analyses. Among the studies that included test performance, 17/21 (81%), 5/6 (83%), and 2/4 (50%) studies explored the impact of test performance in predictive, prognostic, and serial testing, respectively. The impact of the cost of testing was explored less often: 15/31 (48%) studies for predictive testing, 4/6 (66%) for prognostic testing, and none of the studies for serial testing. The impact of suboptimal adherence to the test results was explored in sensitivity analyses in 2/31 (6%), 2/6 (33%), and 0/6 (0%) studies for predictive, prognostic, and serial testing, respectively. The uncertainty around different treatment effects for different biomarker subgroups was assessed in sensitivity analyses in 5/31 (16%) studies for predictive testing, 3/6 (50%) for prognostic testing, and 0/6 (0%) for serial testing.
3.2.2 Reported Model Outcomes

3.2.2.1 Long-Term Outcomes (of Test and Subsequent Treatment(s)) and Intermediate Outcomes (of Diagnostic Test Phase)

All included studies reported long-term cost outcomes and clinical outcomes in terms of survival. Besides long-term outcomes, 67% of studies reported intermediate outcomes, i.e. outcomes that provide information on the impact of the test without yet incorporating the effects and costs of subsequent treatments (22/31 [71%] for predictive testing, 4/6 [66%] for prognostic testing, 3/6 [50%] for serial testing). Costs related to the testing procedure only (i.e. costs of testing) were reported in 17/31 [55%], 2/6 [33%], and 0/6 [0%] studies for predictive, prognostic, and serial testing, respectively. Various other intermediate outcomes were reported by 16/31 (52%) studies for predictive testing, 3/6 (50%) studies for prognostic testing, and 3/6 (50%) studies for serial testing. Especially within predictive testing, a wide range of short-term outcomes was identified (Fig. 2).
Fig. 2Reported intermediate outcomes for predictive testing cost-effectiveness analyses. *Accuracy-related outcomes include true positives and negatives, false positives and negatives, suboptimal received treatments (treatments based on false negatives and false positives) and correct treatment decisions. RCT randomized controlled trial
3.2.2.2 Cost-Effectiveness Ratios

Most studies included cost-effectiveness ratios for long-term outcomes (24/31 for predictive testing, 6/6 for prognostic testing, 6/6 for serial testing) (e.g. cost/QALY). Few studies also reported cost-efficiency ratios for intermediate outcomes (i.e. related to the impact of the test only) (5/31 [16%] for predictive testing, 0/6 [0%] for prognostic testing, 0/6 [0%] for serial testing).
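Both types of ratio follow the same incremental logic: the extra cost of a strategy divided by its extra effect, whether the effect is a long-term outcome (QALYs) or an intermediate one (e.g. correctly treated patients). A minimal sketch with hypothetical numbers:

```python
def icer(cost_new, cost_comparator, effect_new, effect_comparator):
    """Incremental cost-effectiveness ratio, e.g. extra cost per QALY gained."""
    return (cost_new - cost_comparator) / (effect_new - effect_comparator)

# Hypothetical strategy costing an extra 20,000 for 0.5 extra QALYs
cost_per_qaly = icer(60_000, 40_000, 2.0, 1.5)
```

The same function applied to intermediate effects (e.g. `effect_*` as the number of diagnostically correct treatment decisions) yields the test-level cost-efficiency ratios reported by the minority of studies above.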
3.3 Lessons Learned and Observations From the Scoping Review

3.3.1 Lessons Learned Regarding Model Assumptions and Uncertainty

Observation 1: Most studies utilized different evidence sources for the input of test and treatment parameters.
The most robust evidence for the clinical utility of a biomarker and the subsequent treatment (decision) can be obtained through RCTs. The double-randomized RCT, in which patients undergo two levels of randomization ((1) an initial randomization to the test, and (2) a subsequent randomization within each arm to the subsequent treatment based on the biomarker result), is the ideal trial design that allows for the evaluation of both the test and the treatment [5]. However, such trials are challenging to perform due to practical and sometimes ethical concerns. This is reflected in our scoping review, where none of the studies used evidence from a double-randomized RCT, and most used different evidence sources for the test and treatment parameters. Of the three studies that used a single source for these parameters, two relied on RWD.
Studies using an end-to-end source for test and treatment often combined these into a single test-treatment parameter, as can be seen in Steuten et al. [51] and Loubière et al. [39]. As a consequence, these studies did not require assumptions to link test outcomes to treatment effects. In addition, if a study uses a single evidence source for both test and treatment parameters, the data are derived from the same population, avoiding bias that can occur when linking multiple sources from potentially different populations. An advantage of using real-world test and treatment data is that it better reflects the real clinical pathway and implicitly includes other relevant testing aspects, such as the timing of testing or test adherence. However, studies using a single source, particularly when this concerns RWD, also have several drawbacks: RWD tends to be more susceptible to bias, comparator data can be more difficult to obtain, and there is less flexibility to evaluate multiple testing strategies or conduct extensive sensitivity analyses.
Using different evidence sources for test and treatment parameters enables more stepwise modeling of all clinical actions and greater flexibility in the analysis, allowing for the evaluation of a broader range of strategies and more sensitivity and scenario analyses. While robust cost-effectiveness analyses can be conducted using multiple data sources, researchers should remain aware of potential pitfalls and implications of linking evidence. The following observations, lessons learned, and recommendations are particularly relevant for studies utilizing different data sources.
During the round table discussion, experts indicated that the use of a different patient population may result in a different test performance and/or treatment efficacy, thereby introducing bias in the cost-effectiveness analysis. Therefore, the first recommendation is that ‘the intended population and biomarker test application in the economic evaluation should align with the evidence sources’ (Recommendation 1, Table 3). To clarify, if the cost effectiveness of a biomarker test X that identifies biomarker Y in a patient population Z is evaluated, test parameters should be informed by evidence in which biomarker test X is used to identify biomarker Y in patient population Z.
Observation 2: Test performance is included in most studies but expressed in different parameters across biomarker applications, and the relationship between multiple tests is not always considered.
Table 3 Proposed recommendations for cost-effectiveness analysis for biomarker tests

When evaluating the cost effectiveness of different testing strategies, the key differences between the tests lie in how well they identify a target (predictive testing), high-risk patients (prognostic testing), or disease recurrence/progression (serial testing). In our scoping review, we found that most studies included a parameter for test performance. In predictive and serial testing, this was primarily incorporated as sensitivity and/or specificity, while in prognostic testing it was incorporated as the difference in recurrence risk between prognostic subgroups. If test performance was not incorporated, we observed that evidence linkage was simplified by assuming 100% test accuracy. For example, Simons et al. compared different testing strategies using only the prevalence of alterations [49, 50]. Omitting test performance from their model may have contributed to the similar identified alteration rates across the compared testing strategies.
Explicitly incorporating test performance allows for a more accurate comparison of testing strategies and their characteristics, as demonstrated in the study by Hofmarcher et al. [36]. They accounted for differences in sensitivity and specificity between biomarker tests, with a notable difference in specificity between PCR (86%) and next-generation sequencing (NGS) (100%). This contributed to improved treatment allocation in the NGS-based testing strategy. To avoid oversimplification when evaluating testing strategies, we propose to ‘explicitly consider test performance in cost-effectiveness analysis’ (Recommendation 2, Table 3). This enhances the comparison of test strategies, enables modeling of downstream consequences of inaccurate or suboptimal test results, and allows for the reporting of intermediate outcomes related to the test.
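The mechanism can be made concrete with a small decision-tree style calculation: given a prevalence and a test's sensitivity and specificity, a cohort splits into true/false positives and negatives, which in turn drives treatment allocation. The specificity values below echo the PCR (86%) versus NGS (100%) comparison described above; the cohort size, prevalence, and sensitivity are assumed for illustration only.

```python
def allocation(n, prevalence, sensitivity, specificity):
    """Split a cohort of n patients into expected TP/FN/FP/TN counts."""
    pos = n * prevalence          # biomarker-positive patients
    neg = n - pos                 # biomarker-negative patients
    tp = pos * sensitivity        # correctly identified, receive targeted therapy
    fp = neg * (1 - specificity)  # misclassified, receive targeted therapy without benefit
    return {"TP": tp, "FN": pos - tp, "FP": fp, "TN": neg - fp}

# Hypothetical cohort: 1000 patients, 15% prevalence, sensitivity assumed equal at 0.95
pcr = allocation(1000, 0.15, 0.95, 0.86)  # nonzero false positives
ngs = allocation(1000, 0.15, 0.95, 1.00)  # no false positives
```

With test performance omitted (i.e. accuracy assumed to be 100%), both strategies would allocate exactly `n * prevalence` patients to targeted therapy, hiding the downstream consequences of misclassification.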
The majority of studies in predictive and serial testing evaluated strategies that involved a combination of tests. When multiple tests are conducted, their results may be interdependent. For predictive testing, most studies dealt with this by assuming that mutations were mutually exclusive. Only one study included a source containing evidence on the likelihood of co-occurrence of multiple targets [41]. In serial testing, little consideration was given to the correlation between outcomes of tests performed in parallel or in sequence, while this is particularly relevant in this context, because follow-up programs often include a variety of tests. Therefore, we recommend to ‘consider the interdependency between different tests at the same or at sequential time points, and explicitly report the underlying assumptions’ (Recommendation 3, Table 3).
Observation 3: Most studies that included the test performance analyzed its impact through sensitivity analyses, whereas only approximately half of the studies varied the cost of testing.
In the studies that performed sensitivity analysis for either or both test performance and test costs, the influence of these parameters seemed to vary between clinical applications and patient populations. In predictive testing, test performance and costs were often not among the most influential factors, as test costs were typically overshadowed by expensive (targeted) treatments. In prognostic testing, varying test performance had limited impact in the studies. Two studies demonstrated that varying the costs of testing impacted their conclusions, changing the preferred strategy [57, 58]. In serial testing, the study from Wanis et al. showed that varying the test performance affected the preferred testing strategy [66]. The impact of test costs was not explored in any of the studies on serial testing, despite the fact that this impact is multiplied over time in this biomarker application due to the repetitive nature of testing.
The evolving landscape of biomarker applications can result in advances in technologies improving the test performance and decreasing costs over time. Performing sensitivity analyses for these parameters is therefore highly informative, guiding future research and further test development. Several examples illustrated how these analyses contributed to the robustness of cost-effectiveness analyses. Therefore, we propose to ‘explore the impact of specifically the test costs and the test performance in sensitivity analyses’ (Recommendation 4, Table 3).
Observation 4: Most studies assumed a perfect adherence between test results and subsequent clinical decisions.
In our scoping review, we observed that the impact of suboptimal adherence to the test result was considered in only a minority of studies, in either the base-case or sensitivity analysis, while it can significantly affect the results of cost-effectiveness studies. To illustrate, in the study by Jongeneel et al., the preferred strategy changed in the sensitivity analysis that assumed real-world adherence instead of perfect adherence [59]. Data for a sensitivity analysis reflecting real-world adherence cannot be obtained from RCTs and were therefore generally obtained from RWD sources [59, 60] or expert opinion [48]. Considering that clinical practice does not perfectly adhere to guidelines and/or test results, we propose to 'explore the impact of (suboptimal) adherence to the test results through sensitivity analyses' (Recommendation 5, Table 3).
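A one-way sensitivity analysis on adherence can be sketched as follows: with probability `adherence` a positive test result is acted on with targeted treatment, otherwise the default treatment is given. All QALY values and probabilities below are hypothetical.

```python
def expected_qalys(adherence, p_positive, qaly_targeted, qaly_default):
    """Expected QALYs per tested patient when positive results are acted on
    with probability `adherence`; test-negative patients receive default care."""
    followed = p_positive * adherence * qaly_targeted        # result followed
    ignored = p_positive * (1 - adherence) * qaly_default    # result not followed
    negatives = (1 - p_positive) * qaly_default              # negative result
    return followed + ignored + negatives

perfect = expected_qalys(1.0, 0.3, 2.0, 1.5)     # base case: perfect adherence
real_world = expected_qalys(0.8, 0.3, 2.0, 1.5)  # scenario: 80% adherence
```

Sweeping `adherence` from 1.0 down toward observed real-world values shows how quickly the expected benefit of a testing strategy erodes, and at what point a competing strategy would become preferred.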
Observation 5: A minority of the included studies explored different treatment effects in different biomarker subgroups.
In the scoping review, we observed that studies using different sources for test and treatment parameters did not always explicitly consider that different biomarker subgroups can respond differently to the same treatment. For predictive and serial testing, this implies that patients with true- or false-positive (or negative) test results may exhibit a different response to the same treatment. Within predictive testing, evidence to inform the false-positive biomarker subgroup was often lacking, which multiple studies addressed by assuming a treatment effect equal to best supportive care in these patients. One study informed the effectiveness of treatment in false-positive patients based on an RCT that evaluated targeted treatment in both wild-type and mutation-positive patients [29]. In this study, false-positive patients were assumed to have the treatment effect observed in wild-type patients; these misclassified patients had a substantial impact on the overall survival (OS) and led to high additional costs. The application of adjusted treatment responses in serial testing can be illustrated by the work of Wanis et al., where false negatives led to missed diagnoses and delayed detections [66]. In addition, false positives led to extra costs for diagnostic workup. Conversely, Gazelle et al. included test sensitivity, but not specificity, in their analysis, which limited their ability to account for the effects of false positives [63].
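The linking approach described for [29] can be sketched as a weighted outcome calculation in which each test-result subgroup carries its own treatment effect. The subgroup sizes and survival figures below are hypothetical and not taken from that study; the key assumption, flagged in the comments, is that false positives receive only the effect observed in wild-type patients.

```python
def mean_overall_survival(counts, os_by_group):
    """Cohort mean overall survival (years), weighted by subgroup size."""
    total = sum(counts.values())
    return sum(counts[g] * os_by_group[g] for g in counts) / total

counts = {"TP": 140, "FP": 120, "TN": 700, "FN": 40}  # hypothetical cohort
os_years = {
    "TP": 3.0,  # mutation-positive, targeted therapy
    "FP": 1.2,  # wild-type given targeted therapy: wild-type effect assumed
    "TN": 1.5,  # wild-type, standard care
    "FN": 1.5,  # mutation-positive but missed: standard-care effect
}
cohort_os = mean_overall_survival(counts, os_years)
```

Without test performance in the model, the FP and FN rows cannot exist, so the cohort outcome collapses to a prevalence-weighted average of only two subgroups; this is why incorporating subgroup-specific treatment effects presupposes Recommendation 2.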
Prognostic biomarker tests stratify patients into subgroups by differentiating between high and low risk for recurrence. Two studies acknowledged that high-quality evidence informing the effectiveness of treatments in differentiated prognostic subgroups was not (yet) available for their prognostic biomarker tests [55, 60]. They both emphasize the role of prospective trials to examine whether a prognostic biomarker also has predictive value, indicating a different treatment effect in low- and high-risk subgroups. When such trials have not yet been performed, it can be worthwhile to explore the impact of a potential predictive value of prognostic biomarkers. These two studies explored the scenario in which high-risk patients responded better to treatment compared with low-risk patients [55, 60], which is beneficial for the prognostic biomarker of interest. On the other hand, Jongeneel et al. explored the impact of the biomarker-identified high-risk group being resistant to treatment [58]. This sensitivity analysis showed that an alternative testing strategy would be preferred in this situation. Therefore, we propose ‘to consider potential differences in treatment effects for different biomarker subgroups’ (Recommendation 6, Table 3). Note that the inclusion of test performance is a requirement for studies to incorporate these different treatment effects, as otherwise the biomarker subgroups cannot be differentiated.
3.3.2 Lessons Learned Regarding the Reported Outcomes

Observation 6: 67% of included studies reported intermediate outcomes.
Model-based cost-effectiveness analyses can provide long-term outcomes such as total costs, life-years, or QALYs, which are often seen as some of the most important outcomes for decision makers. However, included studies across biomarker applications that reported only these long-term outcomes provided limited insight into the underlying mechanisms driving them. Wolff et al. reported both long-term and intermediate outcomes, demonstrating that the intermediate outcomes revealed complementary insights [53]. While the long-term outcomes indicated a modest health benefit at higher costs, the intermediate outcomes showed a substantial increase in the number of patients receiving a diagnostically correct treatment, along with reductions in turnaround time, test costs, and the number of unsuccessful tests. Thus, while demonstrating that the increase in costs was driven by treatment costs only, they highlighted the importance of reporting intermediate outcomes to better understand the mechanisms at play.
In the scoping review, a variety of intermediate clinical outcomes were identified. We classified them into three distinct types of outcomes (performance, efficiency, opportunity) and complemented the identified outcomes with suggestions from the experts (Table 4). Intermediate outcomes related to test performance and costs of testing can and should always be reported, but the specific outcomes depend on the biomarker application and the input parameters. For example, quantifying the number of false test results, which shows the impact of the (in)accuracy of a test, is only possible when sensitivity and specificity are included. Furthermore, depending on the aim of the cost-effectiveness analysis and the clinical setting, other intermediate outcomes can be relevant to report, such as the efficiency of laboratory procedures or new opportunities (e.g. clinical trial enrollment). To illustrate, four studies on predictive testing reported intermediate outcomes related to time (e.g. turnaround time of the test or time to treatment) [25, 27, 30, 53]. This can be particularly informative and relevant for institutional decision makers when they have to deal with time or capacity constraints. Note that modelers should report not only positive intermediate outcomes but negative ones as well. To provide a better understanding of the mechanisms that play a role in the cost effectiveness