Efficient statistical analysis of trial designs: win ratio and related approaches for composite outcomes

Both matched and unmatched analyses have been described for evaluating composite outcomes with win ratio methodology. With the matched approach, each individual in the treatment group is paired with a single individual in the control group according to their underlying risk of two or more individual outcomes (i.e., a composite outcome). As with any matching technique, this methodology generally increases the statistical power of the test by making comparisons between participants with similar risks (Pocock et al. 2012). However, one drawback of this method is that not all individuals can be matched, and consequently a variable number of observations must be removed from the analysis. Hence, the matched win ratio approach has not gained wide acceptance among researchers (Redfors et al. 2020), and its use is generally not recommended for clinical interventional trials, although it may be useful in observational studies (Multiple endpoints in clinical trials: guidance for industry 2022).

In contrast, with the unmatched approach every individual in the treated group is compared with every individual in the control group for the hierarchical composite endpoint, thus including every participant in the analysis. The methodology proposed by Finkelstein and Schoenfeld was later used by Pocock to develop the unmatched approach, which was introduced in 2012 as the win ratio (Pocock et al. 2012). For the purposes of the present review, we refer to the unmatched win ratio as WR. The WR can be interpreted as an extension of the Wilcoxon–Mann–Whitney test for single continuous outcomes to a more generalized test that accommodates different types of outcomes with missing data, and provides a measure of effect with its confidence interval (Pocock et al. 2012, Redfors et al. 2020, Pocock et al. 2023). The key advantage of the WR approach is that priority is given to the most important endpoint (e.g., death) rather than to the first event to occur (Pocock et al. 2012).

To illustrate the rationale behind the WR, we provide an example from the literature. In a randomized multicenter clinical trial, Tavares et al. evaluated the efficacy of dapagliflozin, an SGLT-2 inhibitor, in improving a composite outcome involving (1) mortality, (2) use of continuous renal replacement therapy (CRRT), and (3) length of stay (LOS) in critical care patients. A total of 507 patients (admitted with acute organ dysfunction to 22 different critical care units in Brazil) were randomized to receive 10 mg of dapagliflozin along with the standard of care (n = 248), or standard of care alone (n = 259) (Tavares et al. 2024). In analyzing the primary composite outcome with the WR, Tavares et al. reported a total of 27,143 wins and 26,929 losses from a total of 64,232 (248 × 259) pairwise comparisons, yielding a WR statistic of 27,143/26,929 = 1.01 (95% CI 0.90 to 1.13, p-value = 0.89). Therefore, the authors concluded that the addition of dapagliflozin to the standard of care of critical care patients with acute organ dysfunction did not improve the proposed clinical outcomes, after accounting for their clinical relevance (Tavares et al. 2024).

A worked example

In this section, we develop a worked example adapted from the clinical trial conducted by Tavares and colleagues (Tavares et al. 2024), with the aim of illustrating the methodology used by unmatched WR to prioritize outcomes in the setting of trial designs with composite outcomes.

For simplicity, assume that in the clinical trial described above only 8 patients were assigned to receive dapagliflozin (group = 1), whereas 12 patients did not receive this treatment (group = 0). We compare 8 against 12 patients to emphasize that the numbers of participants allocated to the treatment and control groups do not need to be identical to conduct a WR analysis.

Figure 1 and the supplementary information illustrate the methodology used in this example to compare the outcomes mortality, CRRT, and LOS among 20 patients. A total of 96 pairwise comparisons are made between each patient allocated to the treatment group (dapagliflozin plus standard of care) and all patients allocated to the control group (standard of care alone). A total of 45 wins, 49 losses, and 2 ties were obtained with respect to mortality, CRRT, and LOS. The details for the statistical analysis of composite outcomes with WR are provided in Figs. 1, 2, 3, and 4 of the Supplementary information.

Fig. 1 Unmatched win ratio analysis (WR) and win difference (WD) for 20 randomized participants allocated to the treatment group (dapagliflozin plus standard of care) or control group (standard of care alone), with respect to a hierarchical composite outcome encompassing (1) mortality, (2) continuous renal replacement therapy (CRRT), and (3) length of stay in critical care (LOS). Adapted from Pocock et al. 2023

In the example illustrated in Fig. 1, the hierarchical levels for each component of the composite outcome were pre-defined in accordance with the clinical importance of the outcome. Thus, mortality was prioritized over CRRT, and these two outcomes were deemed clinically more important than LOS. This step is critical for the evaluation of composite outcomes, because the results will be driven by the hierarchical arrangement of these variables (Tavares et al. 2024).

The win difference (WD) is used as a measure to compare the treatment with standard of care in individual outcomes, and it represents the absolute difference between the number of “wins” in the treatment group and the control group. A positive WD indicates benefit with the treatment, i.e., a “win”; a negative WD indicates benefit with standard care, i.e., a “loss”; and a WD of 0 indicates no benefit with either the treatment or control, thus representing a “tie”.

The comparison between 8 participants assigned to the treatment group against 12 participants assigned to the control group produced a total of 96 pairwise comparisons, as depicted in Fig. 1. For every discordant comparison (i.e., when the outcome differed between the participant assigned to the treatment group and the one assigned to the control group), a win is assigned to the treatment group if the treated individual showed a better outcome. Otherwise, it is counted as a loss. For example, when comparing the CRRT status between two participants (one allocated to the treatment group, and one allocated to the control group), there are four possible scenarios: only the treated patient required CRRT, only the control patient required CRRT, both patients required CRRT, or neither required CRRT. For the first two cases, the pairwise comparisons would be regarded as loss and win (i.e., discordant pairwise comparisons), respectively; the last two cases are regarded as concordant pairwise comparisons. These tied results are then carried forward for the evaluation of the next outcome, in this case LOS, and the procedure continues until all outcomes are exhausted. At this stage, the residual concordant pairwise comparisons are regarded as ties.
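The pairwise procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by Tavares et al. or by any of the statistical packages: the patient records are invented (they are not the Fig. 1 data), every outcome is coded so that a lower value is better, and the function names are hypothetical.

```python
from itertools import product

# Each patient is a tuple (died, crrt, los_days), listed in order of priority.
# Invented records for illustration only; lower is assumed better on all three.
treated = [(0, 0, 5), (0, 1, 9), (1, 0, 3)]
control = [(0, 0, 7), (1, 1, 4), (0, 1, 9), (0, 0, 5)]


def compare_pair(t, c):
    """Return 'win', 'loss' or 'tie' for treated patient t against control c."""
    for t_value, c_value in zip(t, c):  # walk the hierarchy in priority order
        if t_value < c_value:
            return "win"
        if t_value > c_value:
            return "loss"
        # Concordant at this level: carry the pair forward to the next outcome.
    return "tie"  # still concordant after the last outcome


def count_outcomes(treated, control):
    """Compare every treated patient with every control patient."""
    counts = {"win": 0, "loss": 0, "tie": 0}
    for t, c in product(treated, control):
        counts[compare_pair(t, c)] += 1
    return counts


counts = count_outcomes(treated, control)
print(counts)
```

Because each pair is decided at the first level where the two patients differ, a lower-priority outcome such as LOS only matters for pairs that are concordant on mortality and CRRT.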

According to this example, when comparing mortality, there were 18 wins and 18 losses; among the pairs tied on mortality, CRRT yielded 11 wins and 18 losses; and among the pairs still tied after mortality and CRRT, LOS yielded 16 wins and 13 losses. Consequently, the total number of wins and losses for a composite outcome involving mortality, CRRT and LOS was 45 and 49, respectively, whereas there were 2 final ties after all pairwise comparisons were analyzed.
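As a consistency check on these figures, the tally can be reproduced level by level. The short sketch below assumes that the wins and losses quoted for each outcome are those decided at that level of the hierarchy, with the remaining pairs carried forward.

```python
total_pairs = 8 * 12  # every treated patient against every control patient

# (outcome, wins, losses) decided at each level, as quoted in the text
levels = [("mortality", 18, 18), ("CRRT", 11, 18), ("LOS", 16, 13)]

remaining = total_pairs
total_wins = 0
total_losses = 0
for outcome, wins, losses in levels:
    total_wins += wins
    total_losses += losses
    remaining -= wins + losses  # pairs still tied move to the next level
    print(f"{outcome}: {wins} wins, {losses} losses, {remaining} pairs still tied")

print(total_wins, total_losses, remaining)
```

Under that assumption, 60 pairs remain tied after mortality, 31 after CRRT, and 2 after LOS, which agrees with the 45 wins, 49 losses and 2 ties reported for the example.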

It follows that the WR statistic, computed as the number of wins divided by the number of losses, is 0.92 (45/49). The interpretation is that, if the treated and control patient differ in the outcome (i.e., a discordant pair), the odds for the treatment group to do better than the control group are 0.92. Equivalently, the odds for the control group to do better than the treatment group are 1.09 (1/0.92). Stated in a different way, the probability that the participant on dapagliflozin wins is 0.92/(1 + 0.92) = 0.48 (Pocock et al. 2012).
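The three quantities in this paragraph follow directly from the counts of wins and losses. The snippet below is a small arithmetic sketch (the variable names are ours) showing that the probability of a win among discordant pairs, wins/(wins + losses), is the same quantity as WR/(1 + WR).

```python
wins, losses = 45, 49

win_ratio = wins / losses                # odds that the treated patient wins
inverse_ratio = losses / wins            # odds that the control patient wins
p_treated_wins = wins / (wins + losses)  # identical to WR / (1 + WR)

print(round(win_ratio, 2), round(inverse_ratio, 2), round(p_treated_wins, 2))
```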

The WR for this small hypothetical example suggests that dapagliflozin might not be of benefit in patients admitted with acute organ dysfunction. As expected, given the small sample size (n = 20), the level of evidence for an improvement of this composite outcome in patients prescribed dapagliflozin was poor (95% CI 0.31 to 2.71, p-value 0.88).

The methodology described above can be employed for combining binary as well as continuous, categorical and ordinal outcome measures. Moreover, one of the key features of the WR approach for clinical trials with composite outcomes is its flexibility to combine other types of outcomes (for example, involving time-to-event data, longitudinal data, or self-reported events), thereby giving the researcher a range of options that can be adapted to the specific requirements of any composite outcome. Further, outcomes can be analyzed from the perspective of the event occurrence (yes or no), the number of events occurred, the time elapsed until the first event occurs (time-to-event analysis), or the severity of events (Redfors et al. 2020).

In recent years, a variety of statistical software packages have been made readily available on the internet for the application of this methodology. See, for example, the WWR and WINS packages developed for the R® programming language (Qiu et al. 2017; Introduction to the R package WINS 2024), the “winratiotest” command developed for Stata® statistical software (Gregson et al. 2023), and the implementation of WR in SAS® statistical software (Dong et al. 2016). Mao et al. have also described a methodology for the calculation of sample size in win ratio analysis (Mao et al. 2022), which has recently been implemented in R® statistical software (Sample size calculation for standard win ratio test 2024).

The stages involved in the analysis of composite outcomes with unmatched WR approach are summarized in Table 2 (Pocock et al. 2012).

Table 2 Six proposed steps for the analysis of composite outcomes using the unmatched WR approach (Pocock et al. 2012)

Pros of WR

Conventional analyses of composite outcomes do not account for the clinical importance of individual components, and therefore the use of alternative methods is warranted and has been proposed. The WR offers an attractive solution to this problem, thereby providing clinicians with a useful metric, which is relatively easy to compute (Pocock et al. 2012).

The key advantage of the WR and other related tests is that outcomes are prioritized in accordance with their clinical impact on individuals. For example, mortality assessed in the WR approach with the highest priority provides more weight to this specific outcome, which contrasts with mortality as an individual outcome component of a conventional composite outcome. Here, a potentially low incidence of postoperative mortality does not add much weight in comparison to higher incidences of other individual components of composite outcome variables. In addition, with this methodology, the sample size required may be smaller to achieve the same statistical power when compared to conventional approaches. This feature has been demonstrated with simulation studies, although it would depend on specific aspects of the trial and the interventions (Redfors et al. 2020; Pocock et al. 2023).

Individual outcome variables of the WR approach may include binary as well as continuous, categorical, and ordinal outcome measures. This has the advantage of facilitating a combination of clinical and patient-centred outcomes, such as organ failure (e.g., acute kidney injury) plus quality of life (e.g., days-alive-and-at-home).

Furthermore, with the advent of computer programs recently developed for a variety of statistical software packages (Qiu et al. 2017; Gregson et al. 2023; Dong et al. 2016), confidence intervals for the WR that allow for the lack of independence between pairwise comparisons, as well as sample size estimates, can be readily obtained.

WR analyses may lead to gains in power, particularly with high patient heterogeneity and low rates of drug discontinuation in pharmacological trials; however, this is not guaranteed (Claggett et al. 2018).

Cons of WR

It is worth noting that the methodology described in Fig. 1 and Table 2 to calculate the WR systematically excludes tied pairwise comparisons. When the number of ties obtained is large, this approach may be seen as problematic, because estimated treatment effects could be overestimated. However, confidence intervals are typically wide (Ajufo et al. 2023). Furthermore, the reported WR may not represent the whole study population (it only involves a sub-population of patients for whom the corresponding pairwise comparison was labeled either as a win or a loss). On the other hand, additional outcome variables as part of the WR, e.g., inclusion of a quality-of-life measure, can then be helpful to clarify the wins as well as incorporating other important patient-centred outcomes and act as a “tiebreaker” (Ajufo et al. 2023).

An alternative metric, known as win odds (WO), has recently been proposed to address the problem of ignoring tied pairwise comparisons (Brunner et al. 2021). The WO is computed by adding one-half of the total number of ties to the numerator and denominator of the WR. In the example summarized in Fig. 1, the WO corresponds to (45 + 1)/(49 + 1). Thus, this quantity remains virtually unchanged as compared to the unmatched WR (45/49), because there were only 2 ties. In fact, in the absence of ties, WO reduces to WR. However, in a study with a higher number of ties, the WO can be substantially different, thus adding complexity to the interpretation.

The situation with a large number of ties can be pictured with the following hypothetical example using simulation: Consider a randomized clinical trial where 50 patients were allocated to the treatment group, and 52 were allocated to the control group. From the resulting 2,600 pairwise comparisons, there were 147 wins, 49 losses, and 2,404 ties. The WR would be computed as 147/49 = 3.0 (95% CI 0.41–21.9, p-value 0.279) (Introduction to the R package WINS 2024), but importantly, 2,404 comparisons would be ignored. Applying the above adjustment with tied comparisons equally allocated to each arm, the resulting score would be (1,202 + 147)/(1,202 + 49) = 1.08 (95% CI 0.93–1.25, p-value 0.324) (Introduction to the R package WINS 2024). Thus, some authors have recommended that WO should be reported in the presence of a high number of ties (Ajufo et al. 2023; Dong et al. 2023).
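The effect of ties on the two metrics can be reproduced from the counts alone. The following sketch computes only the point estimates for the two examples in the text; the confidence intervals and p-values quoted above require a variance estimate and are not reproduced here. The function names are illustrative.

```python
def win_ratio(wins, losses):
    """Wins divided by losses; tied comparisons are ignored."""
    return wins / losses


def win_odds(wins, losses, ties):
    """Win odds: half of the tied comparisons is credited to each arm."""
    half_ties = ties / 2
    return (wins + half_ties) / (losses + half_ties)


# (wins, losses, ties) taken from the two examples in the text
examples = {
    "Fig. 1 example": (45, 49, 2),
    "many-ties example": (147, 49, 2404),
}

for label, (wins, losses, ties) in examples.items():
    wr = win_ratio(wins, losses)
    wo = win_odds(wins, losses, ties)
    print(f"{label}: WR = {wr:.2f}, WO = {wo:.2f}")
```

With only 2 ties the two estimates agree to two decimal places (0.92), whereas with 2,404 ties the WR of 3.00 shrinks to a WO of 1.08.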

One disadvantage of WO is that the interpretation can be less intuitive, as compared to WR (Pocock et al. 2023). In addition, the analysis of composite outcomes with many ties would result in WO that favors the null hypothesis of no benefit of the proposed treatment. In consequence, in the context of non-inferiority trials in particular, the use of WO is generally not recommended (Ajufo et al. 2023).

From a clinical standpoint, the observed differences between pairwise comparisons that ultimately define winners and losers may not necessarily be of practical or clinical relevance. This limitation has led some authors to propose that a winner be declared only if the pairwise difference reaches a clinically relevant size, e.g., a clinically meaningful difference in a quality-of-life scale, or a given amount of troponin release in defining a myocardial infarction. However, this approach would be detrimental to statistical power given the increased number of ties, and therefore the use of margins (i.e., minimum clinically relevant sizes for pairwise differences) has been discouraged by other authors (Redfors et al. 2020).
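The trade-off between clinical relevance and the number of ties can be seen in a small sketch. The quality-of-life scores below are invented, higher scores are assumed to be better, and the rule that a difference must strictly exceed the margin to count is an illustrative choice rather than a published definition.

```python
from itertools import product

# Invented quality-of-life scores (higher is assumed better)
treated_scores = [62, 70, 55]
control_scores = [60, 58, 71]


def compare_with_margin(t_value, c_value, margin=0.0):
    """Declare a winner only if the difference strictly exceeds the margin."""
    difference = t_value - c_value
    if difference > margin:
        return "win"
    if difference < -margin:
        return "loss"
    return "tie"


def count_with_margin(treated, control, margin):
    counts = {"win": 0, "loss": 0, "tie": 0}
    for t_value, c_value in product(treated, control):
        counts[compare_with_margin(t_value, c_value, margin)] += 1
    return counts


no_margin = count_with_margin(treated_scores, control_scores, margin=0)
five_point_margin = count_with_margin(treated_scores, control_scores, margin=5)
print(no_margin)
print(five_point_margin)
```

In this toy data set, requiring a difference of more than 5 points turns five of the nine decided comparisons into ties, which illustrates why margins tend to reduce power.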

Another caveat of WR is that comparisons are often made between individuals that are not necessarily under the same risk of developing the outcome (Pocock et al. 2012; Ajufo et al. 2023). To overcome this problem, data can be stratified according to the variables influencing the risk of the composite outcome, and the stratified WR can be obtained by combining WRs across strata (Pocock et al. 2012). For example, in a randomized clinical trial, researchers evaluated the benefit of empagliflozin (another SGLT-2 inhibitor) as compared with placebo, in patients with heart failure (HF) after initial stabilization (Voors et al. 2022). The clinical benefit was defined by a composite outcome involving mortality, number of HF events, time to first HF event, and a self-reported outcome evaluating quality of life. The analysis was stratified into patients with acute de novo HF and those with decompensated chronic HF. The reported WR for de novo and decompensated HF patients was 1.29 (95% CI 0.89–1.89) and 1.39 (95% CI 1.08–1.81), respectively, and the combined WR was 1.36 (95% CI 1.09–1.68) (Voors et al. 2022).
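To make the idea of combining strata concrete, the sketch below compares patients only within each stratum and then pools the stratum-level counts. The text does not state which weighting was used in the trial; the code assumes Mantel–Haenszel-type weights (each stratum's wins and losses divided by its number of patients), which is one option among several. All counts are hypothetical and are not the data from Voors et al. 2022.

```python
# Hypothetical per-stratum counts (NOT the trial data):
# stratum name -> (wins, losses, number of patients in the stratum)
strata = {
    "de novo HF": (300, 240, 60),
    "decompensated chronic HF": (900, 700, 100),
}


def stratified_win_ratio(strata):
    """Pool strata using weights 1/n (an assumed Mantel-Haenszel-type choice)."""
    weighted_wins = sum(wins / n for wins, _, n in strata.values())
    weighted_losses = sum(losses / n for _, losses, n in strata.values())
    return weighted_wins / weighted_losses


for name, (wins, losses, _) in strata.items():
    print(f"{name}: WR = {wins / losses:.2f}")

combined = stratified_win_ratio(strata)
print(f"stratified WR = {combined:.2f}")
```

Whatever weights are chosen, the combined estimate lies between the stratum-specific WRs, as it does for the values reported in the trial.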

It should be noted that the WR has been conceived as a relative measure of wins and losses. An alternative approach, as defined earlier, is to report the WD on an additive scale, expressed as a percentage (i.e., % of wins minus % of losses). In the example summarized in Fig. 1, the percentage of wins (46.9%) minus the percentage of losses (51.0%) yields a WD of -4.1%. One advantage of this approach is that it provides an absolute measure of treatment benefit, as opposed to the relative measure given by WR. Although the interpretation of WR and WD can be analogous to odds ratio and risk difference respectively, further calculations including number needed to treat (or harm) cannot be immediately inferred, because the WR only includes pairwise comparisons that were not tied (Ajufo et al. 2023). Similarly, in the setting of time-to-event analysis, WRs are comparable to hazard ratios, with the exception that tied comparisons are excluded from the analysis (Pocock et al. 2023; Ajufo et al. 2023).
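The WD for the Fig. 1 example can be recomputed from the counts as follows. Unlike the WR, the denominator includes the tied comparisons. This is a plain arithmetic sketch with our own variable names.

```python
wins, losses, ties = 45, 49, 2
total_pairs = wins + losses + ties  # 96 pairwise comparisons, ties included

pct_wins = 100 * wins / total_pairs
pct_losses = 100 * losses / total_pairs
win_difference = pct_wins - pct_losses  # in percentage points

print(f"{pct_wins:.1f}% wins, {pct_losses:.1f}% losses, WD = {win_difference:.1f}%")
```

Note that the unrounded difference, (45 − 49)/96, is about −4.2%; the figure of −4.1% arises when the two percentages are rounded before subtracting.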

Given the relative novelty of the WR score, particularly outside the area of trials in cardiovascular disease, clinicians may find it challenging to translate the results into clinical practice (Pocock et al. 2012). For example, in the empagliflozin trial (Voors et al. 2022), how can a WR of 1.36 be interpreted? In this clinical trial, the authors reported that the superiority of empagliflozin over placebo was mainly driven by self-reported quality of life outcomes. The differences observed for mortality and HF events, although clinically relevant, did not substantially change the overall WR. However, this information is not conveyed in a single WR score. Therefore, we advocate the use of flow charts, such as the one provided in Fig. 1, when analyzing composite outcomes with the WR approach, to ensure transparency in the results for readers (Pocock et al. 2023).

Lastly, clinicians should be aware that although the WR approach effectively prioritizes outcomes, wins and losses are equally weighted across outcomes of differing clinical importance, which makes this methodology inappropriate in some instances [22]. For example, if mortality has the same weight as hospitalization for HF, the WR will likely be driven by hospitalization events, because they can occur more frequently and earlier during the follow-up, even though more deaths are expected the longer the trial lasts.

Table 3 outlines the main disadvantages of WR and proposes potential solutions to address these issues, including the possibility of applying pre-defined weights to each of the components of the WR.

Table 3 Challenges encountered in the analysis and interpretation of win ratio approach, and proposed solutions to overcome these problems
