We examined the methodological rigour of trials evaluating apps for depression or anxiety and whether quality has improved over time. We included 176 RCTs conducted between 2011 and 2023 that were identified in the most recent review.2 We examined the association between publication year and 20 facets of study quality, encompassing indicators of risk of bias, participant diversity, study design features and app accessibility measures. In our primary analyses, three statistically significant associations were found, indicating an increase in trial preregistration and in the reporting of adverse events, and a decrease in studies reporting that the app under study is available for iOS and/or Android. In sensitivity analyses that removed three early high-quality trials, there was also evidence that the use of modern methods to handle missing data, and low risk of bias from using intention-to-treat (ITT) rather than completer analyses, were increasing. Overall, the findings provide only limited evidence of improvements in the quality of clinical trials of mental health apps, confined to some domains, reinforcing concerns about the state of the evidence in this field and highlighting an urgent need for change.5 11 27 28
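To make the kind of association analysis described above concrete, the sketch below shows one common way such a trend can be estimated: a logistic regression of a binary quality indicator (here, preregistration) on publication year, summarised as an odds ratio per additional year. The simulated data, variable names and modelling choice are illustrative assumptions only and do not reproduce the review's actual analysis.

```python
# Illustrative sketch (not the review's exact analysis): logistic regression of a
# binary trial quality indicator (e.g., preregistered yes/no) on publication year.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_trials = 176

# Hypothetical data: publication year (2011-2023) and whether each trial was preregistered.
year = rng.integers(2011, 2024, n_trials)
p = 1 / (1 + np.exp(-0.15 * (year - 2017)))   # simulated upward trend in preregistration
prereg = rng.binomial(1, p)

df = pd.DataFrame({"year": year, "prereg": prereg})

# Fit the model and report the odds ratio per additional publication year.
fit = smf.logit("prereg ~ year", data=df).fit(disp=False)
print("OR per year:", np.exp(fit.params["year"]), "p =", fit.pvalues["year"])
```

An odds ratio above 1 for `year` would correspond to the kind of increase over time reported for preregistration and adverse event reporting; in practice, one such model would be fitted per quality indicator.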
It is promising to observe an increase in the preregistration of clinical trials of mental health apps. Preregistration is crucial for helping to mitigate biases and selective reporting, ultimately enhancing the reliability, replicability and trustworthiness of findings regarding the clinical benefits of mental health apps. This trend also aligns with broader patterns of increased preregistration observed across medicine and psychology.29 30 We suspect that there may be a few reasons for this shift. Researchers may have become more aware of and receptive to open science principles, possibly driven by the replication crisis observed across numerous scientific fields.31 32 Alternatively, the increase could be driven by the growing number of high-impact scientific journals that now mandate trial preregistration as a condition of publication, or by pressures from ethical, institutional or funding bodies encouraging preregistration to ensure the robustness and rigour of the research they support.31 Whatever the reason, this is an encouraging trend that should enhance the transparency and reproducibility of clinical trials of mental health apps.
We found evidence of an increase in the reporting of adverse events in trials of mental health apps. This finding aligns with recent calls to prioritise the documentation of risks and harms in trials of digital health therapeutics.15 33 34 It is also encouraging, as reporting adverse events is critical for ensuring the safety of these tools and for upholding ethical standards by fostering transparency and supporting informed decision-making by researchers, clinicians and patients. It is now important for future research to understand the mechanisms behind possible adverse events, determining whether they are directly attributable to the functionality or content of the app or to the use of a digital device in general, or whether they are influenced by other participant or contextual factors.15
After excluding three early high-quality studies, we found evidence that trials are handling missing data more appropriately, that is, using modern missing data methods (eg, maximum likelihood, multiple imputation) and conducting ITT (rather than completer) analyses. Given that attrition is common in app RCTs,9 appropriately handling missing data is important for drawing reliable conclusions. Ideally, future studies will manage to reduce study attrition (a feature that has not shown improvement over time) as well as use modern methods for exploring the potential influence of data missing not at random (eg, pattern mixture modelling).35
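As a hedged illustration of what such modern missing data methods can look like in practice, the sketch below applies multiple imputation by chained equations (MICE) with statsmodels to a simulated two-arm trial in which roughly 30% of outcome scores are missing. All variable names, effect sizes and modelling choices are hypothetical and are not drawn from any trial in the review.

```python
# Minimal sketch of multiple imputation for a trial outcome with missing data.
# Data, variable names and model are hypothetical; this is not any specific trial's analysis.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

rng = np.random.default_rng(0)
n = 200

# Simulated trial data: treatment arm, baseline symptom score and an outcome with ~30% dropout.
arm = rng.integers(0, 2, n)                        # 0 = control, 1 = app
baseline = rng.normal(15, 4, n)                    # baseline symptom score
outcome = baseline - 3 * arm + rng.normal(0, 3, n)
outcome[rng.random(n) < 0.3] = np.nan              # missing outcomes due to dropout

df = pd.DataFrame({"arm": arm, "baseline": baseline, "outcome": outcome})

# Multiple imputation by chained equations, with estimates pooled across
# imputed datasets (Rubin's rules are applied when fitting).
imp = mice.MICEData(df)
model = mice.MICE("outcome ~ arm + baseline", sm.OLS, imp)
results = model.fit(n_burnin=10, n_imputations=20)
print(results.summary())                           # pooled treatment effect for 'arm'
```

Because every randomised participant contributes to the pooled estimate through imputation rather than being dropped, an analysis of this kind follows the ITT principle instead of being restricted to completers.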
Notwithstanding positive trends in preregistration, reporting of adverse events, use of modern missing data methods and use of ITT analyses, other methodological features that characterise high-quality trials have not shown evidence of increased adoption, despite ongoing calls for their implementation. For example, there have been repeated efforts to highlight persistent issues related to the (1) excessive use of waitlist controls, (2) underpowered pilot nature of many available trials, (3) high attrition rates that compromise interpretation of study findings and (4) lack of longer follow-up assessments.1 11 17 These shortcomings are likely influenced by a combination of factors. One contributing factor may be increased publication pressure, where scientists are incentivised to prioritise quantity over quality. This environment can disincentivise the conduct of larger, more time-intensive trials that better meet the standards of high-quality research. Another factor may relate to budgetary and funding constraints; implementing features that characterise higher-quality trials requires additional resources, including money and personnel (eg, more participants, resources for active controls, participant reimbursement), which may not always be feasible, especially for smaller research teams or those operating in underfunded environments. Likewise, the budget and resources necessary to build and maintain functional digital mental health systems may be greater than most investigators realise.36 Alternatively, the heavy focus on developing and trialling new apps (rather than working with existing apps that can be customised) may encourage more exploratory pilot testing, as is typically recommended in established frameworks that set out the phases of intervention evaluation.37 If this is the case, it is possible that apps that demonstrate feasibility will be subject to more rigorous evaluation in large-scale, confirmatory trials in future.
Another concerning finding was the lack of increased replication efforts. With the exception of a few commercially available apps, such as Headspace and PTSD Coach, few of the identified apps have been tested for efficacy across multiple settings and participant groups. This is likely because many apps tested in clinical trials are developed for research purposes and are not commercially available for broader use or independent validation by other research teams. Furthermore, the field and funders may prioritise the creation of entirely new apps over refining and augmenting existing ones with a promising evidence base. Establishing publication and funding standards that prioritise replication is essential to advancing the field. Replication efforts would strengthen trust in these tools and help identify the specific conditions under which apps are most safe and effective, paving the way for a more personalised approach to mental healthcare.
The current findings must be interpreted within the context of their limitations. First, findings regarding the lack of improvement in methodological rigour in clinical trials of apps for depression and anxiety cannot be generalised to other psychiatric conditions. There is some evidence that trials of digital health technologies in patients with schizophrenia are conducted with greater rigour, including more oversight and risk assessments, given heightened concerns about potential adverse events.15 It would be useful to investigate whether the methodological quality of digital health trials for other psychiatric conditions has improved over time. Second, our analyses were based solely on the information provided in the published reports, so it is possible that certain design features were implemented (eg, participant payment, iOS and/or Android compatibility) but not explicitly reported. Third, although we analysed a large number of trial quality features, there are potentially many more relevant design features that we did not consider and that could be increasing over time (eg, Consolidated Standards of Reporting Trials (CONSORT) compliance, conflict of interest declarations, adherence monitoring efforts, intervention fidelity measures). Fourth, our findings reflect trends in trial quality within this growing literature and do not speak to the quality of individual trials. It is important to note that, despite the overall trends, the increased publication of trials in this field has also given rise to a number of high-quality individual studies. Arguably, these high-quality trials can be examined individually and in aggregate (eg, via meta-analysis) to make reliable inferences. Fifth, we used publication year as a proxy for trial timing, which may not always reflect the year the trial was actually initiated. However, this decision was necessary to ensure consistency across studies, as many trials did not report their start date or have a preregistered protocol.
Another limitation of our study is that we did not include a measure of engagement as an indicator of trial quality, instead opting for study attrition (a related but distinct construct38). This decision was driven by the inconsistent reporting and definition of engagement across trials, which makes meaningful between-study comparisons difficult and would have left a substantial degree of missing data. The lack of standardised reporting on engagement is a well-recognised issue in the field and continues to hinder progress.39 40 New methods such as digital phenotyping may help provide objective data on screen use. Recent pilot research41 suggests that the correlation between engagement as measured by digital phenotyping and as measured by self-report scales may be minimal, suggesting that each method captures unique information. New unified frameworks to assess engagement will also be critical.42 Beyond measurement, it is equally important for future research to focus on improving engagement with digital interventions, as greater engagement may lead to stronger clinical benefits.43 Emerging design strategies such as digital navigator coaching,44 gamification principles,45 tailored email prompts46 and just-in-time intervention strategies47 show promise for increasing engagement but require rigorous evaluation in large-scale randomised trials.