Data-driven approaches are transforming healthcare, yet the acquisition of comprehensive datasets is hindered by high costs, privacy regulations, and ethical concerns. Synthetic data, artificially generated datasets that mimic the statistical properties of real-world data, offers a promising solution to these challenges. Despite its growing adoption, the thematic landscape of synthetic data research in healthcare remains underexplored. We therefore applied structural topic modeling (STM) to map this landscape, identifying prevalent topics and tracking their evolution over time and across geographic regions.
PubMed publications from 2000-2024 containing "synthetic data," "artificial data," or "simulated data" in the title or abstract were retrieved. After text preprocessing (lowercasing, punctuation and stopword removal, stemming), STM was performed with year and continent as covariates. The number of topics (K=10) was selected based on held-out likelihood and interpretability. Topic prevalence, temporal trends, and inter-topic correlations were analyzed using stacked area charts and network analysis.
Analysis of 14,788 PubMed articles (2000-2024) revealed a tenfold increase in publications over the study period. North America (48.6%) and Europe (33.5%) were the primary contributors, but Asia's share rose steadily from 2.9% to 23.1%. STM identified ten key topics, grouped into four themes: Biomedical Imaging & Signal Processing (25.2%), Synthetic Data Applications in Biomedical Research (17.7%), Computational & Statistical Methods (23.9%), and Genomics & Evolutionary Biology (33.2%). Initially prominent topics declined gradually, including “Bayesian Modelling” (23.1% to 9.9%), “Neuroimaging” (16.0% to 9.3%), and “Image Simulation” (17.7% to 9.1%), while “Synthetic Data Generation” (2.2% to 27.1%) and “Disease Modeling and Public Health” (4.8% to 11.9%) rose to prominence by 2024.
Synthetic data research in healthcare has attracted growing interest, marked by shifts in geographic distribution and a dynamic evolution of key topics. Realizing the full potential of synthetic data requires fostering cross-disciplinary collaborations, implementing bias mitigation strategies, and establishing equitable partnerships.
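The preprocessing steps named in the abstract (lowercasing, punctuation and stopword removal, stemming) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the stopword list is a tiny hypothetical sample, and `crude_stem` is a simple suffix stripper standing in for a real stemmer such as Porter's. The function names `crude_stem` and `preprocess` are assumptions introduced only for illustration.

```python
import string

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "with"}

# Suffixes tried in order; this is a crude stand-in for a real stemmer (e.g. Porter).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(token: str) -> str:
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, drop stopwords, then stem."""
    lowered = text.lower()
    # Map every ASCII punctuation character to a space so hyphenated words split.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = lowered.translate(table).split()
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    sample = "Synthetic data mimics the statistical properties of real-world datasets."
    print(preprocess(sample))
```

The resulting token lists would then be turned into a document-term matrix for topic modeling. The crude stemmer over-strips some words (for example, "properties" becomes "properti"), which is typical of stemming in general and is usually acceptable for topic models.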
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
The author(s) received no specific funding for this work.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Not Applicable
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Ethical approval was not required for this study as it did not involve human or animal participants.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Not Applicable
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Not Applicable
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Not Applicable
Data Availability
We retrieved publicly available metadata from PubMed using a structured search for "synthetic data," "artificial data," and "simulated data" in titles and abstracts. The search strategy is detailed in the Methods section.
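As a rough illustration of what such a structured search might look like, the sketch below assembles a PubMed query string from the three phrases and the 2000-2024 range stated in the abstract. The exact query the authors used is given in their Methods section and may differ; the helper name `build_pubmed_query` and the choice of the `[Title/Abstract]` and `[dp]` field tags are assumptions made here.

```python
def build_pubmed_query(phrases, start_year, end_year):
    """Combine phrases with OR on title/abstract and restrict by publication year.

    Hypothetical helper: it only builds the query string and performs no network call.
    """
    phrase_clause = " OR ".join(f'"{p}"[Title/Abstract]' for p in phrases)
    return f"({phrase_clause}) AND {start_year}:{end_year}[dp]"


if __name__ == "__main__":
    query = build_pubmed_query(
        ["synthetic data", "artificial data", "simulated data"], 2000, 2024
    )
    print(query)
```

The string could be pasted into the PubMed search box or passed to a client for the NCBI E-utilities; quoting each phrase keeps PubMed from splitting it into separate terms.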