A comparison of citation-based clustering and topic modeling for science mapping

Cardiovascular research through the lens of TM

Figure 2 illustrates the research landscape of CVR through the lens of TM. The plot consists of circles that indicate topics. The distance between two circles approximately signifies the degree of relatedness between the corresponding topics, and the size of each circle reflects the overall prevalence of the topic. The horizontal axis (PC1) in Fig. 2 represents the distinction between clinical practice and physiological research. Topics pertaining to clinical trials and surgical procedures appear on the left side, while the right side predominantly contains topics related to physiological research, such as studies of cells and tissues. The vertical axis (PC2) reflects the distinction between diagnosis and therapy, as well as the molecular composition of topics. The top portion of the axis corresponds to diagnostic techniques, gradually transitioning to surgical therapy at the bottom.

Fig. 2

Figure 2 not only depicts topics in CVR but also visualizes the interconnections between topics based on semantic information. We therefore explain the map from both a content-oriented and a structural standpoint. Based on the map, we categorized CVR into three main areas by examining the key terms for each topic and the topic labels (see step 2 in Sect. 3.2.1), supplemented by expert knowledge (Sievert et al., 2014): (1) Physiological Studies represented by the red category, (2) Clinical Studies & Surgical Procedures represented by the green category, and (3) Risk Factors & Diagnosis Techniques represented by the blue category. For example, as shown in Fig. 7 in the Appendix, the term ‘risk factor’ predominantly appears in topics located in the upper-left region of the map. With expert interpretation, these topics are classified into the category Risk Factors.

Within the category of Physiological Studies, we identified nine topics: Regenerative Medicine (T2), Gene Transcription (T6), Oxidative Stress (T9), Angiotensin (T29), Genetic Research (T24), Antiplatelet Therapy (T35), Cell Studies-Cardiomyocytes (T26), Perfusion Re-injury (T37), and Cardiac Electrophysiology-Ion Channels (T38). According to the size of topics, Regenerative Medicine holds the greatest prominence among the various topics within Physiological Studies.

In Clinical Studies & Surgical Procedures, there are ten topics related to the treatment of diseases through various approaches such as clinical medication, interventional methods, or surgical procedures. To elaborate further, T1 focuses on Clinical Guidelines for Medication. T18 specifically addresses the use of Anticoagulant Medication Treatment. Within the realm of surgical procedures, T12 involves the Surgical Treatment of Congenital Heart Disease, while T39 discusses Heart Transplantation. Invasive treatments are covered in Treatment of Arrhythmia (T13), Invasive Treatment of Myocardial Infarction (T15), and Interventional Therapy (T21). Additionally, the topics of Myocardial Ischemia (T17) and Life Assist Device (T30) are included in this category. Taking into account the prevalence of these topics, it is evident that in the category of Clinical Studies & Surgical Procedures researchers are placing significant emphasis on Clinical Guidelines for Medication.

In terms of Risk Factors & Diagnosis Techniques, our focus is on the identification of causal factors and diagnostic indicators associated with symptoms, alongside an exploration of the diagnostic techniques. It is widely acknowledged that cardiovascular conditions stem from a combination of socio-economic, behavioral, and environmental risk factors, which we also observed in our research. The risk factors include Hypertension (T3), Behavioral Risk Factors (T8) such as “diabetes”, “obesity”, “alcohol” and “tobacco use”, Socio-economic Risk Factors (T31) encompassing “life quality”, “anxiety”, “depression”, and “emotions”, and Cholesterol Level (T10). There is also a topic on the Ethnic Background Studies of CVR (T14), which concerns risk factors related to race and gender. Risk factors are also studied for Kidney Disease (T34). Regarding the diagnostic indicators of symptoms, researchers study Biomarkers (T28) and Heart Rate Variability (T32). In addition, the following diagnostic techniques are studied: Electrocardiogram (T33), Coronary Angiography (T25) and Magnetic Resonance Imaging (T16). Based on the prevalence of topics, Hypertension and Behavioral Risk Factors have received the most attention. In other words, research pertaining to risk factors has received considerable emphasis in this field of study.

In addition, the TM map not only provides valuable insights into the research content of CVR studies but also depicts the interconnectedness among the three categories. Physiological Studies exhibits close associations with Risk Factors & Diagnosis Techniques, whereas it demonstrates a more distant relationship with Clinical Studies & Surgical Procedures. This observation suggests a gap between clinical research and physiological studies.

In terms of Physiological Studies, Regenerative Medicine (T2) and Oxidative Stress (T9) exhibit interconnectedness. This is due to the fact that regenerative therapies, such as stem cell therapy, possess the potential to repair and regenerate damaged tissues, thereby mitigating oxidative stress and preventing further cellular damage. Conversely, oxidative stress can also impact the efficacy of regenerative therapies. Therefore, these two topics demonstrate a strong relationship, as their interactions are mutually influential. In Risk Factors & Diagnosis Techniques, there is a close association between Heart Failure (T11), Ethnic Background Studies of CVR (T14), Behavioral Risk Factors (T8) and Hypertension (T3). This connection arises from the fact that nearly all risk factors eventually lead to heart failure. Moreover, certain diagnostic indicators of risk factors necessitate the use of measurement instruments for accurate diagnosis. Consequently, a substantial correlation is observed between risk factors and diagnostic techniques. In Clinical Studies & Surgical Procedures, topics demonstrate significant interrelationships. This can be attributed to the critical role that clinical research plays in the evaluation of the safety and efficacy of surgical procedures. Additionally, clinical studies serve as a crucial foundation for informing decisions about the best course of treatment for patients.

Cardiovascular research through the lens of CC

The CC map of CVR is presented in Fig. 3. Each circle in Fig. 3 indicates a micro-level research area, while the proximity between two circles approximately represents the degree of relatedness based on direct citation links. The color of the circles represents groups of highly related research areas, and the size of the circles reflects the number of publications in a research area.

Fig. 3

Overall, Fig. 3 illustrates that CC reveals three primary categories of research: (1) Physiological Studies represented by the red category, (2) Cardiovascular Diseases & Surgical Procedures & Diagnosis Techniques denoted by the green category, and (3) Risk Factors identified by the blue category.

In Physiological Studies, we identified various levels of physiological investigations, encompassing Cell Level Studies (C27, C37, C57), Gene Level Studies (C16), Hemodynamic Studies (C24, C37, C44, C51, C81), and Ion Channel Level Studies (C56). Additionally, our analysis highlights a concentration of research endeavors in the areas of Enzyme Studies (C24), Peptide Studies (C81), and Protein Studies (C44). Overall, our analysis reveals a strong emphasis on publications pertaining to Cell Level Studies (C27, C37, C57), whereas Gene Level Studies (C16), Hemodynamic Studies (C24, C37, C44, C51, C81) and Ion Channel Level Studies (C56) account for relatively fewer publications.

Within the realm of Cardiovascular Diseases & Surgical Procedures & Diagnosis Techniques, CC identifies four primary groups of cardiovascular diseases with reference to the MeSH tree: Cardiovascular Abnormalities, Cardiovascular Infections, Heart Diseases, and Vascular Diseases. In more detail, Heart Diseases encompass various conditions, including Atrial Fibrillation (C5), Heart Failure (C12), Cardiomyopathies (C49), and Heart Arrest (C26). Vascular Diseases encompass conditions such as Arterial Occlusive Diseases (C95), Aortic Aneurysm (C41), Embolism and Thrombosis (C17), Hypertension (C8), Pulmonary Hypertension (C36), Aneurysm Dissection (C42), Myocardial Ischemia (C9), and Varicose Veins (C59). Additionally, CC identifies two types of surgical procedures, namely Cardiac Surgical Procedures and Vascular Surgical Procedures, exemplified by interventions such as Coronary Artery Bypass Grafting (C68), Heart Valve Prosthesis Implantation (C4), and Percutaneous Coronary Intervention (C82). Moreover, this map highlights Diagnosis Techniques (C20, C21, C73) as well. Based on the aforementioned analysis and the clusters’ size, it is evident that a great number of publications focus on cardiovascular diseases, particularly Heart Failure (C12), Atrial Fibrillation (C5), and Myocardial Ischemia (C9). This reflects the focus of CVR on mainstream disease studies. Furthermore, the CC map reveals a multitude of clusters associated with disease, while surgical procedures and diagnostic techniques are represented by fewer clusters.

Risk Factors encompasses eight risk factors: Hypertension (C8), Mental Health (C45), Climate Change (C77), Alcohol (C66), Diabetes (C6), HIV & AIDS (C55), Nutrition & Diet (C18, C58), and High Lipoprotein (C7). These are important socio-economic, behavioral, and environmental risk factors. There is a strong concentration of publications centered around Hypertension (C8), Diabetes (C6) and High Lipoprotein (C7). Conversely, there is relatively less attention directed towards Climate Change (C77), Mental Health (C45), Alcohol (C66), and Nutrition & Diet (C18, C58). In addition to the aforementioned cardiovascular studies, CC also uncovers some small and specific clusters such as Salty Food Intake (C90), Adiponectin (C51), and Lipid Breakdown (C47).

Figure 3 illustrates that the CC map exhibits a similar relational structure to the TM map for CVR. Specifically, Physiological Studies exhibit a strong association with Risk Factors while displaying a comparatively weaker connection to Cardiovascular Diseases & Surgical Procedures & Diagnosis Techniques. Notably, Risk Factors constitute a distinct and discernible category within the CC map. Also, CC uncovers some specific clusters related to Cardiovascular Diseases, which are closely linked with Surgical Procedures and Diagnosis Techniques.

Hemodynamic Studies (C24, C44, C81), situated in the lower middle of the CC map within Physiological Studies, exhibit limited connections to other physiological studies. However, in the TM map, Hemodynamic Studies demonstrate a close association with Diagnosis Techniques. All clusters within Risk Factors are located at the top right of the CC map. Cardiovascular Diseases & Surgical Procedures & Diagnosis Techniques consist of interconnected clusters, with some clusters focusing on heart diseases (C4, C15, C25, C61, C65, C69, C107), some clusters on arterial disease (C14, C17, C41, C42, C59, C68, C86, C95), and others on venous diseases (C9, C11, C13, C28, C35, C43). This clear delineation of sub-structure highlights the categories within the domain of cardiovascular diseases.

Relations between topics and clusters

We constructed a cluster-to-topic and a topic-to-cluster mapping to further explore the relations between topics and clusters. The cluster-to-topic mapping provides the probability \(P_{c \to t}\) of documents in cluster c belonging to topic t. Conversely, the topic-to-cluster mapping provides the probability \(P_{t \to c}\) of documents in topic t belonging to cluster c. \(P_{c \to t}\) and \(P_{t \to c}\) offer different perspectives on the relatedness of topics and clusters. Considering both enables a more comprehensive assessment of the similarity between topics and clusters.
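As a minimal sketch of how such mappings could be computed, assume that every document has been assigned to exactly one cluster and one dominant topic. This is a simplifying assumption; the text does not state how documents with mixed topic membership are handled. The function name `mapping_probabilities` and the toy assignments are hypothetical.

```python
from collections import Counter


def mapping_probabilities(assignments):
    """Return (p_c_to_t, p_t_to_c) from (cluster, topic) pairs, one pair per document.

    p_c_to_t[(c, t)] is the share of documents in cluster c that belong to topic t.
    p_t_to_c[(t, c)] is the share of documents in topic t that belong to cluster c.
    Assumption: each document has exactly one cluster and one topic.
    """
    assignments = list(assignments)
    pair_counts = Counter(assignments)
    cluster_sizes = Counter(c for c, _ in assignments)
    topic_sizes = Counter(t for _, t in assignments)

    p_c_to_t = {(c, t): n / cluster_sizes[c] for (c, t), n in pair_counts.items()}
    p_t_to_c = {(t, c): n / topic_sizes[t] for (c, t), n in pair_counts.items()}
    return p_c_to_t, p_t_to_c


# Toy data (hypothetical): five documents, two clusters, two topics.
docs = [("C1", "T1"), ("C1", "T1"), ("C1", "T2"), ("C2", "T2"), ("C2", "T2")]
p_c_to_t, p_t_to_c = mapping_probabilities(docs)
```

In this toy example two of the three documents in C1 belong to T1, while both documents in T1 belong to C1, so the two directions give different values for the same cluster-topic pair.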

The cluster-to-topic probability \(P_{c \to t}\) and the topic-to-cluster probability \(P_{t \to c}\) both range from 0 to 1, where values closer to 1 indicate stronger similarity, while a value of 0 implies no similarity at all between a topic and a cluster. It would be extremely challenging to analyze in full detail the similarities between all 40 topics and all 142 clusters. We therefore used a similarity threshold to simplify the investigation of relations between topics and clusters. We consider a topic t and a cluster c to be related if \(P_{c \to t}\) or \(P_{t \to c}\) is greater than a given threshold.
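The threshold rule can be sketched as follows. The function `related_pairs` and the probability values are hypothetical and serve only to illustrate the "either direction exceeds the threshold" criterion.

```python
def related_pairs(p_c_to_t, p_t_to_c, threshold):
    """Return the (cluster, topic) pairs whose similarity exceeds the threshold
    in at least one direction."""
    related = {(c, t) for (c, t), p in p_c_to_t.items() if p > threshold}
    related |= {(c, t) for (t, c), p in p_t_to_c.items() if p > threshold}
    return related


# Hypothetical values, not taken from the study.
toy_c_to_t = {("C1", "T1"): 0.40, ("C1", "T2"): 0.10, ("C2", "T2"): 0.25}
toy_t_to_c = {("T1", "C1"): 0.15, ("T2", "C1"): 0.05, ("T2", "C2"): 0.35}

loose = related_pairs(toy_c_to_t, toy_t_to_c, threshold=0.30)
strict = related_pairs(toy_c_to_t, toy_t_to_c, threshold=0.45)
```

Here C1 and T1 are related through the cluster-to-topic direction, whereas C2 and T2 are related only through the topic-to-cluster direction. No pair survives the stricter threshold.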

We manually reviewed the cluster-to-topic and topic-to-cluster mappings obtained using different thresholds. We used the igraph library in Python to visualize the mappings, as shown in Figs. 4 and 5. The figures provide insights into three distinct categories of relations: one-to-one, one-to-many and many-to-many. For the sake of clarity, unique clusters or topics are not included in the visualizations. One-to-one relations signify a single cluster corresponding to a single topic and vice versa, as demonstrated in panel A of Fig. 1. One-to-many relations refer to a single cluster that is associated with multiple topics, with each topic corresponding to only one cluster, or vice versa, as depicted in panel B of Fig. 1. Many-to-many relations involve several clusters associated with various topics, as shown in panel C of Fig. 1. In Figs. 4 and 5, blue circles represent clusters, while orange circles indicate topics. The similarity between clusters and topics is represented by numerical values.
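One way to label the three relation types programmatically is to treat related pairs as edges of a bipartite graph and inspect each connected component. The sketch below uses a plain-Python traversal rather than igraph, and the function name `classify_relations` is hypothetical. The pairs involving C14, C25, C35 and C8 follow examples mentioned in this section, while `Cx` and `Tx` are placeholders for an arbitrary one-to-one pair.

```python
from collections import defaultdict


def classify_relations(pairs):
    """Group related (cluster, topic) pairs into connected components and label each."""
    neighbours = defaultdict(set)
    for cluster, topic in pairs:
        # Tag node names so that a cluster and a topic can never collide.
        neighbours[("cluster", cluster)].add(("topic", topic))
        neighbours[("topic", topic)].add(("cluster", cluster))

    seen, components = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(neighbours[node] - members)
        seen |= members

        clusters = sorted(name for kind, name in members if kind == "cluster")
        topics = sorted(name for kind, name in members if kind == "topic")
        if len(clusters) == 1 and len(topics) == 1:
            label = "one-to-one"
        elif len(clusters) == 1 or len(topics) == 1:
            label = "one-to-many"
        else:
            label = "many-to-many"
        components.append((label, clusters, topics))
    return components


example_pairs = [
    ("C14", "T16"), ("C14", "T25"), ("C14", "T33"),   # one cluster, three topics
    ("C25", "T30"), ("C35", "T30"), ("C25", "T39"),   # chain of clusters and topics
    ("C8", "T39"), ("C8", "T3"),
    ("Cx", "Tx"),                                     # placeholder one-to-one pair
]
components = classify_relations(example_pairs)
```

Unique clusters or topics never appear in `pairs`, so they are omitted automatically, which mirrors their exclusion from the visualizations.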

Fig. 4 Cluster-to-topic relations for different thresholds

Fig. 5 Topic-to-cluster relations for different thresholds

As evidenced in Fig. 4, when the threshold for the cluster-to-topic mapping is set at 0.50, there are no relations between clusters and topics. With thresholds of 0.45 or 0.40, a single one-to-one relation is obtained between a cluster and a topic. Further reduction of the threshold to 0.35 reveals a combination of one-to-one and one-to-many relations between clusters and topics. Moreover, lowering the threshold to 0.30 or 0.25 uncovers additional one-to-one and one-to-many relations. When the threshold is set at 0.20, a diverse pattern emerges, including one-to-one, one-to-many and many-to-many relations. Notably, by lowering the threshold to 0.15, two larger many-to-many groups are formed. Reducing the threshold even more results in a higher density of connections between topics and clusters, predominantly characterized by many-to-many relations.

We now turn to the topic-to-cluster mapping. As evidenced in Fig. 5, there are no relations between topics and clusters when the threshold is set to 0.50. With a threshold set at 0.45 or 0.40, a single topic is linked to a single cluster. As the threshold is further lowered to 0.35, 0.30, 0.25 or 0.20, there are three, five, eight, or ten pairs of a topic and a cluster, respectively, all exhibiting a one-to-one relation. When the threshold is set to 0.15, a mix of one-to-one, one-to-many, and many-to-many relations is obtained. Setting the threshold to 0.10 leads to a further increase in the relatedness of topics and clusters. Finally, when the threshold is reduced to 0.05, one large many-to-many group emerges.
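The effect of lowering the threshold can be illustrated with a small sweep that counts how many pairs exceed each value. The helper `sweep_thresholds` and the probabilities are hypothetical; they do not reproduce the counts reported for Figs. 4 and 5.

```python
def sweep_thresholds(probabilities, thresholds):
    """Count, for each threshold, how many pairs have a probability above it."""
    return {th: sum(1 for p in probabilities.values() if p > th) for th in thresholds}


# Hypothetical mapping values for five cluster-topic pairs.
toy_mapping = {
    ("C1", "T1"): 0.46,
    ("C2", "T2"): 0.37,
    ("C2", "T3"): 0.22,
    ("C3", "T3"): 0.21,
    ("C4", "T1"): 0.08,
}
counts = sweep_thresholds(toy_mapping, [0.50, 0.35, 0.20, 0.05])
```

In this toy mapping no pair survives at 0.50, and the number of related pairs grows as the threshold drops. Once C2 and T3 each acquire a second link, the pairs start to merge into larger groups, which is the same qualitative pattern described above.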

Figures 4 and 5 reveal a notable absence of strongly related topics and clusters. Only in a few exceptional cases do more than one-third of the documents in a topic pertain to the same cluster, or vice versa. Consequently, relations between topics and clusters are generally relatively weak. In most cases, the overlap of documents between topics and clusters is less than 20%.

To gain deeper insights into the nature of the relations between topics and clusters, we focus on relations that surpass specific thresholds: we consider all relations for which the cluster-to-topic probability satisfies \(P_{c \to t} \ge 0.2\) or the topic-to-cluster probability satisfies \(P_{t \to c} \ge 0.1\). At these thresholds, the data reveal the different types of relations (one-to-one, one-to-many, many-to-many) in a reasonably balanced way (as illustrated in Figs. 4 and 5). We select a higher threshold for \(P_{c \to t}\) than for \(P_{t \to c}\) because on average the number of documents in a topic is larger than the number of documents in a cluster. Consequently, values of \(P_{c \to t}\) can be expected to be greater than values of \(P_{t \to c}\).
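A short calculation makes the asymmetry concrete. It assumes that each document belongs to exactly one of the T = 40 topics and one of the C = 142 clusters, and writes \(P_{c \to t}\) for the cluster-to-topic probability and \(P_{t \to c}\) for the topic-to-cluster probability. Under this assumption the probabilities for a fixed cluster sum to 1 over topics, and those for a fixed topic sum to 1 over clusters, so the average values are

\[
\frac{1}{T}\sum_{t=1}^{T} P_{c \to t} = \frac{1}{T} = \frac{1}{40} = 0.025,
\qquad
\frac{1}{C}\sum_{c=1}^{C} P_{t \to c} = \frac{1}{C} = \frac{1}{142} \approx 0.007 .
\]

On average, cluster-to-topic probabilities are therefore about 142/40 ≈ 3.5 times larger than topic-to-cluster probabilities, which is in line with using a higher threshold for the former.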

Figure 6 shows the relations between topics and clusters obtained using the above-mentioned thresholds. To improve clarity, unique clusters or topics are not included in the figure. The figure presents three types of relations: one-to-one, one-to-many and many-to-many. One-to-one and one-to-many relations are indicative of TM and CC identifying similar intellectual structures. Conversely, many-to-many relations and unique topics or clusters reveal differences in the intellectual structures identified by TM and CC.

Fig. 6 Relations between topics and clusters

Judging from the nature of the topics and clusters, both methods identified almost the same research areas within CVR. For example, both methods identified research areas such as Cell Level Studies (C27, C32, C54, C57, C67, C108, T2), Gene Level Studies (C16, C129, T24), Biochemistry-Ion Channel Studies (C56, T37), Heart Failure (C10, C12, T11), Atrial Fibrillation-clinical studies (C5, T13), Surgical Procedures of Congenital Heart Disease (C4, C15, C42, C107, C118, C121, C124, T12), and Mental Health (C45, T31). Two kinds of relations reveal similarities between topics and clusters: one-to-one and one-to-many relations.

For one-to-one relations, the similarities are easily understood. Our explanation therefore focuses on how similarities are revealed in one-to-many relations, as demonstrated in panel B of Fig. 1. We observed that one cluster can correspond to several topics. For instance, C14 corresponds to T16, T25 and T33. To elaborate, C14 is associated with Medical Imaging Techniques, and includes terms such as “myocardial perfusion imaging”, “computed tomography”, “coronary computed tomography angiography”, and “coronary angiography”. Approximately 14% of the publications in Electrocardiogram (T33), 22% of the publications in Coronary Angiography (T25) and 17% of the publications in Magnetic Resonance Imaging (T16) are in Medical Imaging Techniques (C14). This implies that topics T16, T25 and T33 and cluster C14 identify a similar research area in CVR at different levels of granularity, with TM providing a more refined classification than CC. We explored the underlying reasons for the creation of one-to-many relations. CC groups publications that use similar materials, equipment, practical techniques, or tools from the cited work, in line with the methodological type of citation described by Bornmann and Daniel (2008). As a result, CC tends to yield more generic results in the context of Medical Imaging Techniques. On the other hand, TM structures publications based on the co-occurrences of terms in similar texts (Daenekindt & Huisman, 2020). Consequently, TM distinguishes differences among diagnostic approaches employed for various diseases. In the case of Medical Imaging Techniques, several topics (e.g., T16, T25, T33) are associated with different aspects of the field. To sum up, regarding Diagnosis Techniques, clusters generated by CC provide a generic perspective on diagnostic techniques, while topics derived from TM depict specialized sub-techniques for diverse applications.

One-to-many relations also occur in the opposite direction, where one topic corresponds to multiple clusters, as illustrated in panel B of Fig. 1. For instance, T21 is associated with clusters C65, C82, C94, C109 and C119. In more detail, T21 includes terms such as “transcatheter”, “closure”, “catheter”, “PVI”, “catheterization”, and “reintervention”, which indicates an emphasis on Interventional Treatment. Approximately 25% of the publications in Transcatheter Closure (C65), 32% of the publications in Transcatheter Closure-Pediatric research (C109), 26% of the publications in Vascular Surgery (C119), 28% of the publications in PVI (C82) and 29% of the publications in Hemodialysis Access (C94) belong to Interventional Treatment (T21). Although Transcatheter Closure (C65, C109), Hemodialysis Access (C94) and PVI (C82) all fall under the category of Interventional Treatment, their treatment objectives and principles differ. In short, regarding Interventional Treatment, topics generated by TM depict a generic perspective, while clusters obtained from CC offer a specialized classification for different treatment objects.

Two types of relations reveal dissimilarities between topics and clusters. The first type is a unique solution identified by either TM or CC, as depicted in panel D of Fig. 1. As an example, TM discerns the unique topics Practical Guidelines for Clinical Medication of CVR (T1), Prevention Strategies of Cardiovascular Diseases (T4) and Clinical Trial Studies (T5). We explored the reasons for the existence of unique topics and found that publications on these topics are distributed over several clusters, with a primary focus on medication adherence and risk factors. Additionally, some clusters, like Nutrition & Diet (C58), Arterial Occlusive Diseases (C95), Food Chemical Elements (C101), Protein Studies of Biological Chemistry (C113), and Phlebology Studies (C132), do not have corresponding topics. We examined the characteristics of these clusters and found that they contain a limited number of CVR publications, and that the publications within one cluster focus on several research objectives. In short, TM groups publications into specific topics focused on Practical Guidelines for Clinical Medication of CVR (T1) and Clinical Trial Studies (T5), while publications within these topics are distributed among various clusters that center on different aspects of risk factors. Furthermore, CC generates clusters of small size that are characterized by their focus on several research objectives.

Many-to-many relations are the second type that highlights the dissimilarities between TM and CC. Many-to-many relations refer to situations where a single cluster is associated with multiple topics, and each topic is linked to multiple clusters, as exemplified in panel C of Fig. 1. For instance, Rheumatic Diseases (C35) and Life Assistance Devices and Heart Transplantation (C25) display varying proportions of publications related to Life Assistance Devices (T30). Furthermore, C25 has publications that are related to both Life Assistance Devices (T30) and Heart Transplantation and Medication Adherence (T39). This makes it a connecting point for both T30 and T39. Similarly, T39 serves as a bridge linking Hypertension (C8) and Life Assistance Devices and Heart Transplantation (C25). C8 also serves as a connection point linking Heart Transplantation and Medication Adherence (T39) and Hypertension (T3). These connection points link C25, C35, T30, T39, C8, and T3 together, forming a many-to-many relation. In more detail, 21% of publications in cluster Life Assistance Devices and Heart Transplantation (C25) and 27% of publications in cluster Rheumatic Diseases (C35) are in Life Assistance Devices (T30). In addition, Heart Transplantation and Medication Adherence (T39) contains 11% of the publications in Life Assistance Devices and Heart Transplantation (C25) and Hypertension (C8). Meanwhile, T3 encompasses terms such as “hypertension”, “preeclampsia”, “blood pressure”, “food”, and “salt”. “Preeclampsia” is a complication of pregnancy-induced hypertension, which is one subcategory of hypertension, and “food” and “salt” refer to dietary causes of hypertension. This shows that T3 focuses on Hypertension. Approximately 30% of the publications in Hypertension (C8) are in T3.

In summary, TM groups publications into topics such as Hypertension (T3), Life Assistance Devices (T30) and Heart Transplantation (T39), which depict interdisciplinary connections. In contrast, CC structures publications into clusters that center on different aspects of these topics, thereby creating a distinct division between Risk Factors and Physiological Research. In addition, we discovered topics that highlight the surgical or clinical research areas of CVR, with corresponding publications distributed across clusters that focus on specific diseases. Moreover, many-to-many relations demonstrate differences in intellectual structure. The TM map exhibits a close association between Risk Factors and Diagnosis Techniques, while the CC map reveals a strong connection between Risk Factors and Diseases or Surgical Procedures.
