mi-Mic: a novel multi-layer statistical test for microbiota-disease associations

Challenges in difference analysis of microbiome

As mentioned, a plethora of approaches were proposed for difference analysis in microbiome [11,12,13,14,15, 17], each striving to identify meaningful taxonomic variations. Here, we focus on the six most widely-used and well-cited methods (defined by the number of citations within 2023 in over 30,000 published manuscripts), namely LEfSe, ANCOM, ANCOM-BC2, DeSeq2, ALDEx2, and the innovative LINDA (Fig. 1A).

Given the absence of a definitive GT, we cannot use true positives (TP) and false positives (FP) to compare methods. Instead, we compared the “real positives - RP”, the number of significant taxa in the observed data with the real labels, to the “shuffled positives - SP” — the number of significant taxa when the test is applied on the same microbial data, but shuffled labels. We compared 20 diverse datasets comprising both 16S (12 datasets) and WGS data (8 datasets) (Fig. 1B and Additional file 1: Table S2). Most, but not all popular approaches exhibited higher numbers of RP than SP (Fig. 1B). However, in most methods, there was a very high number of SP, and often, a similar order of RP and SP (diagonal in Fig. 1B). Applying multiple measurement corrections does not improve the RP to SP ratio, since the reduction in SP often comes at the expense of a drastic decrease in RP, or in some cases, has a minimal impact on SP (Fig. 1B, darker colors).

The second potential error is the taxa independence assumption. To address this concern, we define sisters as taxa having the same “mother” in the cladogram (e.g., Bifidobacterium (s) and adolescentis (s) are sisters that share the same mother Bifidobacterium (g)). Approximately half of the datasets demonstrated significant sister-label relations, as defined by the Spearman correlation coefficient (SCC) between the normalized frequency, at a significance level of \(p < 0.05\) (Fig. 1C and Methods for details).

miMic: a phylogeny-informed approach to address microbiome difference analysis challenges

To address the challenges above, we propose a comprehensive 3-step approach named miMic. miMic incorporates phylogeny information to enhance the accuracy and reliability of microbiome difference abundance analysis. We outline the key steps of miMic and its potential benefits, before showing the validity of the assumptions underlying and its high accuracy on analytical models, simulations, and real-world data.

As the first step, miMic preprocesses the microbiome frequencies using MIPMLP [41] (see Methods) and translates them into a cladogram of means using the iMic algorithm [42] (Fig. 1D data processing). This cladogram captures the taxonomic relationships within the microbiome data, providing valuable insights into the underlying structure.

The second step is an a priori nested ANOVA test to the microbiome cladograms and the labels to test for a relation between the label and the microbiota (Fig. 1D a priori nested ANOVA). If no significant relation is identified at any taxonomy level, the label is deemed “not microbially explainable,” and no further difference analysis is performed. If the label is continuous, the ANOVA is replaced by the appropriate GLM.

As a final step, a post hoc Mann-Whitney test is applied that leverages phylogeny information to conduct the difference analysis. Starting with the first taxonomy level, a Mann-Whitney (or Spearman correlation) test is applied to the values of each taxon, with a predetermined significance (currently \(p<0.05\)). Multiple measurement corrections are employed, but only for the number of taxa at this level. If significant taxa are detected, the test proceeds iteratively along the cladogram towards the leaves (Fig. 1D post hoc Mann-Whitney test). Notably, no multiple measurement correction is applied beyond the first level due to the stringent demand for significance along the entire cladogram trajectory. At each such level, the number of candidate taxa is low. If no significant taxa are found, we repeat the analysis starting with a finer taxonomy level (until the class level). To account for inner sister relations, we employ FDR when multiple significant sisters are present. When one sister and its significant sister share the same “mother” in the cladogram, FDR is applied to the p-values of all sisters, except the most significant one. To handle rare species, miMic also defines significant leaves following multiple measurement corrections as significant, even if they have no significant ancestor. Since, previously we have shown that those contribute practically no shuffled positives.

By integrating phylogeny information, miMic offers several advantages over existing approaches. The requirement for significance along the cladogram trajectory effectively reduces the false discoveries rate. Additionally, miMic’s consideration of inner sister relations ensures more reliable detection of real positive associations without drastically decreasing their number. Since sister taxa are correlated, there is a high probability that if a taxon is associated with a label, so will be the average with its sisters (mother taxon), as shall be proven.

Validation of miMics’ assumptions on analytical models, simulations, and real-world dataAnalytical models

To understand the logic behind miMic, assume three sister taxa with a common mother with their average in each sample. Further, assume that for each class, the distribution of each sister is normal and that with no loss of generality, the average of all sisters in the first class is 0, and the variance is 1. Each daughter taxa is assigned with a value \(x_\) and the mother taxon is assigned: \(m_j=\sum _i x_/3\). Since we focus on the distribution, we denote for a generic taxon in a generic sample \(m=m_j,z_i=x_\).

One can study three simplified regimes.

(0,0,0) — where there is no difference between the classes, and the average of all sisters in the second class is also 0. \(z1,z2,z3\sim \mathcal (0,\,1)\). In such a case, neither the sisters nor the mother should be significant.

(0,0,\(\varvec\)) — The last sister has an average of \(\mu\) in the second class and is thus associated with the class. \(z1\sim \mathcal (\mu ,\,1)\) and \(z2,z3\sim \mathcal (0,\,1)\). We would expect both the mother and the last sister to be significant, but none of the others.

(0,\(\varvec\),\(\varvec\)) — The two last sisters have a difference between the first and second class, with varying strengths. \(z3 \sim \mathcal (0,\,1)\), \(z1 \sim \mathcal (\mu ,\,1)\), and \(z2\sim \mathcal (\alpha \mu ,\,1)\). In this case, one would expect if \(\alpha >0\), both cases would be easily detected, but if \(\alpha <0\), we may miss them, since their mother may lose the correlation with the label.

In this simplified model, one can analytically show (see Appendix and Fig. 2A) that if the sisters are not associated with the class \(\sim \mathcal (0,1)\), then their mother distribution is (\(\sim \mathcal (0,\frac3)}))\). As such, the probability that it will be observed as significant is even lower than the one of the daughter (formally, this is equivalent to having \(6S-2\) degrees of freedom (DOF) and doing a test with \(2S-2\) DOF, where S is the number of samples). Moreover, numerical results show that requiring a p level significance on both the mother and the daughter is approximately equivalent to a \(p^2\) requirement on the daughter. As such, the fraction of SP cases is much lower than p (see Fig. 2B).

Fig. 2figure 2

Validation of miMics’ assumptions on analytical models and simulations. A Daughter’s distribution (pink) vs. mother’s distribution (dark pink) in the regime of (0,0,0). The mother’s distribution is noticeably narrower, with approximately half that of the daughter’s distribution. BD Comparison of the leaf confidence with miMic’s confidence based on analytical integral calculations over 3 different regimes: regime (0,0,0) (B), regime (0,0,\(\mu\)) (C), regime (0,\(\mu\),\(\alpha *\mu\)) (D). The lines represent estimated slopes, denoted as S. In D, different colors represent varying levels of connection between the sisters, controlled by \(\alpha\) values (0.25, 0.5, 1, 2, 4). E Histogram illustrating the distribution of inner sisters-label SCCs across different cohorts. The black line represents the zero line, and the dashed pink line represents the average of the distribution, indicating a right-skewed distribution, with most sisters showing a consistent positive correlation with the label. FG A comparison between the number of samples and the number of FP for miMic and Mann-Whitney leaf simulations based on the regime of (0, 0, 0) (F) and the regime of (0, 0, \(\mu\)), where \(\mu\) was set to 1. The lightest pink color represents the Leaf-C model, the middle pink represents the Leaf model, and the darkest pink represents the miMic model. H Comparison between the number of samples and the number of TP for miMic and Leaf Mann-Whitney simulations based on the regime of (0, 0, \(\mu\)). The color coding is similar to the color coding of FG. IJ Comparison between the FP (I) and TP (J) of the miMic model and the Leaf model in the simulation of the regime (0, 0, \(\mu\)) over different numbers of samples and different values of \(\mu\).The Leaf model shows a higher number of FP compared to miMic, while miMic’s TP taxa are similar to Leaf’s TP taxa. KL Comparison between the FP (K) and TP taxa (L) of the miMic model and the Leaf model in the simulation of the regime (0, \(\mu\), \(\alpha *\mu\)) over different numbers of samples and different values of \(\alpha\). The Leaf model exhibits a higher number of FP compared to miMic, while miMic’s TP are similar to the Leaf’s TP. M The F1 scores of different differential abundance methods are depicted across three distinct setups of microbiome-oriented simulations. miMic is again pink and is much higher than all current methods, many of which have an F1 of 0

In contrast, when the daughter is associated with the class, then with a very high probability the mother is also associated with the class (except for the cases, where the association of the daughter is marginal, and then the addition of the \(\sim \mathcal (0,2)\) from its sisters masks the signal). Thus, performing both tests is almost equivalent to performing only the daughter test. As such the fraction of RP cases is practically not affected if \(\alpha > 0\) (Fig. 2C, D).

This still leaves the problem of the other sisters. If one sister is associated with the label, so will their mother with a high probability. As such, the fraction of SP for those may be high. To prevent this case, we apply a multiple measurement correction in this specific case (Fig. 2D).

Finally, the case where miMic may fail is when one sister is positively associated with the class, and the other is negatively associated with the class. However, the fraction of such cases is very small (Fig. 2E). Moreover, to handle such cases, miMic also includes significant leaves following a multiple measurement correction.

Simulations Generic hirerachial simulations

We further validated our analytical analysis using simulations designed to mimic the hierarchical structure of our analytical model. These simulations were the counterpart of the analytical models above. In each scenario, we assessed the performance of miMic against two leaf-level Mann-Whitney tests: one with corrections for multiple measurements (referred to as “Leaf-C” and displayed in the lightest pink) and the other without corrections (referred to as “Leaf” and presented in the middle pink). Since those are simulations, we know the GT. Thus, the evaluation criteria encompassed the identification of both FP and TP, alongside the computation of theoretical probabilities of significance.

The simulations were executed across a range of sample sizes (N varying from 20 to 1280) and \(\mu =0.1-1\). Much like our analytical analysis, the simulation results consistently underscore miMic’s distinct advantages. Notably, miMic consistently demonstrated a lower number of FP while maintaining a comparable TP rate across all scenarios (note that those are simulations, and we know the GT). miMic consistently outperformed the two Mann-Whitney leaf tests (as illustrated in Fig. 2F, G (\(\mu\) =1), I, and K). As expected from the analytical models, miMic’s ability to control the false positive rate did not compromise its capacity for true positive detection when compared to the Mann-Whitney leaf tests (Leaf and Leaf-C) under the same simulated labels and hyperparameters (as depicted in Fig. 2H (\(\mu\) =1), J, and L).

Realistic microbiome-based simulations

To further assess miMic against existing methods, we conducted simulations using three diverse real microbial datasets with varying sample sizes (PRJNA353587, n = 83; IBD, n = 257; ERP020401, n = 684), we aimed to comprehensively capture the characteristics of genuine microbiome data. We randomly selected 10 taxa along with all their ASVs, elevating their abundances by 20% in samples randomly labeled as positive. A parallel process was executed for another set of 10 taxa linked to samples randomly labeled as negative. Sample labeling was independent for each sample and with equal probability for positive and negative. We computed the F1 score for each model (Fig. 2M). Again miMic has a higher F1 score than all current methods.

Real-world cases

To show that miMic is indeed much more accurate than the current state-of-the-art (SOTA) methods, we compared it with the most popular SOTA - LEfSe, DeSeq2, ANCOM, ANCOM-BC2, ALDEx2, and LINDA over 20 different diverse datasets comprising both 16S (12 datasets, see Additional file 1: Table S2) and WGS data (8 datasets, see Additional file 1: Table S2). As mentioned, an inherent challenge when assessing differential analysis over real-world cases is the absence of GT information (illustrated in Fig. 3A). Therefore, most of the standard metrics of Precision, such as Area Under the Sensitivity-Specficity Curve (AUC), Recall, and F1 score cannot be computed without assumptions on the distribution. To address that, we define the \(RSP(\beta )\) score (Fig. 3A) as \((\beta \cdot RP -SP)/ (\beta \cdot RP+SP)\), where \(\beta\) represents the importance of the RP vs. the SP. When \(\beta =1\), there is an equal emphasis on RP and SP. In contrast, \(\beta =0.05\) implies a willingness to forgo 20 RP to avoid 1 SP. This representation allows for tuning the type I and II errors of the analysis, in contrast with the permutation evaluation methods that focus solely on minimizing SP. The RSP of miMic is significantly higher \((p-value < 0.05)\) than the RSP of the SOTA models in 16S datasets and WGS datasets (Fig. 3B–C pink vs. all the other colors, the significance was tested for \(\beta\) = 0.1, 0.5, and 1, see Additional file 1: Table S3).

Fig. 3figure 3

Validation of miMic vs. SOTA models on real-world datasets. A RSP(\(\beta\)) schematic explanation. The tasks in the differential analysis field can be divided into 2 main types: (1) tasks with a predefined GT, where False and True labels are clearly defined (upper left scheme), and (2) tasks without a predefined GT, as is common in most real-world datasets (upper right scheme). In the second scenario, “True” and “False” are replaced with “Real” (calculated on real labels) and “Shuffled” (calculated on shuffled labels). The RSP(\(\beta\)) score is defined by \((\beta \cdot RP -SP)/ (\beta \cdot RP+SP))\), where \(\beta\) representing the user-defined weighting of Real Positives (RP) vs. Shuffled Positives (SP). BC Comparative analysis of different differential analysis methods as a function of RSP(\(\beta\)) over 16S cohorts (B) and WGS cohorts (C). Each color represents a specific model: orange for DeSeq2 (light without FDR correction and dark with FDR correction, denoted as DeSeq2-C), yellow for LefSe, green for ANCOM (light without FDR correction and dark with FDR correction, referred to as ANCOM-C), blue for LINDA (light without FDR correction and dark with FDR correction, denoted as LINDA-C), brown for ada-ANCOM, and pink for miMic (pink for log Sub-PCA MIPMLP preprocessing and purple for relative mean MIPMLP preprocessing). Each line illustrates the average RSP(\(\beta\)) score across all cohorts (12 16S in B and 8 WGS in C). The light shadows surrounding each line represent standard errors calculated over 10 simulations of the shuffled models across all cohorts (12 16S in (B) and 8 WGS in (C)). DE Comparison of different starting taxonomy levels of the miMic test and their corresponding RSP(\(\beta\)) scores over 16S cohorts (D) and WGS cohorts (E). Each taxonomy level is indicated by a different line style (1 for kingdom, 2 for phylum, 3 for class, 4 for order, 5 for family, 6 for genus, and 7 for species). Typically, the best RSP scores are achieved in the first two taxonomy levels. The inner bar plot presents the number of cohorts in which miMic is deemed significant when commencing the Mann-Whitney test at each taxonomy level. F Scatter plot of the sister’s-labels SCC vs. the RSP(1) score. A significant positive correlation of 0.588 is observed between the SCCs of sister labels and the model’s performance

Note that in additional datasets (not shown here) where the a priori nested ANOVA was not significant in any of the taxonomy levels, the post hoc test did not find any significant taxa. This emphasizes the importance of the a priori nested ANOVA to avoid useless taxa-specific tests.

Moreover, the cladogram structure is crucial. miMic obtains the best RSP scores when it starts on one of the 2 first taxonomy levels (kingdom or phylum, Fig. 3D-E). Interestingly, we find a clear positive correlation between the inner sisters’-label correlation (as defined in the “Methods” section) and the RSP(1) score (SCC = 0.588, p-value = 0.006, Fig. 3F). Only when there are practically no correlations between sisters is the RSP low. This finding underscores the importance of considering the correlation structure among taxa in the context of differential analysis. In contrast for high sister correlations, RSP is almost always 1 (Fig. 3B, C and Additional file 1: Fig. S2).

miMic’s consistency and robustness

We next investigated the overlap in significant taxa across tools within each dataset [43]. While this is not an absolute measure of accuracy, consistency implies that the assumptions made in the analysis have a limited effect on the results. If two methods with different assumptions produce consistent results, one can assume that the results are not strongly affected by the assumptions. miMic is more consistent with other models than the average of all models (Fig. 4A and Additional file 1: Fig. S3 — where miMics’ distribution of overlap is more skewed to higher overlap values than the average overlap distribution of the other models).

Fig. 4figure 4

Consistency and robustness analysis of miMic. A Within-study differential abundance consistency analysis across multiple tools. The percentage of total significant features is plotted against the number of tools that identified the feature as significant. Results are shown for the miMic model (pink) and the average of all SOTA models (white). Refer to Additional file 1: Fig. S3 for a detailed analysis of all 13 tools. The total number of significant features identified by each tool is provided in the legend. miMic demonstrates slightly higher consistency compared to the average of all SOTA models. BC Cross-study consistency analysis of differential abundance. The percentage of significant species is plotted against the number of studies where each species was identified as significant, conducted on five IBD cohorts. Results for miMic are depicted in pink (B), while those for the average SOTA model are shown in white (C). The expected results are presented in black (see the “Methods” section). Additionally, a parallel analysis on shuffled labels is provided for the ANCOM-BC2 model (green) within C. The models’ performance exceeds that of the expected random model. However, certain tools, such as ANCOM-BC2, exhibit artificially consistent results, as indicated in the inner plot of (C). For a comprehensive analysis of all 13 tools, refer to Supplementary Material Fig. S4. D Sensitivity robustness assessment. The heatmap illustrates the SCCs between each generic dataset characteristic and the percentage of significant taxa identified by each tool per dataset. Positive correlations are depicted in red, while negative correlations are shown in blue. Stars indicate a significant correlation (p-value < 0.05). miMic demonstrates robustness across all tested generic features in 16S datasets. For parallel analyses conducted on 16S and WGS cohorts, detailing the percentage of significant taxa identified by each tool per dataset and RSP score, refer to Supplementary Material Fig. S5

We next investigated the consistency of miMic (vs. other models) across datasets of the same disease. We specifically focused on IBD as a phenotype, which has been shown to exhibit a strong effect on the microbiome and to be relatively reproducible across studies [44, 45]. We acquired five datasets for this analysis representing the microbiome of individuals with IBD compared with individuals without IBD (see the “Methods” section). We ran all differential abundance analysis tools on each individual dataset and restricted our analyses to the 81 species found across all datasets. Tools that generally identify more species as significant are accordingly more likely to identify species as consistently significant compared with tools with fewer significant hits. We thus compared the observed distribution against the expected distribution given random labels. miMic demonstrates notable consistency in findings across different cohorts, with over 10% of significant species consistent within three different cohorts (Fig. 4B). While other models show similar behavior on average, miMic stands out for its lack of false consistent species. Notably, all tools exhibit significantly higher consistency than random expectation across these datasets (Fig. 4B and C and Additional file 1: Fig. S4).

Another limitation of a statistical tool may be the effect of technical aspects of the sample on the fraction of real-label positives and shuffled positives in different datasets. We analyzed the correlation between multiple characteristics such as read depth, sparsity, richness, and dataset size and the percentage of significant taxa identified by each tool per dataset. In contrast with most other methods, we found no significant correlations of any of these factors with the miMics’ percentage of significant taxa (Fig. 4D). However, as expected, we do find in WGS correlations of the richness with the number of positive results (since there are more species tested — Additional file 1: Fig. S5).

Example of miMic application and code availability

miMic is readily accessible through PyPi via https://pypi.org/project/mimic-da/ [46]. We here follow the utilization of miMic using an IBD cohort [47] as an illustrative example. The same example is given with detailed commands and formats in the Additional file 1. First, the ASVs undergo preprocessing via the MIPMLP pipeline employing default parameters (taxonomy level = 7, taxonomy group = Sub-PCA, normalization = log, epsilon = 0.1). Subsequently, 2D images are generated using iMic.

An a priori nested ANOVA test is executed, and the p-value for each taxonomy level test is presented until significance is identified. In cases where there are no taxonomy levels exhibiting a significant correlation with the tag, it implies a lack of discernible relationship between the microbiome data and the targeted label. In such cases, there is no point in advancing to the following stages. Upon detecting significance at any level in the previous step, the Mann-Whitney test is further applied. In contrast, if the ANOVA is significant, miMic performs the Mann-Whitney test along trajectories from the broadest significant level.

miMic offers a comprehensive suite of data visualization tools that provide a holistic perspective on the cohort’s differential analysis:

Cladogram (Fig. 5): This visual representation allows users to delve into significant Mann-Whitney scores and their associated p-values. It facilitates the understanding of complex relationships among taxa within the context of the label.

Fig. 5figure 5

Differential abundance analysis results visualized on a cladogram for the IBD cohort. Each color represents the sign of the Mann-Whitney score (blue for positive scores, red for negative scores, and gray for non-significant taxa in internal nodes). The node size corresponds to -log10(p-value) from the Mann-Whitney test in miMic. The node shape represents its origin of significance: spheres were identified by both miMic and the Mann-Whiteny test on leaves, circles were identified by miMic only, and squares were identified by only the Mann-Whitney test. The colors represent the taxonomic family of each node

Taxa analysis (Fig. 6A): This plot unveils critical insights, including the counts of RP (blue bars) and SP (red bars) across various starting taxonomy levels of the trajectories of the Mann-Whitney test. Additionally, it provides important information on the significance of the ANOVA starting at each taxonomy level. Note the nested ANOVA stops once one of the taxonomy levels is significant. Therefore, the taxonomy levels used during the “test” mode have a gray background. Remarkably, many cohorts exhibit a scarcity of SP.

Fig. 6figure 6

miMic’s plots - example on IBD cohort. A Bar plot illustrating the taxonomy levels in the miMic test vs. the number of significant findings in a real run (RP) shown in blue, and in a shuffled run (SP) shown in red. The highest bar plot represents the actual RP vs. SP of the selected taxonomy level of miMic combined with the leaves test as explained in the Methods. Taxonomy levels used for the a priori nested ANOVA test are shaded in gray. The number of RP significantly exceeds the number of SP. B Interaction between significant taxa found in miMic. Each taxon is colored according to its significant family color, similar to Fig. 5 above. Each node shape represents the taxon’s order. An edge is drawn between two nodes if their SCC is above 0.3 (user-adjustable) and its \(p-value < 0.05\). The width of the edge corresponds to its SCC. A blue edge represents a positive relation, while a red edge represents a negative one. C Analysis of significant positive and negative relations within taxonomic families. The y-axis displays significant families in the cohort (defined by a family that has at least 1 significant descendant), while the x-axis shows the count of positive relations within a family in blue or the count of negative relations within a family in red. Each family is colored according to its color in the interaction network in B and the cladogram of correlations in Fig. 5 above

\(\varvec\) scores (The RSP plot of the IBD cohort is not shown here, see Supp Mat. Fig. S7): As mentioned \(RSP(\beta )\) may differ among \(\beta\) values. This plot shows the \(\beta\) values where the RSP is high enough for the results to be trusted.

Inner taxa interactions (Fig. 6B): The correlation in expression between taxa, as defined by their Spearman correlation (only significant correlations are drawn). This shows a very strong association between the most significant microbes, highlighting that many of the reported associations may be the result of linkage with other microbes.

Taxa-label relations (Fig. 6C): This visual aid sheds light on both positive and negative relationships between labels and taxonomic families. It shows the families that have consistent associations with the label, while other families have some positive and some negative associations. For example, in the IBD cohort, the Lachnospiraceae family has varied relations to IBD with an equal number of significant positive associations and significant negative associations within the family. While Strepctococcaceae family is consistent with its positive associations with IBD.

Taxa distributions over different taxonomy levels Through the application of log Sub-PCA in the MIPMLP processing, taxa distributions exhibit a notable tendency toward normality. This transformation is particularly evident when examining taxonomic distributions across different levels. In the coarser taxonomy levels, such as kingdom and phylum, the distributions approach a normal distribution for broad levels (Additional file 1: Fig. S6).

Comments (0)

No login
gif