Metagenomics distilled: new k-mer-based methods

k-mer-based methods have long been used for metagenomic analysis because they rapidly identify microorganisms in large datasets, estimate the relative abundance of microbial species and track strains in complex samples. They provide a fast, albeit approximate, approach that can often run in minutes on a standard laptop. Recently, however, they have begun to rival the quality of current standard approaches, which, by contrast, are typically grounded in DNA, RNA or peptide sequence alignments. The unique strength of k-mer approaches comes from their simplicity: they extract exact fixed-length subsequences (k-mers) and then detect exact matches between them. Thus, they do not rely on computationally expensive analysis steps such as sequence alignment or assembly of shotgun-sequenced reads, and instead use appropriate statistical models to measure biologically meaningful phenomena within the samples being studied.
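The core operation described above, extracting exact fixed-length subsequences and detecting exact matches, can be illustrated with a minimal Python sketch. The function names `kmers` and `containment` are hypothetical and chosen only for illustration. The sketch is not taken from any of the tools discussed here, and it omits refinements that real tools typically use, such as treating a k-mer and its reverse complement as equivalent or subsampling k-mers for speed.

```python
def kmers(seq, k):
    """Return the set of all exact length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(reference, reads, k):
    """Fraction of the reference's k-mers that occur exactly in at least one read.

    Illustrative helper only: no reverse complements, no subsampling,
    no correction for sequencing depth.
    """
    ref_kmers = kmers(reference, k)
    if not ref_kmers:
        return 0.0
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    return len(ref_kmers & read_kmers) / len(ref_kmers)


if __name__ == "__main__":
    # Toy sequences, far shorter than real genomes or reads.
    reference = "ACGTACGGA"
    reads = ["ACGTAC", "TTTTTT"]
    print(containment(reference, reads, k=4))
```

Because both steps are set operations on short strings, no alignment or assembly is needed, which is the source of the speed described above.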

In this vein, researchers recently developed a tool called sylph¹ that is capable of quantifying microbial species abundance within metagenomes with an accuracy surpassing that of all previous standard methods. They achieved this by developing a statistical model that accounts for k-mer variation arising from natural biological sequence variation within species as well as from technical sequencing errors. The researchers then measured subspecies sequence similarity (average nucleotide identity, ANI) between each reference genome and the metagenome containing it, and used these data to find disease–strain associations for Parkinson’s disease in a case–control study using a gut metagenome dataset (724 samples). Leveraging the computational efficiency of sylph, the ANI analysis was scaled to test all genomes in the complete unified human gut genome catalogue (n = 289,232). The resulting set of genome–metagenome similarities was analysed in a manner comparable to genome-wide association studies to identify which variants of gut microbiome species were significantly associated with the outcome. Twenty-five genomes displayed significant associations with Parkinson’s disease. At the species level, butyrate-producing bacteria (Blautia wexlerae, Agathobacter rectalis and Roseburia intestinalis) were negatively associated with Parkinson’s disease, whereas Ruthenibacterium lactatiformans, which is linked to low butyrate, was positively associated with the disease. Crucially, in most species, only a minority of genomes were associated with Parkinson’s disease, which indicates the importance of subspecies effects. However, unlike standard alignment-based microbial genome-wide association study workflows, sylph cannot identify specific genomic polymorphisms that can be documented, reported and subjected to follow-up analysis. To solve this problem and allow rapid analysis of genome variants, other k-mer-based tools capable of locating genomic polymorphisms have recently been developed.
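To give a sense of how a k-mer match rate can be turned into an ANI-like similarity, here is a simplified Python sketch. It assumes that each base differs independently with probability 1 − ANI, so that a k-mer survives unchanged with probability ANI^k and ANI ≈ containment^(1/k). This is a textbook-style approximation and not sylph's actual statistical model, which additionally accounts for sequencing depth and errors. The function name `ani_from_containment` and the value k = 31 are assumptions used only for illustration.

```python
def ani_from_containment(containment, k):
    """Approximate ANI from the fraction of reference k-mers found in a sample.

    Assumes independent per-base differences, so a k-mer is preserved with
    probability ANI**k. Simplified sketch; not the model used by sylph.
    """
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must lie between 0 and 1")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return containment ** (1.0 / k)


if __name__ == "__main__":
    # Illustrative values: 20% of 31-mers shared.
    print(round(ani_from_containment(0.2, 31), 4))
```

The exponent 1/k explains why even a modest fraction of shared k-mers corresponds to a high nucleotide identity: a single mismatch disrupts up to k overlapping k-mers.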
Methods based on split k-mer analysis²,³ identify single-nucleotide polymorphisms, whereas KmerAperture⁴ can capture a broader array of polymorphisms, from single-nucleotide polymorphisms to the presence or absence of entire genes. These approaches could be effective tools for detecting bacterial pathogen transmission and for genomic epidemiology and surveillance. When compared with standard alignment-based tools, these approaches are not as sensitive, but the overall strength of association between detected polymorphisms and study covariates (for example, date of sampling) is at least as strong, with downstream phylogenetic analyses showing only minor differences. These new computationally efficient k-mer-based tools enable analyses that would otherwise be impractical, such as reference-free approaches based on pairwise comparisons across all genomes in a population. However, they perform best on sets of closely related genomes and so cannot be used with the same simplicity as sylph to study metagenomes.
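The idea behind split k-mer analysis can be sketched in a few lines of Python: each k-mer is split into two flanking sequences around a single middle base, and two genomes that share the same pair of flanks but differ in the middle base yield a candidate single-nucleotide polymorphism. The function names `split_kmers` and `snp_candidates` are hypothetical, and this is a toy illustration under simplifying assumptions (no reverse complements, no ambiguous bases, tiny flanks), not the implementation used by the published tools.

```python
def split_kmers(seq, flank):
    """Map each (left flank, right flank) pair to the set of middle bases seen."""
    table = {}
    width = 2 * flank + 1
    for i in range(len(seq) - width + 1):
        left = seq[i:i + flank]
        middle = seq[i + flank]
        right = seq[i + flank + 1:i + width]
        table.setdefault((left, right), set()).add(middle)
    return table


def snp_candidates(seq_a, seq_b, flank):
    """List (left, base_a, base_b, right) where flanks match but the middle differs.

    Flank pairs with more than one middle base in either sequence are skipped
    as ambiguous.
    """
    a = split_kmers(seq_a, flank)
    b = split_kmers(seq_b, flank)
    hits = []
    for key in sorted(a.keys() & b.keys()):
        if len(a[key]) == 1 and len(b[key]) == 1 and a[key] != b[key]:
            left, right = key
            hits.append((left, next(iter(a[key])), next(iter(b[key])), right))
    return hits


if __name__ == "__main__":
    # Two toy sequences that differ by one substitution (G to A).
    print(snp_candidates("AACCGTTGGA", "AACCATTGGA", flank=3))
```

Because matching relies on exact flanks, two nearby variants can hide each other, which is consistent with the point above that these approaches work best on closely related genomes.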

Credit: Susanne Harris/Springer Nature Limited

Despite current limitations, these new methods unlock valuable options for large-scale metagenomic analysis and demonstrate that k-mer-based approaches are well suited to yielding biological insights from metagenomics.
