In April 2024, a new framework for nucleic acid synthesis screening was issued by the White House Office of Science and Technology Policy (OSTP). The purpose was “…to encourage providers of synthetic nucleic acid sequences to implement comprehensive, scalable, and verifiable synthetic nucleic acid procurement screening mechanisms…[to] minimize the risk of misuse” [1]. This framework formalized, as a condition of receiving life sciences research funding from the US government, the October 2023 guidance from the Department of Health and Human Services (DHHS) to Providers and Users of Synthetic Nucleic Acids [2]. That framework calls for providers of synthetic nucleic acids to screen ordered sequences 200 nucleotides or longer to identify sequences of concern (SoCs). Until October 2026, SoCs are defined as sequences that are best matches to a sequence of the Biological Select Agents and Toxins List or, for international orders, the Commerce Control List except when the sequence is also found in an unregulated organism or toxin. In October 2026 the sequence length to be screened will decrease to 50 nucleotides and SoCs will be defined as sequences known to contribute to pathogenicity or toxicity, even when not derived from regulated biological agents [1].
The phrase “known to contribute” is described, in both the OSTP and DHHS policies, as requiring published experimental data or, where such data are lacking, similarity (“best match”) to a sequence encoding a verified pathogenic or toxic function. We have been involved in efforts to collect and describe the roles of these sequences in pathogenesis and toxicity for programs funded by the United States Government, including the Functional Genomic and Computational Assessment of Threats (Fun GCAT) program of the Intelligence Advanced Research Projects Activity (IARPA) [3]. From 2017 to 2022, the computational portion of the Fun GCAT program funded tool development to answer three questions about any given sequence: (1) what is the original taxon? (2) what are its biological functions? and (3) how dangerous is it? We observed that the danger inherent in a sequence from a pathogen depended on what anti-host functions it possessed. When considering the intelligent design of a microbe engineered for human woe, if a sequence had a concerning function in one organism, then it seemed reasonable to assume that function could be maintained when transferred to another organism. Through extensive review of the literature, we came to realize that ‘dangerous’ sequences were found primarily among virulence factors from pathogenic microbes and secondarily from venom-producing taxa.
Pathogenic microbes—viruses, bacteria, fungi, and protozoa—cause an array of human diseases. Pathogens are distinguished from their nonpathogenic relatives, including nonpathogenic strains of the same species, by specific molecules (carbohydrates, lipids, proteins and combinations thereof as well as small RNAs) that endow them with the capacity to exploit particular hosts [4, 5]. These sequences, often called virulence factors, play essential roles in infectious disease [6].
As we attempted to categorize ‘dangerous’ sequence functions, we found that, for nonviral and nontoxic sequences, there was no existing, standardized terminology to distinguish those that were concerning from those that were innocuous [7]. To remedy this, we developed two controlled vocabularies: Functions of Sequences of Concern (FunSoCs) [7, 8] and the Pathogenesis Gene Ontology (PathGO) [9], then used these to label sequences we had annotated from the published literature in microbial pathogenesis. Other groups were also working on this in parallel [10].
Later, we realized a more universally applicable set of terms was necessary to annotate sequences in public databases. While FunSoCs were not meant to be a comprehensive set of descriptions, PathGO was established to provide more granularity. Though it proved unsuitable for annotating public datasets, PathGO did inform our revision and expansion of the gene ontology (GO) terms applicable to microbial pathogenesis. This article briefly discusses FunSoCs and PathGO then analyzes our work adapting GO biological process terms to make them pertinent to pathogenic functions of microbes.
FunSoCs
FunSoCs were a quick solution for our sequence-gathering effort that began in 2018. As we have described elsewhere, at that time there was no available controlled vocabulary for denoting nonviral pathogenic sequence activity [7, 11]. We binned sequences according to what host biological process was affected: transcription, translation, cell cycle, cytoskeleton dynamics, the endomembrane system, autophagy, regulated cell death, small GTPases serving as molecular switches, and ubiquitination.
We also attempted to capture the pathogenic consequences of the sequence activity. We categorized sequences according to whether they were involved in adherence to host molecules, in invasion of the host, or in active dissemination of the microbe through host barriers. Intracellular pathogens often hijack host cellular components to develop a protected replicative compartment, which is also a FunSoC category.
Sequences from pathogens that damage the host were binned according to whether they (1) disabled a host organ (2) lysed or otherwise killed the host cell (3) permeabilized tissue structures or (4) caused inflammation. The cause-and-effect for sequences associated with inflammation can be particularly difficult to discern. The natural consequence of the host detecting a microbial component or microbial activity disrupting host homeostasis (translation blockage, cytoskeleton disruption, organelle stress, etc.) is activation of pathways that result in inflammatory damage to the host [12, 13]. This is a host-directed, evolved activity. But a few pathogen sequences enzymatically activate host signaling pathways to force an inflammatory response [14,15,16]. We had the most, and the most varied, FunSoCs for sequences that affected host innate immunity. If the sequence altered a microbial molecule so it was less detectable by a host sensor, then it was categorized as passive immune evasion. If the microbial sequence inactivated or disrupted a host immune component or immune effector, then it was categorized as immune subversion.
At the time of their generation, we recognized that the ~30 FunSoC terms we developed did not provide a sufficiently granular description of sequences of concern. They were nevertheless useful for denoting consequences of the action of microbial virulence factors during pathogenesis that were otherwise hard to capture. Many are helpful as grouping terms [7, 8], though some, such as “subverting host innate immune signaling”, were intolerably broad as they encompassed dozens of discrete cellular signaling pathways (see Table 5 below).
Ontologies for the life sciences: representing microbial pathogenesis
Ontologies are structured vocabularies that define concepts, and relationships between concepts, in a particular domain. Their structure allows computers to reason over them, and the biomedical informatics community has long recognized the utility of ontologies in aggregation and analysis of complex data [17]. Microbial pathogenic processes differ from homeostatic processes occurring in a single organism because the activity of toxins and virulence factors involves at least two sequences from different organisms interacting in the space of the target organism, posing a challenge for ontological representation. While GO represents inter-species interactions [18, 19], the terms are broad and have lacked specificity for pathogenicity over mutualistic interactions [20]. They have also failed to differentiate intra-organism (homeostatic) effects from inter-organism (mutualistic or pathogenic) effects. PathGO was an initial attempt to incorporate pathogenic interactions between microbes and hosts.
PathGO
PathGO was developed as an application ontology describing mechanisms of pathogenesis to support the straightforward, unambiguous annotation of viral, eukaryotic, and bacterial genes. This application ontology fills a previously recognized gap for a focused ontology to improve sequence annotation related to mechanisms of pathogenesis [20]. PathGO has been maintained in a public source code repository on GitHub (https://github.com/jhuapl-bio/pathogenesis-gene-ontology) [9]. PathGO utilizes a Web Ontology Language (OWL)-based data model to represent knowledge related to mechanisms of pathogenesis and observes principles set forth by the Open Biomedical Ontologies (OBO) Foundry, which aims to promote standardization and interoperability for biomedical ontologies [21]. Mechanisms of pathogenesis are specified where a mechanism is considered to be “the means by which an effect is produced or brought about”. The term space is organized in two branches that separate direct and indirect mechanisms of pathogenicity. These two branches contain thirteen and eight first-level children, respectively (Fig. 1), which capture the breadth of known mechanisms, with depth elaborated through expansion of subclass hierarchies.
Fig. 1 Tree view of PathGO terms. Left: top-level structure. Right: first-level substructure in the direct and indirect mechanism branches
The gene ontology and biological process terms detailing pathogenic mechanisms
The Gene Ontology is a data framework representing biological systems, ranging from the molecular to the organism level, in a species-agnostic manner. As of November 2024, GO contains over 40,000 concepts encompassing signaling and metabolic pathways, developmental processes, cell cycle, etc. (https://release.geneontology.org/2024-11-03/index.html). GO has three aspects which form axes that should provide a complete picture of the function of a gene:
1) Molecular Functions represent activities performed by gene products, such as “catalysis” or “transcription regulator activity”
2) Biological Processes signify ‘biological programs’ accomplished by the concerted action of multiple molecular functions
3) Cellular Components denote the cellular location in which the molecular function of the gene product occurs
For Sequences of Concern, the main relevant branch of GO is GO:0044003 symbiont-mediated perturbation of host process. Most of the developed GO terms are shown in Tables 1, 2, 3, 4, 5, 6, 7, and 8. Molecularly, the function of the SoC can be any GO Molecular Function, since these are not restricted to interspecies interactions: for example, GO:0016248 channel inhibitor activity, GO:0010856 adenylate cyclase activator activity, GO:0090729 toxin activity. Since these are orthogonal to the Biological Process (BP) aspect of a gene product’s function, any one of these functions can be combined with any BP term from the GO:0044003 symbiont-mediated perturbation of host process branch.
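The orthogonality of the two aspects can be illustrated with a minimal sketch. It pairs the three Molecular Function terms named above with the GO:0044003 Biological Process root. The `GoTerm` class, the `pair_annotations` helper, and the accession `EXAMPLE_1` are hypothetical names introduced only for illustration; they are not part of GO or of any GO tooling.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class GoTerm:
    """Hypothetical record for a GO term: identifier, label, and aspect."""
    go_id: str
    label: str
    aspect: str  # "MF", "BP", or "CC"


# Molecular Function terms named in the text.
MOLECULAR_FUNCTIONS = [
    GoTerm("GO:0016248", "channel inhibitor activity", "MF"),
    GoTerm("GO:0010856", "adenylate cyclase activator activity", "MF"),
    GoTerm("GO:0090729", "toxin activity", "MF"),
]

# Root of the Biological Process branch most relevant to SoCs.
HOST_PERTURBATION = GoTerm(
    "GO:0044003", "symbiont-mediated perturbation of host process", "BP"
)


def pair_annotations(accession, functions, processes):
    """Return every (accession, MF id, BP id) pairing for one gene product."""
    for term in functions:
        if term.aspect != "MF":
            raise ValueError(f"{term.go_id} is not a Molecular Function term")
    for term in processes:
        if term.aspect != "BP":
            raise ValueError(f"{term.go_id} is not a Biological Process term")
    # The aspects are independent axes, so any MF may accompany any BP.
    return [(accession, mf.go_id, bp.go_id) for mf, bp in product(functions, processes)]


# "EXAMPLE_1" is a placeholder accession, not a real UniProt entry.
pairs = pair_annotations("EXAMPLE_1", MOLECULAR_FUNCTIONS, [HOST_PERTURBATION])
```

In practice a curator would select only the pairings supported by published evidence; the full product is shown here only to make the independence of the two axes concrete.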
Table 1 Damage to host from cytotoxicity and cell permeabilization
Table 2 Damage to host tissue
Table 3 Damage to host from inflammation
Table 4 Immune evasion, passive
Table 5 Subversion of host immune signaling
Table 6 Subversion of host immune effectors
Table 7 Host invasion, adherence to host, dissemination in host
Table 8 Manipulation of host cell biology
GO recently narrowed and clarified the definition of a biological process (one of the three GO ‘aspects’ [21]; see Gaudet et al., 2017 for more details about the organization of GO) [22] as “the execution of a genetically-encoded biological module or program. It consists of all the steps required to achieve the specific biological objective of the module. A biological process is accomplished by a particular set of molecular functions carried out by specific gene products (or macromolecular complexes), often in a highly regulated manner and in a particular temporal sequence.” [23] This led to the removal (obsoletion) of a number of terms that were groupings of similar phenotypes rather than biological programs, such as some signaling or developmental pathways. One of these terms was “pathogenesis” (GO:0009405), which was removed in 2021. At the time of its obsoletion, over 277,000 UniProt accession numbers were annotated with the term [7]. The overwhelming majority of these annotations were automated, assigned to UniProt entries containing the UniProt keyword “Pathogenesis”. The term was out of scope for GO since pathogenesis does not describe a set of coordinated activities leading to the execution of a biological program. The annotated genes were a mixed bag of sequences that either interfered with host function or affected the symbiont’s fitness. Moreover, because the term was so broad, it was impossible to assess from these annotations how the sequence was involved in pathogenesis, which host sequence or system was targeted, and which host taxon was targeted. Finally, this term had no children that could have provided more pertinent information.
For broader adoption of the knowledge gained during sequence annotation for the IARPA Fun GCAT project as reflected in the terms that were developed (FunSoCs and PathGO), it was decided to revise the gene ontology (GO) biological process terms under GO:0044419: “biological process involved in interspecies interaction between organisms”. GO is widely used and interoperable with other resources and ontologies. PathGO was limited in that it mixed molecular function with biological process terms so that they could not readily be imported into GO. Instead of attempting to revise PathGO, it was deemed more practical to transfer the relevant terms to GO.
GO aims to describe the normal, evolved function of genes. In the case of genes from pathogenic microbes that adversely affect the host, these have apparently evolved to exert these effects. They include suppressing immunity and overcoming barriers for the sake of microbial colonization, replication, and transmission. Secondarily, these effects produce a loss of homeostasis (= disease) in the host.
GO has been described as “under-utilized for prokaryotes, single-celled eukaryotic species, and viruses” when compared to model multi-cellular eukaryote species [24]. Expanding the terms relating to pathogenic microbes for GO should be useful for systems biologists and data scientists interested in studying the biological processes of infectious diseases. The Gene Ontology (GO) project offers a structured network of interconnected 'terms' or 'classes' that define the functions of gene products. It also explicitly links these terms to the corresponding gene products that perform these functions. This framework is particularly tailored to facilitate the computational modeling of biological systems across all organisms [21, 25].
Annotating pathogenic functions with GO terms
This revision improved GO term structure, definitions, and consistency in the “biological process involved in interspecies interaction between organisms” branch (GO:0044419). During the course of the revision, we eliminated what we felt were repeated, misleading phrases referring to homeostatic regulation within an organism. We concluded that these phrases, including “negative regulation” and “positive regulation”, should be avoided when describing multi-organism interactions because the language of regulation is inappropriate for pathogenic interactions, though it could be appropriate for mutualistic or commensal interactions. For pathogenesis, the goals of the symbiont and the host are largely in conflict, with the pathogenic symbiont attempting to either subvert or evade the normal, evolved operations of the host immune system while the host attempts to limit pathogen spread within it, sometimes even suffering damage from its own response to the pathogen.
Terms describing multi-organism processes should not obscure which organism has the initiative when one of the two is responsible for the activity. To make this explicit, we have adopted the syntax “symbiont-mediated” to indicate that a symbiont sequence is the instigator. Previously developed terms such as “viral entry into host cell” (GO:0046718) were written to be agnostic so as to “annotate both viral and host proteins participating in the entry process” [24].
In the course of renovating GO, we have begun stripping from the term the type of microbe involved in the interspecies interaction (viral, bacterial, protozoal, fungal). The taxon of the sequence involved in the function described by the GO term can be determined directly from the accession number of the sequences specified. Moreover, making the terms more universal will be useful for identifying commonalities between microbial pathogens.
In authoring new terms for subversion of innate immune signaling, we tried to reference each discrete signaling component for cells that contribute to innate immune defense. We also generated new terms for sequences that (1) alter host cytoskeletal dynamics, (2) enable ‘passive’ immune evasion by altering microbial molecules so they are less detectable, (3) frustrate host complement activity, (4) adhere to different types of host cell surface molecules, and (5) change host endomembrane biology. We expanded terms to recognize the different host molecules to which microbial adhesins and attachment proteins adhere, both on host cells and within the extracellular matrix.
The syntax can be generalized as:
“symbiont-mediated (perturbation/ suppression/ activation) of host [biological process]”
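As a minimal sketch of how this pattern could be handled programmatically, the snippet below builds and recognizes labels that follow the generalized syntax. The function names `build_term` and `parse_term` are hypothetical, and the pattern covers only the three effect words given in the template; actual GO labels may deviate from it, so this is an illustration and not a validator for GO.

```python
import re

# The three effect words that appear in the generalized syntax.
EFFECTS = ("perturbation", "suppression", "activation")

# Matches "symbiont-mediated <effect> of host <biological process>".
TERM_PATTERN = re.compile(
    r"^symbiont-mediated (?P<effect>" + "|".join(EFFECTS) + r") of host (?P<process>.+)$"
)


def build_term(effect, host_process):
    """Compose a label following the generalized syntax."""
    if effect not in EFFECTS:
        raise ValueError(f"effect must be one of {EFFECTS}")
    host_process = host_process.strip()
    if not host_process:
        raise ValueError("host_process must not be empty")
    return f"symbiont-mediated {effect} of host {host_process}"


def parse_term(label):
    """Return (effect, host process) if the label fits the syntax, else None."""
    match = TERM_PATTERN.match(label.strip())
    if match is None:
        return None
    return match.group("effect"), match.group("process")


# The root label from the text fits the pattern; an older, agnostic label does not.
root = parse_term("symbiont-mediated perturbation of host process")
older = parse_term("viral entry into host cell")
```

Splitting the effect from the host process in this way would let a downstream tool group terms either by what the symbiont does or by which host process is targeted.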
These new terms capture discrete ways in which sequences from microbial pathogens exploit specific host processes so that both machines and humans can better recognize them. Many of the terms are listed in the tables in the following section where they are correlated with the relevant FunSoC term. In addition, we include a supplementary spreadsheet of 320 proteins from ~ 120 species (bacteria, viruses, protozoa, fungi, and a parasitic fluke) each with a UniProt accession, and annotated with 95 of the terms, illustrating their use (Supplemental_JBS_GO-PathGO_annotations_2.xlsx). GO can be downloaded at https://geneontology.org/docs/download-ontology/ and browsed at https://amigo.geneontology.org/amigo. We hope these new and renovated GO terms will lead to general improvements in SoC annotation in secondary and composite databases. We anticipate this will allow bioinformaticians, systems biologists, and other biological data scientists to investigate commonalities across a range of hosts and symbionts.
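For readers who download the ontology in OBO flat-file format, the following sketch shows one way to collect every term beneath a chosen root by following `is_a` links. It is written against a tiny inline example and does not read the real file. In that example, the `EX:0000001` term is hypothetical, and placing GO:0044003 directly under GO:0044419 is a simplifying assumption that may not match the actual GO hierarchy. The parser ignores obsolete flags and non-`is_a` relationships, which a real analysis would need to consider.

```python
from collections import defaultdict

# Toy OBO text: the two GO identifiers come from the article; the EX term and
# the direct parent-child link between the GO terms are illustrative assumptions.
TOY_OBO = """\
format-version: 1.2

[Term]
id: GO:0044419
name: biological process involved in interspecies interaction between organisms

[Term]
id: GO:0044003
name: symbiont-mediated perturbation of host process
is_a: GO:0044419 ! biological process involved in interspecies interaction between organisms

[Term]
id: EX:0000001
name: hypothetical child term
is_a: GO:0044003 ! symbiont-mediated perturbation of host process

[Typedef]
id: part_of
name: part of
"""


def parse_obo_terms(text):
    """Parse [Term] stanzas into {id: {"name": str, "parents": [ids]}}."""
    terms = {}
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            in_term = line == "[Term]"
            current = None
            continue
        if not in_term or ": " not in line:
            continue
        key, value = line.split(": ", 1)
        if key == "id":
            current = {"name": "", "parents": []}
            terms[value] = current
        elif current is not None and key == "name":
            current["name"] = value
        elif current is not None and key == "is_a":
            # Keep only the identifier, dropping the "! label" comment.
            current["parents"].append(value.split()[0])
    return terms


def descendants(terms, root):
    """Return the ids of all terms reachable from root via is_a links."""
    children = defaultdict(set)
    for term_id, term in terms.items():
        for parent in term["parents"]:
            children[parent].add(term_id)
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


toy_terms = parse_obo_terms(TOY_OBO)
under_root = descendants(toy_terms, "GO:0044419")
```

Applied to a downloaded ontology file, the same traversal rooted at GO:0044419 would be one way to enumerate the interspecies-interaction branch, subject to the caveats above.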
New and renovated symbiont-host GO terms
In the following eight tables, 95 new and revised GO terms that are children of “biological process involved in interspecies interaction between organisms” (GO:0044419) and directly relevant to microbial pathogenesis are presented. Tables 1, 2, and 3 detail ways in which a microbe can damage a host. Table 4 lists processes by which a microbe can evade host innate immune detection. Table 5 describes processes by which a microbe can actively subvert host innate immune signaling. Table 6 contains processes by which the symbiont frustrates host innate immune effectors downstream of signaling. Table 7 lists terms related to attachment (adherence), invasion, and dissemination in a host. Table 8 describes some of the ways in which a parasitic symbiont can manipulate host cell biology.