Zero-shot prediction of mutation effects with multimodal deep representation learning guides protein engineering

Construction of protein point cloud

An arbitrary protein structure can be represented as a contact map (a L × L matrix where each element is the distance between two residues of a protein with length L) or a graph (e.g., nodes are residues, and the edges are inter-residue interactions). While a contact map disregards atomic structural information (e.g., coordinates of each atom), a graph contains fine-grained details in protein structures but is often computationally expensive. Point cloud is a commonly used format for 3D data and has shown tremendous success in many areas, including computer vision, autonomous driving and robotics.78 A point cloud can be represented as an unordered set of 3D points , where all points are homogeneous and each point Pi is a vector of its (x, y, z) coordinates. Compared with contact map and protein graph, point cloud has the advantage of preserving atomic geometric information without sacrificing computational efficiency.

Compared with the naïve point cloud, which is unordered and homogeneous, our proposed protein point cloud consists of ordered and heterogeneous points that are extracted from its raw structure. Specifically, each point corresponds to the alpha C atom of an amino acid. In addition to the 3D coordinates of each point (x, y, z), the type of residue each point belongs to (R) and the position of each residue in the protein sequence of length L (P) is attached as point features.

Definition of each point in protein point cloud: [x, y, z, R, P],

R ∈ (G, A, V, L, I, S, T, C, M, D, E, N, Q, R, K, F, Y, W, P, H)

P ∈ (1, 2, 3, …, L)

Architecture of the multimodal deep representation learning model

To decipher protein functions at the residual resolution, we develop a multimodal protein representation learning model (~659.3 million parameters). It applies an encoder–decoder architecture to simultaneously learn sequence context and structure context from millions of proteins (Supplementary information, Fig. S1). For a protein of length L, the encoder takes the masked sequence and the masked protein point cloud as input and generates a K-dimensional feature vector for each amino acid. The latent representations (L × K) are then fed into the decoder to complete the missing elements of the corrupted sequence and protein point cloud. K is set to 1280 during training and inference.

The sequence embedding module, the transformer encoder module and the sequence decoder module apply similar networks to that of the current protein language models.23,24 Specifically, the transformer encoder module is a 33-layer stacked Transformer, and each layer consists of one layer normalization block, one 20-head attention block and one feed-forward network. The sequence decoder, comprising two linear layers with GELU activation and layer normalization, serves to decode the multimodal features (L × K) of a protein into the probability distribution of each token in the alphabet (L × 33).

The global features of a protein tertiary structure should be invariant to arbitrary input poses, which means that 3D translations and rotations of the input protein structure should not affect the output. To guarantee such invariance, we chose the NVIDIA-optimized version of SE(3)-Transformer79 as the structure embedding module, which contains one 8-head attention block interspersed with one normalization module, one TFN layer and one max pooling layer. We used 1 layer SE(3)-Transformer for large-scale training. The structure decoder employs a three-layer multi-layer perceptron (MLP) network with ReLU activation to decode the multimodal features (L × K) of a protein into the 3D coordinates of each alpha C atom (L × 3).

To capture the structure context of a protein, the structure embedding module first calculates the K nearest neighborhoods centered on each point as well as their relative positions. Next, an equivariant weight matrix is built upon the Clebsch–Gordon coefficient and spherical harmonics to guarantee the equivariance of point features during transformation. Third, the attention mechanism is applied to pass features between adjacent points. Finally, point features are aggregated and pooled to output the final structure context.

Model training

We used proteins from the AlphaFold protein structure database as the self-supervised training dataset. It contains ~200 million structures predicted by AlphaFold2. We removed proteins shorter than 64 aa and those with average pLDDT (predicted Local Distance Difference Test) score lower than 70. It deserves to be mentioned that we did not impose a minimum threshold on the average pLDDT of a predicted structure during the inference stage. We randomly selected ~0.5 million proteins for validation. The final training dataset contains ~160 million proteins. Both amino acid sequence and protein point cloud were extracted from the raw protein structure for multimodal training. Since ~95.88% Uniparc sequences contain fewer than 1024 aa, we set the context size to 1024. For proteins longer than 1024 aa, we sampled the start position of amino acids from uniform distribution [1, n – x + 1] where n is the length of protein minus 1024, and x is sampled from uniform distribution [0, n]. For proteins shorter than 1024 aa, padding tokens were appended to their sequence, and random alpha C atoms selected from the raw structures were appended to the extracted protein point clouds.

The extracted amino acid sequence and protein point cloud were then corrupted and recovered by the proposed multimodal model during training. To mask the protein sequence, we randomly sampled 15% of tokens from the sequence after tokenization as utilized in BERT80 and ESM.23 Each of the sampled token was replaced with a special mask token with 80% probability, a randomly chosen alternate amino acid token with 10% probability, and the original input token (i.e., no change) with 10% probability. To mask the protein point cloud, we referred to the commonly used mask ratio in current point completion networks81,82 and chose to mask 256 points (25% of 1024) from the original data. Specifically, we calculated the central point of the protein and chose 256 nearest neighbor points centered on it. We masked the coordinates of these points and trained the proposed multimodal network to automatically recover them.

The loss function is a sum of a categorical cross-entropy (CE) loss and a permutation-invariant chamfer distance (CD) loss.83 In particular, the CE loss measures the differences between the model’s predictions and the true token for masked amino acid sequence. The CD loss quantifies the completion results by calculating the average nearest squared distance between the recovered protein point cloud and the ground truth. By minimizing the CE loss and the chamfer distance loss, our proposed model learns high-order representations of a protein in a self-supervised manner.

All layers except the transformer encoder module are initialized from a zero-centered normal distribution with a standard deviation of 0.02. The transformer encoder module is initialized with parameters of ESM1b.23 We trained the multimodal deep representation learning model for 380 K steps using Adam optimizer (β1 = 0.9, β2 = 0.999) at initial learning rate 1e−4 with batch size of 480. The learning rate increases linearly during a warm-up period of 10,000 steps. Afterward, the learning rate follows an inverse square root decay schedule. The training was conducted on 15 nodes interconnected with InfiniBand, where each node contains 8 NVIDIA A100 GPUs.

Benchmarking multimodal representations with function-related datasets

We used 15 function-related datasets (see Supplementary information, Table S3) to benchmark the performance of our multimodal deep representation learning model compared to a comprehensive suite of baselines. We used the standard partition of each dataset and ensured that the training set and the test set were non-redundant. Protein representations generated by our model are either fed into an MLP or integrated into a customized model to make final predictions. According to the downstream network, two types of representations are used, including the residual-level representation of the protein (protein length × 1280), and the molecular-level representation that averaged across the length of the protein (1280). Details are introduced as follows.

EC annotation tasks

EC number84 is a commonly used classification scheme that specifies the catalytic function of an enzyme by four digits. Four diverse datasets were used for benchmarking. The EC-PDB dataset, which was constructed by Gligorijevi´c et al.,34 consists of 19,198 non-redundant proteins and covers 538 third and fourth levels EC numbers of the EC tree. It was partitioned into training, validation and test sets, with approximate ratios of 80%/10%/10%. Proteins in the test set have varying degrees of sequence identity and structure identity (30%, 40%, 50%, 70% and 95%) to the training set (see Supplementary information, Tables S4, S5). Hermosilla et al.85 constructed the EC-384 dataset, which contains 37,428 proteins from 384 fourth-level EC numbers. The entire dataset was split into training, validation and test sets. In addition, proteins in each set do not have more than 50% of sequence similarity with proteins from the other sets. The EC-New-392 dataset and the EC-Price-149 dataset are test sets used in contrastive learning-enabled enzyme annotation (CLEAN).86 Specifically, EC-New-392 consists of 392 proteins covering 177 different EC numbers from Swiss-Prot released after April 2022. EC-Price-149 is a collection of 149 proteins validated by experiments described by Price et al.87 We replaced the raw input of CLEAN with representations generated by our proposed model and kept the model unchanged. Then we used the same training set to train CLEAN (denoted by Ours-CLEAN) and tested its performance on both EC-New-392 and EC-Price-149.

GO annotation tasks

GO annotations capture statements about how a gene functions at the molecular level (MF), where in the cell it functions (CC) and what biological processes (BP) it is involved.88,89 The GO-MF, GO-CC and GO-BP datasets were constructed by Gligorijević et al.34 They only selected GO terms with at least 50 and no more than 5000 samples, forming ~36 K non-redundant PDB chains that cover 489, 320 and 1943 GO terms at different hierarchies in MF, CC and BP, respectively. We used the same partitioning scheme to split these protein sequences into training, validation and test sets, with approximate ratios of 80%/10%/10%. Proteins in the test set have varying degrees of sequence identity and structure identity (30%, 40%, 50%, 70% and 95%) to the training set (see Supplementary information, Tables S4, S5). Furthermore, each protein in the test set has corresponding experimentally determined PDB structures and at least one experimentally determined annotation. Each GO term was treated as a separate label during training and testing. Hierarchies and distances between GO terms were not considered.

Cross-species PPI prediction tasks

The cross-species PPI prediction task involved experiments conducted on the D-SCRIPT dataset,90 which is derived from the STRING database. This dataset comprehensively covers PPIs across various species, including human, mouse, fly, yeast and Escherichia coli. In the human subset, there are ~38,000 PPIs in the training set and 25,000 in the test set. All PPIs from other species were integrated into the test set, with 22,000 PPIs for E. coli and 55,000 for fly, yeast and mouse. Negative samples were generated by randomly pairing proteins from the non-redundant set, with their quantity being ten times that of the positive samples. This methodology aligns with the understanding that true PPIs are infrequent. For predictive modeling, a human PPI-trained model was employed to predict PPIs in the test set. Importantly, all protein structures used in this study were sourced from the AlphaFold protein structure database.

Virus–human PPI prediction tasks

The virus–human PPI prediction task is conducted in our study. We employed three datasets curated by Dong et al.91 to assess the performance of our model in predicting virus–human PPIs. These datasets encompass interactions between human proteins and virus proteins such as Ebola and H1N1. Each dataset comprises thousands of human proteins and hundreds of virus proteins (refer to Supplementary information, Table S6 for details). The structures of human proteins were obtained from the AlphaFold protein structure database, while the prediction of virus protein structures was carried out using ESMFold. The number of both negative and positive PPIs for training in the Ebola dataset was 11,341, while 150 non-redundant PPIs were used as the test set. In the H1N1 dataset, the training split includes 10,858 negative and positive PPIs, while the test split comprises 381 non-redundant instances of negative and positive interactions. In the DENOVO dataset, the training set is composed of 5020 positive samples and 4734 negative samples, and the test split consists of 425 non-redundant instances of positive and negative PPIs.

Multi-class PPI prediction tasks

The multi-class PPI prediction task is undertaken using two datasets, namely SHS148K and STRING, curated by GNN-PPI.92 These datasets comprise 44,488 and 593,397 multilabel PPIs, respectively. Both two datasets employ two heuristic evaluation schemes based on the PPI network for splitting. Specifically, the Breadth-First Search (BFS) and Depth-First Search (DFS) algorithms are utilized to explore ~20% of the PPIs, forming the test set. All PPIs are categorized into seven types: activation, inhibition, reaction, binding, expression, catalysis and post-translational modifications. Each pair of interacting proteins is associated with at least one of these labels. The protein structures utilized in the analysis are retrieved from the AlphaFold protein structure database, encompassing 5189 proteins in SHS148K and 15,335 proteins in STRING.

For EC-PDB, EC-384, GO-BP, GO-MF, GO-CC, PPI-Mouse, PPI-Fly and PPI-E. coli, we respectively constructed an MLP classifier as described in ref. 33 to decode the representations generated by different methods (see Supplementary information, Tables S1, S2). For PPI-SHS148K and PPI-STRING, we constructed an eight-layer stacked transformer with a hidden size of 256. While CNN, ResNet, LSTM and Transformer were initialized randomly and trainable,93,94 the parameters of other pre-trained models were frozen during training. In particular, pre-trained parameters of UniRep,7 ESM1b23 and ProtT524 were downloaded and used. Models and results of other pre-trained models were obtained from corresponding publications (Supplementary information, Table S7).33,85,92,95,96,97,98,99

For the rest of the datasets, our proposed model acted as a plugin model that generated latent representations for proteins and utilized existing customized models for further prediction. Specifically, we used CLEAN,86 for EC-New-392 and EC-Price-149. We replaced the raw input of CLEAN with representations generated by our proposed model and kept the model unchanged. For PPI-DENOVO, PPI-EBOLA and PPI-H1N1, we used multimodal representations as the input of the graph model proposed by Dong et al.91 and kept all hyper-parameters unchanged. Results of other methods were obtained from corresponding publications.

We probed the robustness of our proposed model in learning sequence–function and structure–function relationships from proteins with low sequence/structure similarity on downstream tasks. Specifically, BLAST was used to align the protein sequences of the test set to the training sequences and computed the identity score. We used TMScore100 to assess the topological similarity between protein structures of the test set and the training set. Five similarity cutoffs were used to partition each test set into multiple groups (see Supplementary information, Tables S4, S5).

Multimodal context benchmarking

We used three benchmarks to investigate whether the sequence context and the structure context of a protein could be captured by ProMEP.

Protein sequence context

We evaluated the micro-structure perception ability of our proposed multimodal network by quantifying its attention on functional sites. During training and inference, it applied the attention mechanism101 to probe sequence context and structural context. Calculating the attention score between residues of a protein allowed us to identify key positions that the model focused on. We randomly selected 10,000 proteins from Swiss-Prot and filtered out proteins without functional site annotations. We used the 80% identity-filtered subset, which contains 1325 proteins. We ranked all residues based on the attention score during the evaluation. Specifically, we computed the average attention score across all 33 layers and 20 heads during the quantification of the attention score for each amino acid in the sequence. Examples of attention visualization are shown in Fig. 3c and Supplementary information, Fig. S6a.

DeepFRI,34 HEAL42 and a random approach (denoted as Random) were used as baselines. Specifically, DeepFRI is a graph convolutional network that employs a pre-trained protein language model to extract sequence features and further constructs a residue graph to predict protein functions. In addition, DeepFRI could identify the contribution of each residue to the predicted function. We utilized the official model checkpoints of DeepFRI and used the Molecular Function branch for evaluation. HEAL is a deep learning model for protein function prediction, which could capture structural features via a hierarchical graph transformer. We downloaded the pre-trained parameters of HEAL and applied the gradient-weighted Class Activation Map (grad-CAM)102 to rank the activation score of each residue. Random is a baseline strategy that randomly ranks the importance of each residue. We reported the average performance of Random across five runs. We used the commonly used Top-1 HR, NDCG and MRR as evaluation metrics.

Local structure context

To benchmark the ability of the proposed multimodal network to capture local structure context, we used the dataset constructed by Klausen et al.103 as the training and validation set. It contains 10,837 crystal structures obtained from PDB that filtered at the 25% identity threshold as well as the 2.5 Å resolution threshold. Among these structures, 10,337 structures were used as the training set, and 500 randomly selected structures were left out as the validation set. CB513,104 CASP12104 and TS115105 were used as test sets, which contain 507, 21 and 115 non-redundant structures, respectively. All proteins in the test set share no more than 25% sequence identity with the proteins present in the training set. Each amino acid of each structure was mapped to 8-class (Q8) secondary structure labels as described in ref. 106 Our proposed model acted as an encoder that generated residual-level representations, which were then fed into an MLP classifier as described in ref. 94 It is important to acknowledge that during the pre-training stage, our proposed multimodal deep representation learning model processed unlabeled protein structures without prior knowledge of the actual secondary structure labels assigned to each amino acid. Results of other methods were obtained from ref. 94 (Supplementary information, Table S7).

We also constructed two small-scale datasets to further benchmark multimodal representations in the condition of scarce training samples. The PDB-100 dataset consists of 100 single-domain proteins that are randomly selected from the SCOP database.107 For each protein, we obtained its structure and secondary structure annotations for each of the positions from the PDB.62 In addition to the SCOP-100 dataset, we constructed the CATH-100 dataset by randomly selecting 100 proteins from the dataset collected by Zhou et al.108 Each protein in CATH-100 has experimentally determined 3D structure as well as annotated B-factor and solvent-accessible surface area for each of the positions. While PDB-100 is a 3-class classification task, CATH-100 corresponds to two regression tasks. We compared the performance of our proposed multimodal network with several leading methods, including UniRep, ESM1b and ProtT5. For each task, we constructed a random forest model to make predictions on the basis of representations generated by each of these methods, respectively. We report the performance of each model in a 5-fold cross-validation manner on the entire dataset. We used the random forest model trained on SCOP-100 for subsequent secondary structure benchmarking and visualization.

Global structure context

The SCOPe database organizes protein domains into multiple hierarchies, including Family, Superfamily and Fold.107 In particular, the basis of classification for Folds is purely structural. As described in ref. 109 we used the 40% identity-filtered subset of SCOPe v2.07 as the benchmark set. It contains 13,265 domains that can be classified into seven classes (see Supplementary information, Table S8). We constructed a five-layer MLP (batch size: 24, learning rate 3e−5, dropped out ratio 0.2, Adam optimizer) as the decoder to classify multimodal representations to a specific fold class. We report the average F1 score of the 5-fold cross-validation results on the entire dataset. Leading structure-based methods were used as baselines, including GraSR and DeepFold. GraSR uses a contrastive learning framework to capture protein features from a protein graph of protein structure.109 DeepFold extracts structural motif features from protein contact maps via a deep convolutional neural network.110 SGM103 and SSEF104 are classical structural classification tools that uses 30 global backbone topological measures and frequencies of 1500 secondary structure triplets to encode protein structures, respectively. The performance of these methods are obtained from refs. 109,110 (Supplementary information, Table S7).

Zero-shot protein fitness modeling

ProMEP quantifies the log-likelihood of protein variants under the context of both sequence and structure. The calculation is shown in the equation. To model the fitness of a protein, the WT sequence S and structure context C are fed into ProMEP, which in turn outputs a sequence of log probabilities. We calculated the conditional probabilities of mutated amino acid mt and WT amino acid wt at each mutational position t. The sum over all mutated positions T is the final fitness score of a protein variant.

$$_\log p(_=_^}}}}}}_,C)-\log p(_=_^}}}}}}_,C)$$

We first evaluated ProMEP on three representative datasets. Specifically, the UBC9 dataset was constructed by Weile et al.35 and comprises experimental data obtained from growth-based complementation assays, where the fitness of 2563 protein variants with single mutations was estimated. Roscoe et al.36 constructed the RPL40A dataset that contains 1380 single mutations by quantifying yeast growth rate as a measure of experimental fitness. The protein G dataset12 contains 1045 single mutations and 535,917 double mutations with measured binding affinity.

We then used 66 DMS datasets that cover 53 proteins from the ProteinGym benchmark26 for the generalization test. All proteins derived from prokaryotes, human and other eukaryotes collected in the ProteinGym benchmark were included. Nineteen proteins derived from viruses were not used for evaluation because of the biases in the pre-training dataset.30 We used ESMFold to predict the structure of the WT protein in these datasets. For each WT protein, we collected ~300 homologous sequences from the NR database with sequence identity lower than 80% and predicted their structures via ESMFold. We then used these homologous samples to fine-tune ProMEP for 3 epochs with a learning rate of 1e−4. The fine-tuning procedure enables ProMEP to gain a better understanding of sequence and structure contexts from homologies sampled through evolution. AlphaMissense, which is the best method for mutation effect prediction, was used as a leading baseline. The SOTA protein language model, ESM2_3B and ESM2_650M, as well as the structure-enhanced protein language model, ProstT5, were also evaluated as baseline models. To predict mutation effects with ProtsT5, we first extracted residue embeddings of each protein, and then used VESPA to calculate the logits score. The predicted score of multiple mutations in ProtsT5 was obtained by adding the score of each single mutation. Results of other baseline methods were obtained from the ProteinGym benchmark (Supplementary information, Table S7).26

Pathogenic variant classification

The datasets employed for the pathogenicity prediction task are derived from two distinct sources: the ClinVar test set111 and de novo variants identified in individuals with rare diseases.

The ClinVar test set comprises a total of 30,884 pathogenic and 51,988 benign variants, distributed across 7951 proteins. To validate the ProMEP and AlphaMissense models on proteins characterized by shallow MSA depths, proteins with fewer high similarity sequences were selected. Specifically, we ascertained the number of high similarity sequences for proteins within the ClinVar dataset by aligning them with the Non-Redundant Protein Database using BLAST. Hits with an identity below 0.95 or coverage below 0.5 were systematically filtered out. We adhered to the criteria established by AlphaMissense,28 wherein proteins are retained only if they meet a minimum threshold of five benign and five pathogenic variants. We chose proteins with high similarity sequences < 100.

The second dataset is composed of de novo variants identified within both patients and healthy controls participating in the DDD cohort.43 This dataset comprises a total of 410 variants distributed across 156 proteins.

In the pathogenicity prediction task, ProMEP is not fine-tuned. For a given position i within the protein, the pathogenic likelihood of a residue undergoing mutation from amino acid type w to type m is calculated as follows:

$$_\left(\right)=}}}}}\left(}_^-}_^-\tau \right)$$

(1)

where logit represents the output of ProMEP, τ signifies a handcraft threshold. Specifically, τ is set to 6 during evaluation. A variant is considered pathogenic if Pi (w|m) > 0.5.

In addition to AlphaMissense, we performed a comparative analysis with several MSA-free baseline models, including ESM1b, ESM1v and Tranception. Pathogenicity calculations for ESM1b and ESM1v were conducted utilizing Eq. 1, with τ maintained consistently in accordance with ProMEP guidelines. Equation 1 is similarly applied to Tranception; however, τ is specifically set to 0. It is noteworthy that the hyperparameter τ exclusively impacts the accuracy metric, but with no discernible influence on the auROC metric.

ProMEP vs AlphaMissense and GEMME timing experiments

We randomly collected proteins with lengths from 100 aa to 1000 aa from UniProt. We compared the model inference of ProMEP and AlphaMissense. Time costs of preparing input for the model (e.g., searching MSAs from sequence databases in AlphaMissense, or predicting protein structure from protein sequence) and model initialization (e.g., Jax graph compilation times in AlphaMissense, or pre-trained weight loading times in ProMEP) were excluded. Since the trained AlphaMissense model weights are not released, we used randomly initiated weights during evaluation and did not count the weight loading times in AlphaMissense. We evaluated the model inference time of predicting mutation effects via ProMEP or AlphaMissense. For each length, we report the average time costs of at least three proteins. All experiments were run on a single NVIDIA V100 GPU.

ProMEP-guided protein engineering of TnpB

We began by ranking protein variants with single mutation. Specifically, we constructed a virtual saturation mutagenesis library that only contains single variants (7752 variants). We then ranked all variants via the calculated fitness score. Since X-to-R mutations (e.g., S72R) are commonly used in the engineering of CRISPR-Cas proteins, we chose the top 10 beneficial X-to-R variants and top 10 deleterious X-to-R variants from the entire ranked list for further evaluation. We also constructed a virtual mutagenesis library that consists of triple X-to-R mutants (8,510,740 mutants). Again, we calculated the fitness score of each variant. According to the experimental data from top 10 beneficial X-to-R single mutants, we filtered out mutants that contain neutral or negative mutations (Y388R, S217R, L398R, T405R, L406R, K44R and H403R) from top-ranked beneficial mutants. The top 10 beneficial triple mutants from the mutagenesis library were selected for further evaluation. P values were derived by a two-tailed Student’s t-test. All statistical analyses were performed on n  =  3 biologically independent experiments.

To generate variants with more mutations, we fine-tuned ProMEP based on the experimental results of TnpB mutants. The primary aim is to distinguish beneficial mutations from deleterious ones. In particular, TnpB mutants exhibiting a fold change > 1.2 were designated as positive samples, while those displaying a fold change < 0.8 were classified as negative samples. The fine-tuning dataset, consisting of 27 samples, was divided into a training set (80%) and a validation set (20%). We used a binary CE loss during the fine-tuning process:

Comments (0)

No login
gif