We constructed a large-scale pre-training corpus, scCompass-126M, comprising more than 120 million single-cell transcriptomes derived from humans and mice. In detail, we obtained 53,568,337 human single cells and 48,200,083 mouse single cells, a total of 101,768,420 cells. More than 90% (about 94.41%) of the data were obtained from publicly available datasets from various sources, including the Gene Expression Omnibus of the National Center for Biotechnology Information (NCBI), European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) ArrayExpress, and the China National Center for Bioinformation (CNCB) (Supplementary information, Table S1). Raw sequence data were downloaded from these databases, which are maintained in different regions of the world; each sample file was labeled with a unique ID, and the gene count matrices were obtained using Cell Ranger.46 The remaining data, already in gene count matrix form, were mainly downloaded from the CELLxGENE database.
A unified bioinformatics processing pipeline was utilized for high-quality data filtering (Supplementary information, Fig. S1a), as sketched in the code below. The pipeline consisted of the following steps: (1) filtering cells with fewer than 200 expressed genes; (2) filtering samples with fewer than 4 cells; (3) filtering cells with fewer than 7 expressed protein-coding or miRNA genes; (4) filtering cells with a proportion of mitochondrial genes exceeding 15%; (5) filtering cells whose number of expressed genes deviated by more than three standard deviations from the mean across all cells in the sample's gene expression matrix; and (6) dropping genes not in the core gene list. Informative gene annotations from Ensembl were used to define a core gene list comprising protein-coding genes, lncRNAs, and miRNAs. Categories such as pseudogenes, tRNAs, rRNAs, and other loci not capturable by current single-cell transcriptomics were excluded from the analysis.
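A minimal sketch of this per-sample filtering logic using Scanpy, covering steps (1) and (4)–(6) plus the sample-size check; the Ensembl-derived core gene list (`core_genes`) is assumed to be precomputed, and step (3) is omitted because it requires per-gene biotype annotations:

```python
# Sketch of the per-sample QC filtering; `core_genes` (Ensembl-derived
# protein-coding/lncRNA/miRNA list) is assumed to be precomputed.
import scanpy as sc

def filter_sample(adata, core_genes):
    # Step (6): keep only genes in the core gene list.
    adata = adata[:, adata.var_names.isin(core_genes)].copy()
    # Step (1): drop cells with fewer than 200 expressed genes.
    sc.pp.filter_cells(adata, min_genes=200)
    # Step (4): drop cells whose mitochondrial fraction exceeds 15%.
    adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True,
                               percent_top=None, log1p=False)
    adata = adata[(adata.obs["pct_counts_mt"] < 15).values].copy()
    # Step (5): drop cells whose expressed-gene count deviates by more than
    # 3 standard deviations from the sample mean.
    n_genes = adata.obs["n_genes_by_counts"]
    keep = (n_genes - n_genes.mean()).abs() <= 3 * n_genes.std()
    adata = adata[keep.values].copy()
    # Step (2): discard the whole sample if fewer than 4 cells survive.
    return adata if adata.n_obs >= 4 else None
```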
Our pretraining corpus included disease cells, cancer cells, and immortalized cell lines. Metadata descriptions in the datasets, including single-cell perturbation, cancer cell identification, gender distribution, and cell differentiation time, were used to assess the diversity of the datasets (Supplementary information, Fig. S1c).
GeneCompass architecture and pre-training

GeneCompass architecture
GeneCompass employed a self-attention transformer to encode each single-cell transcriptome. We denoted the number of transformer layers as L, the number of self-attention heads as H, and the hidden size as D. We primarily utilized a transformer with \(L=12\), \(H=12\), \(D=768\), whose total parameter count exceeded 100,000,000. GeneCompass operated on a sequence of 2048 genes for each cell sample, with each sequence ordered by ranked high expression. For a given gene, the gene ID, expression value, corresponding prior knowledge (promoter, GRN, gene family, and co-expression), and a special token indicating species were concatenated and further encoded into a 768-dimension embedding. A position embedding indicating the gene ranking was also added to the input. Gaussian Error Linear Units (GELUs) were employed for nonlinear activation, and the dropout probability for both self-attention and dense layers was 0.02 (standard deviation of the initializer for weight matrices: 0.02; epsilon for layer-normalization layers: 1 × 10–12). Code for model configuration, data loading, and training was implemented with PyTorch and the Hugging Face Transformers library.47 Furthermore, we extended the library to accept scalable external knowledge as input.
GeneCompass pre-training and optimization
Inspired by self-supervised learning in the NLP domain, a masked language modeling strategy8 was employed to randomly mask genes, including their IDs, expression values, and prior knowledge, during pre-training. In detail, 15% of the genes in each cell were randomly selected to be masked. In contrast to existing studies,14,15,17 GeneCompass built a multi-task learning paradigm to simultaneously predict both the expression value and the ID of the masked genes from the encoded gene embeddings. We used the MSE loss for gene expression prediction, defined as follows:
$${\mathcal{L}}_{\mathrm{MSE}}=\frac{1}{N_{\mathrm{mask}}}\sum _{i=1}^{N_{\mathrm{mask}}}{\left({\hat{y}}_{i}-y_{i}\right)}^{2}$$
where \(N_{\mathrm{mask}}\) denoted the number of masked genes, \({\hat{y}}_{i}\) denoted the predicted gene expression, and \(y_{i}\) denoted the ground-truth expression. Cross-entropy (CE) loss was employed for gene ID prediction, defined as follows:
$${\mathcal{L}}_{\mathrm{CE}}=-\frac{1}{N_{\mathrm{mask}}}\sum _{i=1}^{N_{\mathrm{mask}}}p\left(x_{i}\right)\log q\left(x_{i}\right)$$
where \(p(x_{i})\) denoted the probability of the real gene ID and \(q(x_{i})\) denoted the probability of the predicted ID. The overall training objective combined the MSE loss for gene expression and the CE loss for gene ID:
$$\min \,(1-\beta ){\mathcal{L}}_{\mathrm{MSE}}+\beta {\mathcal{L}}_{\mathrm{CE}}$$
where β was the hyperparameter balancing the gradients of the two tasks.
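As an illustration, a minimal PyTorch sketch of this multi-task objective, assuming the model emits per-gene expression predictions and gene-ID logits at the masked positions (the value of β shown is illustrative only, as the paper does not report it):

```python
# Minimal sketch of the masked-gene multi-task loss: MSE on expression
# recovery plus cross-entropy on gene-ID recovery, balanced by beta.
import torch
import torch.nn.functional as F

def pretraining_loss(pred_expr, true_expr, id_logits, true_ids, beta=0.5):
    """pred_expr/true_expr: (N_mask,) tensors; id_logits: (N_mask, vocab);
    true_ids: (N_mask,) gene-ID indices; beta balances the two tasks."""
    loss_mse = F.mse_loss(pred_expr, true_expr)       # expression prediction
    loss_ce = F.cross_entropy(id_logits, true_ids)    # gene-ID prediction
    return (1 - beta) * loss_mse + beta * loss_ce
```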
The pre-training parameters were as follows: to make full use of GPU memory, the batch size was set to the maximum allowable value of 10 for the 12-layer transformer model. The learning rate followed a linear decay schedule with 10,000 warm-up steps, with a maximum learning rate of 5e–5 under the AdamW optimizer. GeneCompass was pre-trained for 3 epochs, after which the loss essentially stopped decreasing. The whole pre-training process took 9 days on 4 × 8 NVIDIA A800 GPUs.
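A sketch of this optimization setup using the Hugging Face scheduler utility; the stand-in model and total step count are placeholders, not the actual GeneCompass values:

```python
# Sketch of the reported optimizer/schedule: AdamW, linear warm-up/decay,
# 10,000 warm-up steps, peak learning rate 5e-5.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)   # stand-in for the 12-layer transformer
total_steps = 300_000               # placeholder; depends on corpus size

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=total_steps
)
```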
Cell and gene embedding
We obtained the cell and gene embeddings from the output of the last layer of GeneCompass. GeneCompass encoded each gene into a 768-dimension embedding that captured its context within the cell, and the embedding of the special token was used as the cell embedding indicating the cell status.
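A sketch of how such embeddings could be read off a Hugging Face-style encoder; the assumption that the special (species) token sits at position 0 is illustrative rather than the exact GeneCompass API:

```python
# Sketch: extract cell and gene embeddings from the last hidden layer,
# assuming the special token occupies position 0 of each sequence.
import torch

def extract_embeddings(model, input_ids):
    """Return (cell_embeddings, gene_embeddings) from the last hidden layer."""
    with torch.no_grad():
        out = model(input_ids=input_ids, output_hidden_states=True)
    hidden = out.hidden_states[-1]   # (batch, seq_len, 768)
    cell_emb = hidden[:, 0, :]       # special-token embedding = cell state
    gene_emb = hidden[:, 1:, :]      # one 768-d contextual vector per gene
    return cell_emb, gene_emb
```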
Ablation experiments
An ablation study of GeneCompass with different kinds of prior knowledge was performed using the full human single-cell corpus (~55 million cells). To limit the computational cost, GeneCompass here utilized a smaller self-attention transformer with \(L=6\), \(H=4\), \(D=256\). Specifically, GeneCompass variants incorporating zero, one, or all kinds of prior knowledge were pre-trained. To compare with Geneformer, we also re-trained Geneformer on the same dataset as GeneCompass. To clarify the impact of input data on model performance, we further built two small datasets of 5 million single cells each by random selection from the Geneformer and GeneCompass corpora; Geneformer and GeneCompass were then pre-trained on the corresponding datasets. The experimental results were based on fine-tuning these pre-trained models on the full panel of downstream tasks with cross-validation.
Downstream task fine-tuning
For each downstream task, the pre-trained GeneCompass was fully fine-tuned using limited data. A task-specific decoder (e.g., a dense layer) was appended to the 12-layer transformer encoder of GeneCompass. For the comparative methods on downstream tasks, all hyperparameters were tuned with the same process as GeneCompass to guarantee a fair comparison. Specifically, we first used the official codebases of the comparative methods; we then performed the same hyperparameter tuning process on each model (covering learning rate, batch size, and number of iterations); finally, we compared the best performance achieved by each model.
Knowledge embedding and incorporation
Four types of prior biological knowledge of a gene, namely GRN, promoter information, gene family annotation, and gene co-expression relationships, were each encoded into a 768-dimension embedding and incorporated with the gene ID and expression value.
GRN embeddings
We used paired gene expression and chromatin accessibility data from the Encyclopedia of DNA Elements (ENCODE) to construct GRNs with PECA2.48 In total, 84 mouse and 76 human GRNs were generated (Supplementary information, Fig. S1f). We then embedded the gene pairs with regulatory relationships into vectors using the gene2vec method.25
Promoter embeddings
The promoter is a noncoding sequence of a gene that serves as an activating signal for gene transcription. The promoter for each gene comprised 2500 bases: the 500 bases upstream of the transcription start site (TSS) and the 2000 bases downstream. The pre-trained DNABERT model24 was fine-tuned on the promoter sequences for 40 epochs to obtain 768-dimension promoter embeddings.
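For concreteness, a small strand-aware sketch of the assumed promoter-window computation (0-based coordinates; the clamping at chromosome start is illustrative):

```python
# Sketch: derive the 2,500-base promoter window (500 bases upstream of the
# TSS plus 2,000 bases downstream), accounting for strand orientation.
def promoter_window(tss: int, strand: str) -> tuple[int, int]:
    if strand == "+":
        return max(0, tss - 500), tss + 2000
    else:  # on the minus strand, "upstream" lies to the right of the TSS
        return max(0, tss - 2000), tss + 500
```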
Gene family embeddings
The human gene family data are from the HUGO Gene Nomenclature Committee (HGNC) database and can be downloaded from https://www.genenames.org/download/custom/. The mouse gene family data were derived from the human gene family data based on homologous genes between humans and mice; the homologous genes are from the BioMart database, a sub-database of the Ensembl database. Analyzing the genes of the two organisms and their family relationships yielded 1539 gene families for mice and 1645 gene families for humans. Any two genes belonging to the same gene family were regarded as a gene pair to construct a gene-pair list. All gene pairs were encoded by the gene2vec method.25 We retained the structure of the original gene2vec but changed the embedding dimension from 256 to 768 during training. Eventually, each gene was represented as a 768-dimension embedding (Supplementary information, Fig. S1e).
Co-expression embeddings
We employed a uniform sampling strategy, selecting 3000 cells from each gene expression matrix. This step aimed to comprehensively cover the range of gene expression, capture overall features, and avoid bias towards specific cell types or expression levels. We calculated the Pearson correlation coefficient (PCC) between pairs of expressed genes (count ≥ 1) (Supplementary information, Fig. S1d). This nonzero-based correlation calculation avoided unnecessary computation over zero values, ensuring the practicality and interpretability of the results. Gene pairs with PCCs larger than 0.8 were selected and embedded using the gene2vec method.
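A minimal NumPy sketch of this nonzero-based co-expression step under the stated parameters (3000 sampled cells, PCC > 0.8); the brute-force pair loop is for clarity, not efficiency:

```python
# Sketch: sample cells, compute Pearson correlation over cells where both
# genes have count >= 1, and keep pairs with PCC above the cutoff.
import numpy as np

def coexpressed_pairs(counts, n_cells=3000, pcc_cutoff=0.8, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(counts.shape[0],
                     size=min(n_cells, counts.shape[0]), replace=False)
    x = counts[idx]                                  # (cells, genes)
    pairs = []
    n_genes = x.shape[1]
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            both = (x[:, i] >= 1) & (x[:, j] >= 1)   # nonzero-based filter
            if both.sum() < 3:                       # too few cells to correlate
                continue
            r = np.corrcoef(x[both, i], x[both, j])[0, 1]
            if r > pcc_cutoff:
                pairs.append((i, j, r))
    return pairs
```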
Multiple downstream tasks

Gene embedding analysis
This task was designed to analyze gene–gene interactions by performing in silico deletion without comparison against perturbed conditions. The in silico perturbation was conducted by removing the target gene from the single-cell transcriptome. The shift of each remaining gene was calculated by comparing the similarity of its gene embeddings between the original and in silico perturbed states; a larger shift denoted a higher correlation with the perturbed gene. The GRN inference shown in Supplementary information, Fig. S4a was performed based on in silico deletion of TFs from ChIP-Atlas related to PBMC cells on GSE43036. For each transcription factor, we calculated the precision of TFs predicted by GeneCompass versus random selection.
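A sketch of the in silico deletion analysis; `embed_genes` is a hypothetical helper standing in for a forward pass of the pre-trained encoder:

```python
# Sketch: delete the target gene from the ranked input, re-embed the cell,
# and measure each remaining gene's embedding shift via cosine similarity.
import torch
import torch.nn.functional as F

def embed_genes(model, gene_ids):
    # Hypothetical helper: run the encoder, return per-gene embeddings.
    with torch.no_grad():
        out = model(input_ids=gene_ids.unsqueeze(0), output_hidden_states=True)
    return out.hidden_states[-1][0]                   # (seq_len, 768)

def embedding_shift(model, gene_ids, target_pos):
    """Shift of every remaining gene after deleting the gene at target_pos."""
    orig = embed_genes(model, gene_ids)
    kept_ids = torch.cat([gene_ids[:target_pos], gene_ids[target_pos + 1:]])
    pert = embed_genes(model, kept_ids)
    kept_orig = torch.cat([orig[:target_pos], orig[target_pos + 1:]])
    # Larger shift (lower similarity) = stronger link to the deleted gene.
    return 1 - F.cosine_similarity(kept_orig, pert, dim=-1)
```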
Single-species cell type annotation
To simulate the application scenario of GeneCompass more realistically, we selected samples with different biological backgrounds and batches as the training and testing sets, respectively. To predict single-species cell types, a fully connected layer was added on top of the cell embeddings generated by GeneCompass, with cross-entropy loss as the objective function. For human-specific and mouse-specific tasks, we compared pre-trained GeneCompass with GeneCompass without pre-training, Geneformer, and TOSICA on human multiple sclerosis (hMS), lung (hLung), and liver (hLiver) datasets, and mouse brain (mBrain), lung (mLung), and pancreas (mPancreas) datasets. Details of the datasets used for model training and validation can be found in the Supplementary Methods. Through hyperparameter tuning, each model obtained hyperparameters better suited to the datasets, which also facilitated comparison between methods. For GeneCompass and Geneformer, the learning rate was set to 5e–5, the batch size to 16, and the number of training epochs to 50; for TOSICA, the learning rate was set to 5e–5, the batch size to 16, and the number of training epochs to 20. For scGPT, we used the results on the corresponding dataset reported in the original article.
Cross-species cell type annotation
To predict cell types across species, we combined GeneCompass with CAME49 to obtain a heterogeneous graph neural network called GeneCompass+CAME, in which cells and genes were modeled as heterogeneous nodes (Supplementary information, Fig. S5e). Furthermore, similar to CAME, we created a heterogeneous graph with six edge types: "cell to gene", "gene to cell", "cell to cell", "gene to gene", "cell self-loop", and "gene self-loop", whose corresponding weights (shared across species) we denoted as \(W_{cg}\), \(W_{gc}\), \(W_{cc}\), \(W_{gg}\), \(W_{c}\), and \(W_{g}\), respectively. Unlike CAME, GeneCompass+CAME adopted gene embeddings from pre-trained GeneCompass as input.
Here, we denoted a gene expression matrix with N cells and M genes as \(X\in {\mathbb{R}}^{N\times M}\), and the corresponding pre-trained gene embeddings as \(E_{g}\in {\mathbb{R}}^{M^{\prime} \times P}\), where \(M^{\prime}\) was the number of selected homologous genes and P the embedding dimension. The initial embedding (the 0-th layer) for each cell i was calculated as follows:
$$h_{c_{i}}^{(0)}=\sigma \left(W_{c}^{(0)}x_{{\mathrm{hom}}_{i}}+b_{c}^{(0)}\right),$$
where σ was the leaky ReLU activation function with a negative slope of 0.05, \(x_{{\mathrm{hom}}_{i}}\) represented the features derived from the selected homologous genes for cell i, and \(W_{c}^{(0)}\) and \(b_{c}^{(0)}\) were learnable weight and bias parameters. In GeneCompass+CAME, we used the pre-trained gene embedding \(h_{g_{j}}^{(0)}\) to initialize each gene j. We then aggregated the features of the corresponding neighboring cells to update the gene embedding:
$$h_{g_{j}}^{(1)}=\sigma \left(h_{g_{j}}^{(0)}+\sum _{i\in {\mathcal{N}}_{g_{j}}}\frac{1}{\left|{\mathcal{N}}_{g_{j}}\right|}W_{g}^{(1)}h_{c_{i}}^{(0)}+b_{g}^{(1)}\right),$$
where \({\mathcal{N}}_{g_{j}}\) was the set of cells that expressed gene j, and \(W_{g}^{(1)}\) and \(b_{g}^{(1)}\) were the learnable weight and bias parameters, respectively.
For the remaining layers, we used the same strategy as CAME: we updated the embeddings of each node under the guidance of the heterogeneous graph and used a graph attention layer as the cell-type classifier. Finally, we used the cross-entropy loss with a label smoothing mechanism50 as the training loss, optimized with Adam.51 In addition, we used the adjusted mutual information to select the best checkpoint, consistent with CAME. To evaluate cell-type annotation performance, we adopted accuracy and macro-F1 as metrics. Each dataset pair underwent 20 distinct experiments with different seeds, and the averaged results were used for comparison. Details of the cross-species datasets used in our study can be found in the Supplementary Methods.
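A small PyTorch sketch of the two initialization equations above, under the reconstructed notation (mean aggregation over expressing cells, leaky ReLU slope 0.05):

```python
# Sketch of GeneCompass+CAME node initialization: cell embeddings from
# homologous-gene features, gene embeddings updated by mean-aggregating
# the embeddings of cells that express each gene.
import torch
import torch.nn.functional as F

def init_cell(x_hom, W_c, b_c):
    # x_hom: (N, M') homologous-gene features; W_c: (M', D); b_c: (D,)
    return F.leaky_relu(x_hom @ W_c + b_c, negative_slope=0.05)

def update_gene(h_g0, h_c0, expr_mask, W_g, b_g):
    # h_g0: (M, D) pre-trained gene embeddings; h_c0: (N, D) cell embeddings;
    # expr_mask: (N, M) boolean, True where cell i expresses gene j.
    deg = expr_mask.sum(0).clamp(min=1).unsqueeze(1)      # |N_gj| per gene
    agg = (expr_mask.float().T @ (h_c0 @ W_g)) / deg      # mean over neighbors
    return F.leaky_relu(h_g0 + agg + b_g, negative_slope=0.05)
```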
GRN inference
This task involved predicting relationships among genes to gain insight into how genes work together to control cellular processes. We employed the DeepSEM framework32 as the baseline and tested whether gene embeddings from the pre-trained models (GeneCompass, Geneformer, and scGPT) could improve the GRNs output by DeepSEM. We applied each pre-trained model to the Immune Human dataset14 to obtain gene embeddings. With a tuning process similar to the above, we adjusted the learning rate and the number of training epochs, setting the learning rate to 1e–4, the batch size to 64, and the number of epochs to 40. We calculated the cosine similarity of gene embeddings as follows:
$$\cos \theta =\frac{E_{i}\cdot E_{j}}{\left\Vert E_{i}\right\Vert \left\Vert E_{j}\right\Vert }$$
where \(E_{i}\) and \(E_{j}\) were gene-embedding vectors and ||·|| denoted the Euclidean norm. We derived gene–gene relationships from cosine similarity by setting a threshold of 0.4, optimized over the range 0 to 1: if the cosine similarity was above the threshold, a relationship existed between the pair of genes. Then, we fed the log-transformed scRNA-seq expression data derived from the BEELINE framework,52 after Z-normalization, into DeepSEM. Moreover, we initialized the MLPs using "kaiming_uniform"4 and initialized \(W\) by setting the matrix diagonal to zeros and drawing the other entries from a Gaussian distribution \({\mathcal{N}}\left(\frac{1}{m-1},{\varepsilon }^{2}\right)\), in which m stands for the number of genes and ε denotes a small value to avoid being trapped in a local optimum. We used ChIP-seq data with variable TFs (Supplementary information, Table S5), generated from the BEELINE framework,52 as ground truth, and the diagonal values were fixed at zero throughout training to guarantee that W learned the regulatory network. Finally, we applied the gene relationships learned from the pre-trained models to update DeepSEM and then derived the final GRNs (Supplementary information, Fig. S6a).
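A minimal sketch of the thresholded cosine-similarity step that produces the prior gene–gene relationships fed into DeepSEM:

```python
# Sketch: pairwise cosine similarity over pre-trained gene embeddings,
# thresholded at 0.4 to yield a binary gene-gene relationship matrix.
import torch
import torch.nn.functional as F

def relation_matrix(gene_emb: torch.Tensor, threshold: float = 0.4) -> torch.Tensor:
    e = F.normalize(gene_emb, dim=1)    # row-normalize embeddings
    sim = e @ e.T                       # pairwise cosine similarity
    rel = (sim > threshold).float()
    rel.fill_diagonal_(0)               # no self-edges
    return rel
```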
Drug dose response prediction
This task concerned the relationship between drug dosage and the gene expression response. On the dataset provided by Srivatsan et al.,53 the data could be described as \({\mathcal{D}}={\left\{\left({\mathbf{x}}_{i},{\mathbf{d}}_{i},{\mathbf{c}}_{i}\right)\right\}}_{i=1}^{n}\), where each \({\mathbf{x}}_{i}\in {\mathbb{R}}^{G}\) described the expression of G genes in cell i; if \(d_{i,j}=0\), perturbation j was not applied to cell i. Unless stated otherwise, vectors were assumed to be column vectors. Similarly, the vector of vectors \({\mathbf{c}}_{i}=\left({\mathbf{c}}_{i,1},\ldots ,{\mathbf{c}}_{i,K}\right)\) comprised additional discrete covariates such as cell type or species, where each covariate was itself a vector; specifically, \({\mathbf{c}}_{i,k}\) was an \(m_{k}\)-dimensional one-hot vector. We loaded the pre-trained CPA model and applied it to predict the expression changes of a specific gene (e.g., MDM2) under different drugs and dosages in the following steps: (1) encoding the gene expression \({\mathbf{x}}_{i}\) into an estimated basal state \({\hat{\mathbf{z}}}_{i}^{\mathrm{basal}}\) that did not contain any information about \(\left({\mathbf{d}}_{i},{\mathbf{c}}_{i}\right)\); (2) combining \({\hat{\mathbf{z}}}_{i}^{\mathrm{basal}}\) with learnable embeddings of \(\left({\mathbf{d}}_{i},{\mathbf{c}}_{i}\right)\); and (3) supplying gene embeddings generated by the pre-trained models (GeneCompass, Geneformer, and scGPT) to the Compositional Perturbation Autoencoder (CPA)33 model (Supplementary information, Fig. S6b).
Gene expression profiling prediction
This task involved estimating gene expression levels under various biological conditions. We employed the DeepCE34 model, designed for phenotype-based compound screening, to assess the impact of gene embeddings generated by the pre-trained models (GeneCompass, Geneformer, and scGPT) on gene expression profile prediction. The framework consisted of three components: the first included a feature transformation component (GCN), a pre-trained network (the human protein–protein interaction network extracted from the STRING database52), and a feed-forward neural network; the second comprised an interaction network (multi-head attention); and the last included a prediction network (a two-layer feed-forward neural network with rectified linear unit activation).34 In this task, we set the learning rate to 2e–4, the batch size to 32, and the number of epochs to 120. In the first component, DeepCE captured features from the chemical compounds, L1000 genes,54 cells, and dosage, and concatenated them with the gene embeddings collected from each pre-trained model. High-level feature associations (between L1000 gene features and chemical compound features) were generated in the second component. In the last component, all features learned from the previous components were concatenated as high-level features. Finally, we predicted the drug-induced changes in gene expression profiles (Supplementary information, Fig. S6c).
Gene dosage sensitivity predictions
Distinguishing between dosage-sensitive and dosage-insensitive transcription factors is critical for interpreting CNVs in genetic diagnosis. Traditional approaches used conservation and allele frequencies to predict dosage sensitivity; however, these characteristics do not vary with cell state and cannot capture the specific tissues that would be influenced by dosage changes in a gene. Following the protocol of Geneformer,15 10,000 random single-cell transcriptomes were used to fine-tune GeneCompass to distinguish between dosage-sensitive and dosage-insensitive TFs.
In silico perturbation
GEARS learns gene embeddings from a gene co-expression knowledge graph; these embeddings are then combined with perturbation embeddings from a GO-derived knowledge graph to predict post-perturbation expression. In our study, we replaced the GEARS gene embedding with the GeneCompass gene embedding (Fig. 5a). We fine-tuned the model (epoch = 10) on the Norman36 dataset to predict gene expression under one- and two-gene perturbations. Since most genes do not exhibit significant variation between unperturbed and perturbed states, we used the MSE of the top 20 DEGs as the loss function for fine-tuning.
We evaluated the effect of GeneCompass on both one- and two-gene perturbations by holding out data of specific conditions during training. The mean expression changes of the experimental data were taken as the ground truth. We calculated the absolute difference between the predicted values and the ground truth for the top 20 genes with the most significant changes in the perturbation experiment, and summed these differences to define the top 20 DEGs deviation as follows:
$$\mathrm{Top}\,20\,\mathrm{DEGs}\,\mathrm{deviation}=\sum _{i=1}^{20}\left|{\hat{y}}_{i}-y_{i}\right|$$
where \({\hat{y}}_{i}\) was the predicted post-perturbation expression change for gene i, and \(y_{i}\) was the ground truth of the mean expression change.
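A direct NumPy transcription of this metric, assuming `truth` holds the mean experimental expression changes used to pick the top 20 genes:

```python
# Sketch of the top-20-DEG deviation: sum of absolute prediction errors
# over the 20 most-changed genes in the perturbation experiment.
import numpy as np

def top20_deg_deviation(pred: np.ndarray, truth: np.ndarray) -> float:
    top20 = np.argsort(-np.abs(truth))[:20]   # most-changed genes (ground truth)
    return float(np.abs(pred[top20] - truth[top20]).sum())
```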
In silico quantitative perturbation
This task was designed to simulate cell reprogramming and differentiation. The perturbation state was characterized by cell and gene embeddings. In silico quantitative knockout was performed by decreasing the target gene's expression value in the single-cell transcriptome; in silico overexpression was achieved by increasing the target gene's expression to a specific level. Both operations were performed as preprocessing before forwarding the single-cell transcriptome to GeneCompass. Notably, our method is capable of quantitative in silico overexpression or knockout by increasing or reducing gene expression to any value. Considering the batch effect between original and perturbed cells, random cell sampling was employed to construct gene pairs for perturbation analysis. In particular, for in silico knockout experiments, we retained only the original cells in which the target genes ranked in the top 50%, to guarantee the feasibility of the knockout.
After the manipulation of gene expression, the single-cell transcriptome was re-ranked based on the gene expression value. The post-perturbation cell embeddings were obtained from GeneCompass. The effect of in silico quantitative perturbation was measured by calculating the cosine similarity between the post-perturbation and ground-truth cell embeddings, which was defined as follows:
$$\mathrm{score}=\frac{1}{N}\sum _{i=1}^{N}\left(\cos \left(E_{i}^{\prime },E_{t}\right)-\cos \left(E_{i},E_{t}\right)\right)$$
where \(E_{i}\) and \(E_{i}^{\prime}\) denoted the source and in silico perturbed cell embeddings, \(E_{t}\) denoted the ground-truth cell embedding, and N was the number of sampled cells.
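A minimal PyTorch sketch of this score over N sampled cells:

```python
# Sketch: averaged gain in cosine similarity to the ground-truth cell
# embedding after in silico perturbation.
import torch
import torch.nn.functional as F

def perturbation_score(E_src, E_pert, E_target):
    # E_src, E_pert: (N, 768) source/perturbed cell embeddings;
    # E_target: (768,) ground-truth cell-state embedding.
    gain = F.cosine_similarity(E_pert, E_target.unsqueeze(0), dim=-1) \
         - F.cosine_similarity(E_src, E_target.unsqueeze(0), dim=-1)
    return gain.mean()
```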
Immunostaining of TF-overexpressing cells

Reprogramming of hESCs and cultivation of gonadal progenitor-like cells
hESCs were cultured on Matrigel-coated dishes in mTeSR Plus (Stemcell Technologies #100-0276). Y-27632 was removed 24 h after passaging with ReLeSR™ (Stemcell Technologies #100-0483), and lentiviral infection was performed after an additional 24 h. All overexpressed genes, driven by the EF1α promoter, were constructed on lentiviral vectors purchased from VectorBuilder Inc. After overnight infection, fresh mTeSR1 medium (Stemcell Technologies #85850) was replaced daily. After 72 h, Minimum Essential Medium alpha (α-MEM, Invitrogen) containing 10% KSR (Gibco), 55 µM 2-mercaptoethanol (Gibco), and 100 U/mL penicillin/streptomycin (Gibco) was used for long-term cultivation. Other factors such as hSCF and GDNF were added as necessary.
Cells stably overexpressing TFs were seeded on Matrigel-coated chamber slides (ThermoFisher Scientific #154534pk), and immunofluorescence was performed on day 3 after plating. Slides of cells at appropriate confluence were fixed with 4% PFA for 10 min at room temperature and washed with PBS three times, 5 min each. TBST containing 10% normal serum from the secondary antibody host was then used to block nonspecific antigens at room temperature for 1 h. Following the manufacturer's instructions, the primary antibody was diluted to the appropriate ratio in cell staining buffer (BioLegend #420201), and the cells were incubated at room temperature for 1 h. After three 5-min PBS washes, the secondary antibody was diluted 1:500 in cell staining buffer, and the cells were incubated at room temperature for 1 h. The cells were then washed with PBS three times, 5 min each, and mounted with an antifade mounting medium containing DAPI (Beyotime #P0131). Images were acquired on a Leica Stellaris, and the saved LIF files were converted to TIF format for display with Leica Application Suite X.
Of note, both homologous and non-homologous genes in the pre-training and the full panel of gene-level tasks were encoded into the models for a fair comparison of GeneCompass with other models.