Associations between HLA-II variation and antibody specificity are predicted by antigen properties

Cohort information

This study leveraged two large cohorts from Netherlands, Lifelines (LLD) and 1000IBD. Lifelines is a prospective population study with data gathered from 167,729 individuals, including biomedical, behavioral, and psychological factors, while 1000IBD is a prospective observational cohort from Netherlands aiming to clinically characterize IBD patients [16, 43]. Analysis carried out in the present study was performed on a selected subset of samples consisting of 1443 LLD samples and 497 IBD samples. Guidelines of the Declaration of Helsinki (2013) were followed to conduct this study. All patients provided a written informed consent prior participation.

Phage immunoprecipitation sequencing

Antibody epitope repertoire and GWAS data of two cohorts from the Netherlands (LLD cohort, n = 1443 and IBD cohort n = 497 detailed in [16, 43]) was combined for the analyses in this manuscript. PhIP-Seq [18] libraries covering 344,000 antigens in total were applied by combining a 244,000 variant library primarily focused on microbiota [23] and a 100,000 variant library covering various dietary/environmental antigens and infectious diseases [26] (including the complete set of B cell epitopes of from the IEDB [46]) with detailed experimental procedures outlined previously (protein A/G pull down detecting primarily IgG from blood [16, 43]).

Enriched peptides profiles were obtained as previously done [23]. PhIP-Seq raw reads were mapped on the original oligo-peptides library (originally 344,000 variants) and normalized by comparing the number of reads per oligo to that of the input coverage, which consists of the sequencing of phage library prior immunoprecipitation. The p-values were calculated by assuming that each input level creates an output level distribution which is a generalized Poisson distribution (null distribution). The 2 parameters for the given distribution were computed at each null distribution for each sample independently using maximum likelihood method, or directly by interpolation. The resulting p-values were corrected using Bonferroni to define enrichment of each peptide per each sample. For the present study, we focused on the peptides which came from the microbiota library (n = 244,000).

SNP genotyping data

Regarding genetic data, SNP arrays were used to deeply genotype every single individual of the LLD cohort and a total of 4,546,708 SNPs were found after Minor Frequency Allele (MAF) cutoffs [41, 42]. For the IBD cohort, Gene Set Analysis (GSA) genotype data and Whole Exome Sequencing (WES) were combined as previously outlined [88] yielding, after filtering, a total of 5,386,664 SNPs. Altogether, this enabled a Genome-Wide Association Study (GWAS) between individual genes at a nucleotide resolution level and antibody reactive peptides that were found in 5 to 95% of the population (n = 2815 peptides). This cutoff range was selected to limit the multiple hypothesis testing burden as antigens rarely bound lack statistical power to detect genetic associations, and others nearly universally bound lack variation to find genetic associations [16]. Specifically, association between HLA and peptide analysis was performed using linear models in 1175 individuals, controlling for age, sex, PhIP-Seq run plate, and disease subtypes (Crohn’s disease/ulcerative colitis, only specific to IBD cohort). Additionally, genomic imputations were obtained as previously described [16]. In summary, the region containing MHC genes was extracted from chromosome 6, and SNP2HLA (v2) was used to impute HLA alleles, polymorphic amino acids, SNP variants, and indels (the Type 1 Diabetes Genetics Consortium (T1DGC) HLA Reference Panel [89] and website at (https://doi.org/https://doi.org/10.58020/63vn-7d44). For the GWAS data and genomic imputations, summary statistics based on previously generated data were used (Supplementary tables 2.2 and 2.3 of ref. [16]). Additionally, we downsized the peptides used in the present study to keep the ones coming only from the microbiota library, obtaining a total of 1835 peptides in the 5–95% interval as previously mentioned.

Regarding the pooling of data from the LLD and IBD cohorts, healthy vs. IBD status may impact the results (as IBD patients harbor altered microbiota composition, antibody responses, HLA genotypes [90]). However, towards the interplay between antibody repertoires and genetic associations, we have already compared whether there are any differences between IBD patients and healthy controls and did not find significant effects [43]. Hence, any minor bias caused by non-significant differences of including IBD patients, should be outweighed by the larger overall sample number and increased statistical power. Increasing the cohort size may however reveal differences and would then warrant revisiting the analyses presented in this manuscript. This study was performed on a Dutch population, and biases relating to genetic background, different exposures to infectious diseases, vaccination programs, etc. apply.

Antigen annotation from existing libraries

To test for effects of different functional groups of antigens on associations between antibody specificity and HLA-II variation, we exploited information that we had previously generated on the peptide libraries [23, 26] used for the GWAS study [16]. The total composition of the library included 344,000 peptides selected according to rigid criteria from many databases, accounting for high microorganism diversity across microbiota, pathogens, and viruses (244,000 peptides [23]), and also dietary and environmental antigens from allergen databases (100,000 [16, 26]). The analyses in this manuscript are primarily focused on the 244,000 variant library, as the antigen groups included were enriched for bacteria [23] rather than more diverse origins [26] allowing for in-depth studies of secreted and membrane-bound proteins, etc.. The following tools had initially [23] been used for annotating protein functions: DIAMOND [91] for annotating toxins and flagellum proteins based on UniProt reference sequences [23], SignalP 4.0 [92] for secreted proteins (using signal peptide motif recognition) and TopGraph [93] for membrane protein prediction using hydrophobicity scores on aminoacidic sequence motifs.

For the analysis shown in this manuscript, we firstly grouped peptides according to their initial annotations from ref. [23] (see therein for details), in short: (a) Controls: These peptides were divided in positive, negative, and random controls from the entire library, (b) Immune Epitope Database (IEDB)—infectious diseases: Proteins annotated as infectious diseases in the IEDB [46], (c) Autoantigens: Primarily proteins labeled as human autoreactive antigens in IEDB,(d) Gut pathogens: Collection of peptides deriving from species typically indicated as pathogens in gut inflammatory response [23], (e) Probiotic bacteria: Peptides which are originating from taxa commonly used in probiotics [94], (f) Antibody-coated strains from [95] (see ref. [23] for details on the selection), (g) Microbiota genes and strains: These proteins were retrieved from metagenomic sequencing of 953 samples [44], (h) Virulence Factor Database: This database contains a various toxins and pathogenicity factors of bacterial pathogens, (i) EBV: Peptides of Epstein Barr virus.

Reannotation and domain-based functional prediction of peptides

Briefly, when preparing this antigen library in 2017–2018 [23], we had selected membrane/intracellular/secreted proteins from gut microbiota using computational tools (SignalP 4.0 [92], TopGraph [93]). We had also used DIAMOND [91] to mine for additional toxins and flagellins starting from well-annotated reference sets [23]. In principle, these functional protein annotations would be ideal to validate our hypothesis, yet these algorithms were only run on about two thirds of the antigen library (as the library design also included cases where we incorporated entire databases such as VFDB and IEDB without any filtering/annotations). Therefore these previous functional analyses had limited relevance, and we improved and unified them by using the newest version of the respective tools (e.g., SignalP 6.0 [96]) as well as running them on the complete library (Fig. 4b, Additional file 1: Table S 7).

In detail, the information previously generated on bacterial proteins within our antigen libraries was already covering many functional aspects, based both on literature/database annotations and prediction tools ( [23] and section above). However, when the peptide library had initially been generated, the annotations were in part only run on a reduced number of peptides (or some groups [for example, as we wanted to include the complete VFDB, we did not run predictions on membrane or secreted proteins of this antigen group]) and not on the entire library. Furthermore, these annotations were generated in 2017–2018 with now outdated versions of the prediction tools (or at the time smaller training sets and databases). To improve the functional annotations with higher quality (as well as to avoid/correct mis-annotations), we ran new predictions on the entire library.

Firstly, we annotated all secreted proteins in the entire library. To this end, we used SignalP 6.0 [96] to predict if each protein contained signal peptides, necessary for the protein secretion x, by applying a deep learning model on the aminoacidic sequence. We then extracted the peptides that were predicted to originate from proteins containing a signal sequence.

For predicting membrane proteins, we ran s4pred [97] on the whole protein library, in order to get the hydrophobicity score for each protein considering their aminoacidic sequence, and we then imputed the trans-membrane domains using TopGraph [93]. In order to maintain a consistent approach, we did not discard the old annotations, and we combined new predictions with the old ones. Since these old annotations were used to select part of the antigens present in the PhIP-Seq library, they also covered the information regarding the intracellular or extracellular nature of each of the peptides selected. We made use of these old annotations as a control to assess whether the localization of peptides in the membrane could to some extent bias our analysis (Additional file 1: Fig. S 3c).

Regarding the annotation of possible toxins, we also aimed to improve annotations; we downloaded the whole non-redundant entries from the Uniprot database (June 2023) by searching “toxin” and using the curated Swissprot entries [98]. Then we ran two different algorithms: DIAMOND v2.1.6.160 [91] and MMseqs2 v13.45111 [99]. These aligners—DIAMOND based on indexed and masked sequences multi-alignment, and MMseqs2 on multi k-mer match and fast alignment—were used to map our whole protein library to the toxin database. Peptides with significant E-values in both results (intersection) were then filtered and added to the binary (presence-absence) matrix for our data analysis.

Finally, for flagellum proteins, we recollected all the flagellum annotations that we already had generated ( [23]), using many sources including DIAMOND (run on Uniprot flagellum proteins), Uniref annotations, IEDB annotations, and VFDB annotations.

According to their functional profile, we divided the peptides in additional groups, specifically “Membrane proteins,” “Secreted proteins,” “Toxins,” and “Flagellum proteins,” and as done for the category groups.

A final consideration regards viruses, which are also part of the tested PhIP-seq library. While there are viruses with larger genomes and several hundreds to thousands of genes, most severely pathogenic viruses in humans have a few dozens of genes (e.g., EBV ~ 80), while bacteria have thousands of genes. Also, not all those viral gene products are involved in the making up of the virion, and rather just expressed and translated in infected cells. Thus, the number of proteins making up a virion is for EBV about 30, about hundred times less than even a small bacterial genome. Moreover, oligomerized surface protein complexes are constituted thereof with a few proteins or even homo-oligomers composed only by a single protein. In this work, we tested only EBV, considering its proteins as “small modules” and we generally did not perform specific predictions on other viruses.

Statistical analysis

For each of the functional antigen categories outlined above, we leveraged the data obtained from the association study between genomic SNPs and antibody-bound peptides and we built a statistical analysis workflow to test for potential differences. We assessed the distribution of antibody-bound peptides (grouped by functional categories) for associations with each single HLA-II allele, to test if there was a binding “preference” towards specific groups and how this preference changes between different alleles. We then computed statistical tests for each allele “group composition” compared to the expected frequency using a chi-squared test (Benjamini–Hochberg and Bonferroni corrected, Additional file 1: Table S 1).

For testing if there was a relationship between general protein properties and significant associations with HLA-II loci, peptides from the full set of microbiota library (n = 244,000) were analyzed in three mutually exclusive categories: HLA-II associated peptides (n = 319), GWAS tested peptides (which were antibody bound in 5 to 95% of individuals in either cohort (n = 1516), and the entire peptide library (including all those peptides not tested in the GWAS study (n = 242,165). For defining the HLA-II associated peptides, we used data from the genetic imputations (made as part of the associations study), which not only gave us the presence or absence of the association between a given peptide, but also linked it to its corresponding mutation type in the HLA locus, such as a SNP, insertion, an aminoacidic change, and HLA isoform change (Additional file 1: Fig. S 2) [16]. For each antibody-bound peptide-GWAS association, we selected the most significant association to avoid redundancy of peptides. We used these imputations also to increase the number of peptides detected compared to the ones over the study-wide threshold in the GWAS. For the GWAS-associated category, we picked the remaining peptides that were used in the associations study and did not result in a significant HLA-II association. Finally, for taking in account the original library composition, we also included all the peptides that were not included in the GWAS and thus were not detected in 5 to 95% of the population. For each source category, we applied a pairwise comparison between these three peptides classes using two-sided Fisher’s exact test with Benjamini–Hochberg correction for controlling false discovery rate (Fig. 2, Fig. 4c–f, Additional file 1: Table S 2).

Comparisons using taxonomy information

We aimed to test if the results observed using categories (Fig. 2, Additional file 1: Table S 2) would also extend to the taxonomic level for the peptides used in the analysis (Methods). Using the taxonomy annotations for each peptide, we binned them to their reference organism name for each category group. We excluded all peptides which taxonomic resolution did not reach the genus level, and we calculated the prevalence in percentage for each genus (relative to its category). Moreover, we grouped all genera which total prevalence in their category was below 2% in an extra group called “others,” and kept it for the statistical analysis. We executed chi-squared tests both for checking classes differences (HLA associated, GWAS tested, and entire library) using genera distributions versus expected frequencies (Additional file 1: Table S 4), and also for checking single genera differences across the three classes against the expected distributions, in order to highlight the species that mostly changed (Additional file 1: Table S 5). Finally, we also checked pairwise comparisons between classes, identifying the bacteria which significantly were enriched (or under-represented) in a specific class compared to another one using Fisher’s exact tests (Additional file 3: Table S 6). P-values were adjusted for multi hypothesis testing (Benjamini–Hochberg and Bonferroni methods).

Machine learning using groups as predictors

We applied machine learning using peptide sources and peptide functional groups as feature predictors of the model, and the categories (HLA associated, GWAS tested, entire library) as the classes or labels to predict.

A random forest model was chosen for its high reliability in dealing with data overfitting, capturing non-linear relationships between features and outcomes, but also for providing a simple way to compute scores for feature importance. Firstly, a 2-class random forest classifier for each pair of classes (GWAS tested vs HLA associated, HLA associated vs entire library, and GWAS tested vs entire library). Since the classes were unbalanced, we down sampled them to obtain an equal number of peptides for each, and specifically to match the total number of peptides contained in the HLA-associated ones (319). Then, we randomly selected 220 peptides for each class for the training set and left the remaining 99 for the test set. To avoid a selection bias due to the strong down sampling for the “GWAS tested” and “Entire library” classes, we repeated the prediction 100 times, applying for each a new random sampling and averaging the resulting scores (100-fold iterative random sampling). Feature importance was computed using the Gini index as a metric for obtaining the mean decrease in impurity of the final tree (Gini index-based method [100]). The random forest model was run using 100 decisional trees and a default setting in the sklearn v1.1.1 library in Python (v.3.9).

Furthermore, we tested if peptides belonging to a specific class and all the other peptides could be separated. For achieving this, we ran another classification setup (3-class model) using the one versus rest method implemented in sklearn.multiclass model. In synthesis, we firstly labelled each peptide according to its group, and then we trained the model to classify them to one class or to the remaining two classes together (called “rest”). This analysis was performed to check how an agnostic model could correctly classify each peptide to its class making use of the functional annotations also tested for the statistical analysis, taking in consideration the 3 group classes in the same model. The setting used for establishing the training set, test set, down sampling, and other data transformations was the same that we explained for the 2-class model.

Ultimately, to validate these models, we replicated our findings with a regression approach as well. Specifically, we performed logistic regression using the same data, training and test set sizes, and iterative random resampling strategy applied for the random forest. To avoid overfitting, we applied lasso regularization method (Additional file 1: Fig. S 12).

All the code for the analyses is available in the following GitHub repository: (https://github.com/Vogl-Lab-Research/HLA-II_AntibodySpecificityAssociations/).

Full-length protein ELISA

To validate antibody binding of full-length proteins containing HLA-associated peptides as shown in the PhIP-Seq results, we selected six commercially available proteins which contain HLA-associated peptides from our PhIP-Seq library for analysis in a protein ELISA (results are shown in S 18). We included the viral protein EBV nuclear antigen 1 (Abcam limited, cat.no. ab138345, corresponding HLA-associated PhIP-Seq library peptide: #321, #3328, #11,813), Streptococcus pyogenes Streptolysin O (Abcam limited, cat.no. ab126650, corresponding HLA-associated PhIP-Seq library peptide: #235,068), Escherichia coli Murein hydrolase activator NlpD (Abcam limited, cat.no. ab225594, corresponding HLA-associated PhIP-Seq library peptide: #90,858), Staphylococcus aureus Clumping factor B (CUSABIO Technology LLC, cat. no. CSB-EP753442SKV, corresponding HLA-associated PhIP-Seq library peptide: #6129, #7803, #233,907, #237,594), and S. aurues Protein A (Abcam limited, cat.no. ab7399, corresponding HLA-associated PhIP-Seq library peptide: #60,579).

As expression and purification of proteins is costly and time intensive, commercially available proteins containing HLA-associated PhIP-peptide sequences were selected to verify the antibody binding. The protein ELISA was performed according to the manufacturer’s guidelines with the recommended concentrations (Protocols Abcam Indirect and Direct ELISA). In short, microtiter plates (Thermo Scientific Immulon 2 HB plates, cat. no. 10795026) were coated with 2 µg/ml protein solutions and incubated with serum samples (diluted 1:1000-fold, 1:105-fold or 1:3 × 106 fold). Antibody binding was detected with a horseradish peroxidase-conjugated anti-human IgG antibody (Southern Biotech, cat. no. 204205) and 3,3′,5,5′-tetramethylbenzidine (TMB) as substrate. Sera of 480 individuals (for whom PhIP-Seq data were also available) were tested with one of the six proteins in batches of 80 individuals per protein. To compare the ELISA results with the PhIP-seq data, the highest PhIP-seq detected fold-change of any protein-contained PhIP-peptide was paired with the blank-normalized absorption at 450 nm of the corresponding protein for each individual followed by the Spearman correlation.

Binding affinity prediction

We leveraged NetMHCIIpan 4.1 software for predicting binding affinity of Major Histocompatibility Complex of class II to demonstrate that associated alleles tend to bind full proteins which contain HLA-II-associated peptides with higher binding affinity. NetMHCIIpan v.4.1 calculates an eluted ligand score (EL) as a proxy of binding affinity on 15aa subsequences, running a sliding window which scans the whole protein sequence. The EL is then ranked (EL % Ranked score, see [101]) which allows comparisons between different proteins and alleles binding predictions. By default, general binders are identified by a score ≤ 5, and strong binders by a score ≤ 1. Predictions were run with three different settings.

Firstly, we selected a reduced set on full proteins (n = 6, same tested in the ELISA) containing HLA-II associated peptides and, when available, we retrieved the HLA alleles from HLA-II allele imputations (Andreu-Sanchez et al., 2023). Next, we ran NetMHCIIpan v.4.1 and compared EL ranks of the selected proteins against the HLA alleles they were found to be associated with. As a comparison, we ran the same proteins against all the combinations of alleles with which they were not associated, as schematically shown in Additional file 1: Fig. S 9a. The distributions of both sets EL % Ranked scores for each set, and we statistically compared the distributions of binders for both sets (EL % Ranked scores ≤ 5).

As a second test, we selected all possible proteins which contained at least one HLA-II-associated peptide (n = 84), retrieved all the associated alleles (n = 60), and tested the binding between them pairwise. On the other side, we randomly picked from non-HLA associated set (class “GWAS tested”) the same number of proteins (n = 84) which did not contain any associated peptide and then tested the binding affinity on the same alleles (n = 60) that were selected for the HLA-II-associated set. We then compared the Ranked EL score distributions to assess how the binding affinity shifted between proteins HLA associated and proteins not HLA associated (Additional file 1: Fig. S 9b).

Finally, we repeated the test with this latter set of proteins, but we specifically tested the non-HLA-associated proteins against all possible combinations with the 60 associated alleles, aiming to reduce possible biases due to the random selection of certain proteins given the 60 fixed alleles (Additional file 1: Fig. S 9c).

Comments (0)

No login
gif