The data for this study consist of SNP-array genotyping data from 121 073 Helsinki Biobank samples, produced in the FinnGen project (www.finngen.fi/en). Genotyping of blood-derived DNA was performed using the FinnGen ThermoFisher Axiom custom array v1 (N = 34 504; 28% of the samples) and v2 (N = 86 569; 72% of the samples), as described elsewhere [6]. According to the chip manifests, the v1 and v2 arrays contain 760 904 and 723 375 unique probe sets, respectively, of which 708 552 are shared between the arrays. The protocol for this study was approved by the Helsinki University Hospital (HUS) Regional Committee on Medical Research Ethics (HUS/428/2024).
Array data preparation
MLH1 exon 16 deletion, NM_000249.4 (MLH1):c.1731+2247_1897–402del (ClinVar variation ID 1332889), is a 3 538 base pair (bp) in-frame deletion at chromosomal location 3:37044575–37048112 (GRCh38). Both FinnGen genotyping arrays (v1 and v2) contain a total of 21 probe sets for 18 unique loci in the deletion region. In addition to these, probe sets for 50 loci flanking both sides of the deletion region were included in the analysis. To allow combining the data from both array types, only overlapping probe sets present on both arrays were considered in the flanking regions.
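The locus selection described above can be illustrated with a minimal Python sketch. The deletion coordinates are taken from the text; the manifest dictionaries, the function names and the `n_flank` parameter are hypothetical. The sketch assumes that the flanking loci are counted per side and that probe sets inside the deletion are taken from either array, neither of which is stated explicitly here.

```python
# Deletion coordinates from the text (GRCh38, chromosome 3, inclusive).
DEL_START, DEL_END = 37_044_575, 37_048_112


def deletion_length(start, end):
    """Length in base pairs of an inclusive coordinate interval."""
    return end - start + 1


def select_loci(manifest_v1, manifest_v2, n_flank):
    """Split loci into upstream flank, deletion region and downstream flank.

    manifest_v1 / manifest_v2: hypothetical dicts mapping a probe set ID to
    its position on chromosome 3. Inside the deletion, probe sets from either
    array are kept (assumption); in the flanks, only probe sets present on
    both arrays are kept, as described in the text.
    """
    merged = {**manifest_v1, **manifest_v2}
    shared_ids = manifest_v1.keys() & manifest_v2.keys()

    inside = sorted({p for p in merged.values() if DEL_START <= p <= DEL_END})
    shared_pos = sorted({manifest_v1[i] for i in shared_ids})

    before = [p for p in shared_pos if p < DEL_START]
    after = [p for p in shared_pos if p > DEL_END]
    # Keep the n_flank loci closest to the deletion on each side (assumption).
    upstream = before[max(len(before) - n_flank, 0):]
    downstream = after[:n_flank]
    return upstream, inside, downstream
```

The interval arithmetic reproduces the stated deletion size: `deletion_length(DEL_START, DEL_END)` gives 3538.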
The intensity values for both alleles of the selected probe sets were extracted from the raw array data CEL files with Analysis Power Tools (APT) Release 2.12.0 (Thermo Fisher Scientific). The appropriate tools from the software package were run without variant calling steps to only extract intensity values for all the samples. The software performed default artefact removal, probe summarization and normalization steps for the extracted intensities.
Analysis of MLH1 exon 16 deletion
For each sample, the sum of the intensities of both alleles of all probe sets interrogating the same locus was calculated to represent the total chromosomal signal intensity at that position. These locus-wise summed values were further quantile normalized with respect to the standard normal distribution over all samples.
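The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the quantile normalization is assumed to be a rank-based mapping to standard normal scores applied to each locus across samples, with average ranks for ties and the quantile (r - 0.5) / n. All function and variable names are hypothetical.

```python
from statistics import NormalDist


def locus_sums(probe_intensities, probe_to_locus):
    """Sum both allele intensities of all probe sets at each locus.

    probe_intensities: dict probe set ID -> (allele_a, allele_b), one sample.
    probe_to_locus: dict probe set ID -> locus identifier.
    """
    totals = {}
    for probe, (a, b) in probe_intensities.items():
        locus = probe_to_locus[probe]
        totals[locus] = totals.get(locus, 0.0) + a + b
    return totals


def normal_scores(values):
    """Map one locus's summed values (across samples) to N(0, 1) quantiles.

    Assumption: rank r out of n is mapped to the standard normal quantile
    at (r - 0.5) / n; tied values share their average rank.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]
```

Because the mapping depends only on ranks, it is insensitive to intensity scale differences between loci, which is one plausible reason for normalizing each locus over all samples.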
To identify the samples with MLH1∆Ex16, two features were used for cluster analysis: the difference between the median intensities of the deletion and flanking regions, and the median absolute deviation (MAD) of the intensity values. MAD was calculated in a piecewise manner over the deletion and flanking regions. As a robust measure of dispersion, the MAD feature was included to help distinguish noisy, unclassifiable samples characterized by large within-sample intensity variance.
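A minimal sketch of the two features for a single sample is given below. It rests on assumptions that the text does not confirm: the two flanks are pooled into one region, the difference is taken as deletion minus flank, and "piecewise" is read as centring each region on its own median before pooling the absolute deviations. Under that reading, the intensity step in a true carrier would not inflate the dispersion estimate. Function names are hypothetical.

```python
from statistics import median


def piecewise_mad(regions):
    """MAD with each region's values centred on that region's own median."""
    deviations = []
    for values in regions:
        centre = median(values)
        deviations.extend(abs(v - centre) for v in values)
    return median(deviations)


def sample_features(upstream, deletion, downstream):
    """Return (median difference, piecewise MAD) for one sample.

    Each argument is a list of normalized locus intensities for that region.
    """
    flank = list(upstream) + list(downstream)
    diff = median(deletion) - median(flank)  # negative if signal is reduced
    mad = piecewise_mad([deletion, flank])
    return diff, mad
```

For example, `sample_features([0.1, -0.1], [-2.0, -2.2, -1.8], [0.0, 0.2])` yields a clearly negative difference together with a small MAD, the pattern expected of a clean sample with reduced signal in the deletion region.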
Given the unbalanced cluster sizes caused by the rarity of the deletion, typical clustering algorithms tended to perform inconsistently and required extensive parameter tuning. Ideally, additional datasets would be used to validate the choice of algorithm and its parameters. In the absence of such datasets, simple thresholding rules based on visual inspection of the cluster plot were used to identify samples with suspected deletions.
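A thresholding rule of this kind can be sketched as below. The cut-off values are placeholders chosen for illustration only, since the thresholds used in the study are not given here, and the labels and function name are hypothetical. The sketch also assumes that a deletion shows up as a lower median intensity in the deletion region than in the flanks.

```python
# Placeholder cut-offs for illustration only; not the values used in the study.
DIFF_MAX = -1.0  # hypothetical: deletion region clearly below the flanks
MAD_MAX = 0.5    # hypothetical: above this, a sample is treated as too noisy


def classify(diff, mad, diff_max=DIFF_MAX, mad_max=MAD_MAX):
    """Label one sample from its two features using fixed thresholds."""
    if mad > mad_max:
        return "unclassifiable"
    if diff <= diff_max:
        return "suspected deletion"
    return "no deletion"
```

Checking the noise criterion first means that a noisy sample is never reported as a suspected carrier, whatever its median difference.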
Confirmation of variant carriers, electronic health record (EHR) review and sensitivity assessment
To confirm the results, a detailed manual EHR review of the identified putative MLH1∆Ex16 carriers was conducted to assess whether the deletion had previously been identified by diagnostic testing in health care. For sample donors with a suspected deletion but no mention of the deletion variant, MLH1, LS or HNPCC in the EHR, the variant status was validated from existing HBB DNA samples with a polymerase chain reaction (PCR) assay in the accredited genetic diagnostic laboratory of HUS Diagnostic Center. The demographics of the identified MLH1∆Ex16 carriers were collected from the EHR, and personal history of cancer was determined using EHR-extracted International Classification of Diseases (ICD)-10 codes.
The sensitivity of the method was assessed by searching the hospital EHR database for diagnostically determined MLH1∆Ex16 carriers among all individuals for whom genotyping information was available. The search covered unstructured free-text medical notes and statements. Text snippets with short context that included the MLH1 gene name with common spelling variants were extracted from the EHR database and manually reviewed to identify diagnosed MLH1∆Ex16 carriers and individuals confirmed negative for the deletion.
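The snippet extraction and the sensitivity estimate can be illustrated with the following sketch. The regular expression, the set of spelling variants it covers ("MLH1", "MLH-1", "MLH 1", "hMLH1", in any letter case), the context width and the function names are all assumptions made for illustration; the actual search terms are not listed here.

```python
import re

# Hypothetical spelling variants: MLH1, MLH-1, MLH 1, hMLH1 (case-insensitive).
MLH1_PATTERN = re.compile(r"\bh?MLH[\s-]?1\b", re.IGNORECASE)


def extract_snippets(note, context=40):
    """Return each gene-name mention with `context` characters on both sides."""
    snippets = []
    for match in MLH1_PATTERN.finditer(note):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(note))
        snippets.append(note[start:end])
    return snippets


def sensitivity(true_positives, false_negatives):
    """Fraction of diagnosed carriers that the array-based method detected."""
    return true_positives / (true_positives + false_negatives)
```

Here `true_positives` would be the diagnosed carriers also flagged by the array analysis and `false_negatives` the diagnosed carriers it missed. The word boundaries in the pattern keep strings such as "MLH12" from matching.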