Integrated multiomics analysis and machine learning refine molecular subtypes and prognosis for thyroid cancer

2.1 Dataset source and data pre-processing

Multi-omics data for THCA patients were obtained from The Cancer Genome Atlas (TCGA-THCA) via the TCGAbiolinks R package [15], including whole transcriptome expression (mRNA, lncRNA), DNA methylation, somatic mutations, and clinical information. Mature microRNA (miRNA) expression profiles were identified using the miRBaseVersions.db package. DNA methylation and clinical annotations were retrieved from UCSC Xena (https://xena.ucsc.edu/), and mutation data were processed using the maftools package.

Patient inclusion criteria were as follows:

Confirmed THCA diagnosis.

Complete clinical information.

Availability of all multi-omics profiles (mRNA, miRNA, lncRNA, methylation, and mutation).

Overall survival (OS) or progression-free survival (PFS) > 30 days.

Patients failing to meet any of these criteria were excluded. Missing expression or methylation values (< 5% missingness per feature) were imputed using K-nearest neighbors (KNN) imputation. Expression data were normalized using log2(TPM + 1) transformation. For methylation, β-values were used and quantile normalization was applied across samples. All omics matrices were scaled (z-score standardization) prior to integration.

To validate the immunotherapy predictive performance, external datasets (GSE91061, GSE78220, GSE135222 from GEO, and IMvigor210 from the IMvigor210CoreBiologies R package) were incorporated [16]. Only patients with available PFS and treatment outcome annotations were included.

2.2 Multiomics consensus analysis

We employed the MOVICS R package [17] to conduct integrative multiomics clustering and feature selection. For transcriptomic data (mRNA, lncRNA, and miRNA), the top 1,000 genes with the highest median absolute deviation (MAD) were selected to capture the most variable expression features across samples. To identify features associated with patient survival, univariate Cox proportional hazards regression was applied, and genes with significant associations (p < 0.05) were retained. For somatic mutation data, the top 5% most frequently mutated genes were selected using the freq method within MOVICS.

Prior to integration, each omics layer (mRNA, lncRNA, miRNA, DNA methylation, and mutation) was independently normalized via z-score transformation to account for differences in measurement scales. All selected features were concatenated into a unified matrix (samples × features), which served as input for downstream clustering.

Multiomics integration and clustering were performed using ten state-of-the-art algorithms provided in MOVICS: CIMLR, ConsensusClustering, Similarity Network Fusion (SNF), iClusterBayes, PINSPlus, moCluster, NEMO, Integrative Non-negative Matrix Factorization (IntNMF), Cluster-Of-Clusters Analysis (COCA), and Latent Representation Analysis (LRA). To determine the optimal number of clusters (k), we utilized the getClustNum and getMOIC functions. These functions evaluate clustering quality based on multiple stability metrics, including the cluster prediction index (CPI) and silhouette width. The CPI is a resampling-based metric that quantifies the reproducibility of clustering by assessing the consistency of sample assignments across multiple iterations. A higher CPI value indicates more stable and robust clustering. The optimal cluster number was defined as the local maximum of the CPI curve.

Finally, consensus clustering was conducted by aggregating subtype labels across the ten algorithms through majority voting, resulting in a robust molecular classification of THCA into three consensus subtypes (CS1–CS3).

2.3 Characterization of subtypes and functional annotation

To evaluate subtype-specific molecular characteristics, we performed gene set variation analysis (GSVA) [18] using hallmark oncogenic and immune-related pathways. Tumor immune and stromal infiltration scores were estimated using the ESTIMATE and IOBR packages, while immune checkpoint expression and tumor-infiltrating lymphocyte signatures were compared across subtypes. Tumor methylation-inferred lymphocyte (MeTIL) scores were computed using established methods. Transcriptional regulatory networks were reconstructed using the RTN package, focusing on 23 cancer-related transcription factors and 35 chromatin-modifying regulators [19]. To evaluate subtype reproducibility, we selected the top 100 most distinctive genes from each cluster and assessed classification consistency between PAM (Partitioning Around Medoid) and NTP (Nearest Template Prediction) classifiers.

2.4 Development of the machine Learning-Based prognostic signature (CMLS)

Subtype-specific genes were further filtered by univariate Cox regression in the training cohort. A total of ten machine learning algorithms were used for prognostic signature construction: random survival forest (RSF), LASSO, Ridge, elastic net (Enet), CoxBoost, generalized boosted regression modeling (GBM), survival support vector machine (survival-SVM), supervised principal components (SuperPC), partial least squares regression for Cox (plsRcox), and stepwise Cox regression.

For each method, models with fewer than two genes were excluded. The average concordance index (C-index) across five-fold cross-validation was computed for each algorithm. The best-performing model (highest average C-index) was selected to define the Consensus Machine Learning-Driven Signature (CMLS). The final score for each patient was calculated as a linear combination of gene expression weighted by model-derived coefficients:

$$}_ = \sum\limits_}^ } \cdot x_}$$

where βj is the weight of gene j and xij is its normalized expression in patient i.

2.5 Evaluation of prognostic and therapeutic value

Patients were stratified into high- and low-CMLS groups based on the optimal cutoff identified by the survminer package. Prognostic value was assessed using Kaplan–Meier survival analysis, log-rank test, and multivariate Cox regression adjusting for clinical covariates. To evaluate robustness, the performance of CMLS was compared against 22 previously reported signatures using the C-index.

Immunotherapy response was predicted using multiple approaches. Tumor microenvironment profiles were characterized using the IOBR package with published immune-related gene sets [20]. Tumor mutational burden (TMB), neoantigen burden (TNB), and M1 macrophage levels were compared across CMLS groups. Immune checkpoint blockade response was predicted using TIDE (http://tide.dfci.harvard.edu), SubMap (https://www.genepattern.org), and the TIP pipeline (http://biocc.hrbmu.edu.cn/TIP/).

2.6 Drug sensitivity and pathway analysis

The Oncogenic signaling pathways were explored using GSEA [21]. Drug response prediction was performed using the PRISM and CTRP v2.0 pharmacogenomic datasets, matched to the CMLS expression profile. Drug sensitivity was quantified by area under the curve (AUC) values, and differential responses between CMLS groups were assessed using the pRRophetic and oncoPredict packages.

2.7 Statistical analysis

All analyses were conducted in R (version 4.3.1). Statistical comparisons between groups were performed using Student’s t-test or Wilcoxon rank-sum test for continuous variables, and chi-square or Fisher’s exact test for categorical variables. Survival differences were assessed using log-rank tests and Cox proportional hazards regression. For all analyses involving multiple testing (e.g., GSVA, drug screening), false discovery rate (FDR) correction was applied using the Benjamini–Hochberg method. Results were considered statistically significant at FDR < 0.05 unless otherwise stated.

View original article

HORMONES & CANCER / DISCOVER ONCOLOGY

Like

Share Bookmark

0 0 0 0 0 0 0

More from this channel

Integrated multiomics analysis and machine learning refine molecular subtypes and prognosis for thyroid cancer

Comments (0)