We used the Genotype-Tissue Expression (GTEx) v8 data release [21] (Fig. 1, Additional file 1: Table S1). We selected 46 tissues with at least 80 samples comprising 12,654 RNA-sequencing samples from 717 donors, including 316 smokers, 253 never smokers, and 148 ex-smokers (Fig. 1). All human donors were deceased, with informed consent obtained via next-of-kin consent for the collection and banking of deidentified tissue samples for scientific research. The research protocol was reviewed by Chesapeake Research Review Inc., Roswell Park Cancer Institute’s Office of Research Subject Protection, and the institutional review board of the University of Pennsylvania.
Fig. 1Individuals, tissues, and data modalities analyzed. For all tissues, RNA sequencing and histological images are available. Tissues highlighted in bold also include DNA methylation profiles. We classified GTEx individuals according to their smoking status
Smoking annotationThe smoking annotation was part of the GTEx v8 protected data stored in dbGap (accession number phs000424.v8.p2) and was obtained either from the donors’ medical record or information provided by the donors’ next-of-kin [22]. We manually curated the annotation and classified donors into never smokers, ex-smokers, and smokers. To do so, we used the original MHSMKSTS variable, which encoded information on whether the donor ever smoked. We excluded 25 donors with unknown smoking status. Only cigarette-smoking donors were considered according to the variable MHSMKTP, so we excluded 22 donors annotated as smokers of “Cigar,” “Pipe,” or “Other.” We revised the manual comments under MHSMKCMT to distinguish individuals that quit smoking, which we reannotated as ex-smokers. The variable MHSMKCMT is free text annotated by the donors’ next-of-kin, where the maximum time between smoking cessation and death is more than 30 years, and the minimum is 1 month. From the MHSMKCMT annotations, we further excluded 43 donors with unclear annotations. These excluded annotations include more smokers of non-cigarettes, individuals that smoked for a very short period of time in the past, and intermittent smokers. We restricted the analysis to tissues with at least 80 samples with annotation for all the variables included in our models. In total, we analyzed 11,962 RNA-seq samples from 46 human tissues and 717 post-mortem donors. From the 717 donors, 316 are smokers that smoked a median of 20 cigarettes per day (age mean = 49.9, sd = 13.3), 148 ex-smokers (age mean = 56.9, sd = 11.7), and 253 never smokers (age mean = 52.6, sd = 12.7).
Gene and alternative splicing event quantificationGene and transcript quantifications were based on the GENCODE 26 release annotation (https://www.gencodegenes.org/human/release_26.html) [23]. We downloaded gene counts and TPM quantifications from the GTEx portal (https://gtexportal.org/home/downloads/adult-gtex) [24]. We selected genes with the protein-coding and lincRNA biotype on the GTEx GENCODE v26 gtf file. For the expression analysis, we considered expressed genes per tissue (TPM ≥ 0.1 and ≥ 6 reads in ≥ 20% of tissue samples), excluding genes in the pseudoautosomal region (PAR). In total, we analyzed 26,676 genes (19,255 protein-coding and 7421 lincRNA) across tissues. For the splicing analysis, we downloaded transcripts TPM quantifications from the GTEx portal (https://gtexportal.org/home/downloads/adult-gtex) [24] and we used SUPPA2 [25] to calculate percentages of splicing inclusion (PSI) for 7 different types of splicing events: skipped exon, mutually exclusive exons, alternative 3 prime, alternative 5 prime, retained intron, alternative first exon, and alternative last exon. Specifically, we used SUPPA2 to first generate the dictionary of splicing events from the GENCODE v26 annotation and then computed their PSI values for each sample and splicing event. Each splicing event is defined by a set of isoforms: those that include the exonic/intronic sequence (spliced-in isoform) and those that either exclude it or include an alternative exonic sequence (spliced-out isoform). We used the following criteria to select the alternatively spliced events (ASEs) in each tissue: events in protein-coding and lincRNA genes expressed in each tissue; events quantified in all tissue samples; we excluded events with low complexity (fewer than 15 PSI unique values) or insufficient variability (the most common event is 4 times more common than the second most common); we kept events from expressed isoforms (TPM ≥ 0.5 in ≥ 20% of tissue samples for both the most abundant spliced-in and spliced-out isoforms) and with a quantifiable contribution of the traits of interest (see Hierarchical partition analysis). In total, we tested 57,002 alternative splicing events (16,704 skipping exons, 1142 mutually exclusive exons, 6491 alternative 5′ splice sites, 7412 alternative 3′ splice sites, 3919 intron retentions, 16,681 alternative first exons, and 4653 alternative last exons) across tissues.
Differential expression analysisTo identify differentially expressed genes (DEGs) we used linear-regression models following the voom-limma pipeline [26, 27]. We corrected for technical covariates routinely included in previous GTEx publications [28, 29]. These covariates are related to parameters of donor death (Hardy scale), ischemic time, RNA integrity number (RIN), and sequencing quality control metrics (Exonic rate). We further controlled for unknown sources of variation, mainly related to differences in tissue composition and sequencing batch, by including the first two PEER factors, as PEER1 mostly correlates with cell type heterogeneity (see Fig. S4 A from [30]) and PEER2 with sequencing batch (see Fig. S9a from [31]). Furthermore, conversely to eQTL discovery, the effect of including additional PEER factors on the DEGs discovery is variable across tissues and leads to reduced power to detect expression differences [32]. We also included in the model four demographic traits: the donors’ genetically inferred ancestry [21] (either European American, African American or admixed donors), sex (either male or female), age, and body mass index (BMI). Ancestry and sex were treated as categorical variables, whereas age and BMI were treated as continuous variables.
We also corrected for clinical traits obtained from histopathological annotations, publicly accessible from the GTEx portal [33, 33]. However, we only included in the final models the clinical traits with more than 20 diseased individuals associated with changes in gene expression in a simpler model. We first modeled each clinical trait per tissue, correcting by the previously mentioned covariates, and kept the ones that retrieved at least 5 DEGs (Additional file 2: Table S2). Finally, we ran one model per tissue, correcting by all the technical covariates, the four demographic traits, the chosen clinical traits, and smoking.
We compared log-cpm gene expression values and evaluated the statistical significance of smoking and the four demographic traits. Smoking was defined as a categorical variable with three levels: never smokers, ex-smokers, and smokers, and all possible contrasts were evaluated. We corrected all analyses for multiple testing using the false discovery rate (FDR) through the Benjamini–Hochberg method and considered genes differentially expressed at an adjusted p-value below 0.05.
$$log\mathit2\left(cpm\right)\sim HardyScale+IschemicTime+RIN+ExonicRate+PEER1+Peer2+Clinical\;traits+Ancestry+Sex+Age+BMI+Smoking$$
When downsampling, we used the same models using an equal number of never smokers and smokers in sets of 50 different permutations. We first performed 50 iterations, downsampling to 36 never smokers and 36 smokers, which is the minimum number needed to evaluate all tissues. We then downsampled to 50, 80, and 100 samples per category. To report the results, we computed the mean across the 50 permutations.
To validate the genes differentially expressed between smokers and never-smokers, we compared our results to other studies that used independent transcriptome datasets. To obtain a list of previously reported smoking-DEGs in lung, we downloaded and parsed Table S1 from Bosse et al. [34], Supplementary Table S3 A and S3B from Landi et al. [35], and Table S1 from Pintarelli et al. [36]. For blood, we downloaded and parsed Table S2 from Huan et al. [37] and Table S2 from Vink et al. [38]. For adipose tissue, we downloaded and parsed Table 2 from Tsai et al. [17]. To assess the replicability of our results, we identified smoking-DEGs with GTEx data using similar filtering criteria (logFC and FDR thresholds) as those used in each paper. We performed a two-tailed Fisher’s exact test to test if the number of DEGs we identified significantly overlapped with the genes identified in previous studies. A gene was only considered overlapping if it was DE in the same direction in both studies. We used the protein-coding and lincRNA genes expressed in the respective tissue as background.
Additive and interaction effectsIn order to investigate additive effects, we conducted one-tailed Fisher’s exact tests on the common DEGs with smoking and each demographic trait per tissue. We used as background the genes expressed in each tissue (TPM ≥ 0.1 and ≥ 6 reads in ≥ 20% of tissue samples). We ran the analysis per tissue and for each pair combination of the smoking variable and a demographic variable. The cross-tabulations are as follows using age as the demographic trait of example:
Age-DEGs
Non age-DEGs
Smoking-DEGs
X
Z
Non smoking-DEGs
Y
Background − X − Y − Z
where X would correspond to the number of smoking-age-DEGs, Y would correspond to the number of age DEGs minus the number of smoking-age-DEGs, and Z would correspond to the number of smoking DEGs minus the number of smoking-age-DEGs. We derived the odds ratio (OR) from these tables computed as follows:
$$\mathrm O\mathrm R\;=\frac-X-Y-Z)}$$
The interpretation of the odds ratio is as follows: if the OR is higher than 1, age-DEGs and smoking-DEGs overlap more often than expected if the two variables were independent. If the OR is lower than 1, the two variables overlap less than expected by chance.
The p-values were corrected using the Benjamini–Hochberg method, correcting for all the tests performed across tissues per demographic trait. We considered significant effects with a false discovery rate (FDR) below 0.05. To then check if the direction of change was concordant between smoking and the demographic traits, we conducted a chi-square test and corrected for multiple testing in the same way. In this case, from the subset of smoking-age-DEGs, the contingency table would be as follows:
Age-DEGs
Upregulated
Downregulated
Smoking-DEGs
Upregulated
X
Z
Downregulated
Y
B
To investigate interactions, we expanded the linear models described in the section “ Differential expression analysis” in each tissue, adding interaction terms between smoking and the other demographic traits (Ancestry:Smoking + Sex:Smoking + Age:Smoking + BMI:Smoking), for the smoking contrast that compares never smokers and smokers. For an interaction to be tested in a tissue, we required (1) previously found DEGs with both traits involved in the interaction term, and (2) sufficient sample size, which we define as at least 20 samples per group. To determine the number of samples in each combination, we categorized the continuous variables. Age was split into two groups: young (age < 45) and old (age ≤ 45). BMI was divided into three groups: normal (BMI < 25), overweight (25 ≤ BMI < 30), and obese (BMI ≥ 30).
In order to identify the interaction effects between smoking status and genotype on gene expression, we followed the steps in [39]. For each of the gene-variant pairs identified as independent cis-eQTLs in GTEx v8 [21], we ran a linear regression model including genotype, smoking status, and covariates to test the effect of the interaction term Genotype × Smoking on gene expression separately in each tissue. We tested only the gene-variant pairs that had at least 3 samples in every combination. Each tissue was corrected for a different number of PEER factors based on the tissue sample size, as described in the GTEx eQTL analysis [21]. For implementation, we used tensorQTL [40], where we could model separately for each gene and account for multiple testing errors using Benjamini–Hochberg correction with FDR < 0.05.
Differential splicing analysisTo perform differential splicing analysis, we used a similar method to the one described in [32] to allow both a direct comparison with the differential gene expression analysis and a subsequent quantification of the alternative splicing variation explained by each trait (see Hierarchical partition analysis). In short, we modeled percentage of spliced-in (PSI) values with fractional regressions. This method is suited to work with bounded values that can assume the extremes, as is the case for PSI values in 0 and 1. Specifically, we used the R glm function from the R package stats [41] setting family = ‘quasibinomial (“logit'’)’ as a parameter. For each splicing event within each tissue, we fitted logit transformed PSI values with the same covariates used in differential expression analysis and evaluated the statistical significance of smoking and the demographic traits of interest: ancestry, sex, age, and BMI.
$$PSI\;values\sim HardyScale+IschemicTime+RIN+ExonicRate+PEERI+PEER2+Ancestry+Sex+Age+BMI+Smoking$$
To calculate robust standard errors for our coefficients, we used the vcovHC function from the R package sandwich [42] with type = "HC0". To do multiple comparisons for the variables with three levels (e.g., to compute the DSEs between never smokers and smokers, between ex-smokers and smokers, and between never smokers and ex-smokers), we used the function glht from the package multcomp [43], with Tukey’s all-pair comparisons for the mcp function. We corrected all analyses for multiple testing using false discovery rate (FDR) through the Benjamini–Hochberg method implemented in the R package stats [41]. We considered events differentially spliced at an adjusted p-value (FDR) below 0.05.
To investigate the functional consequences of DSEs, we first identified the isoforms that contribute to each splicing event. From these sets of isoforms, we selected the most abundant isoforms per tissue that include (spliced-in) and exclude (spliced-out) the splicing event. Depending on the biotype of these two isoforms, DSEs can then be associated with a switch between protein-coding isoforms, a switch between a protein-coding and a non-coding isoform, or a switch between non-coding isoforms. For those events with a switch between protein-coding isoforms, we executed the pipeline from [32] to analyze the functional consequences of the switches based on functional domains from Pfam [44]. The percentage of events that belong to each switch type was computed using the sum of the events across tissues from the given switch type over the total number of events, without considering repeated events present due to tissue sharing. To test if smoking tends to be associated with an increase in non-coding isoforms, we used a binomial test with a probability of 0.5 as the null hypothesis.
Functional enrichmentWe used the R package clusterProfiler [45] for the different overrepresentation enrichment analyses (ORA) conducted throughout the paper. For each ORA, we defined as background all the expressed genes in the given tissue and analyzed upregulated and downregulated genes independently. We corrected for multiple testing by using the Benjamini–Hochberg method and considered as significant gene sets with an FDR < 0.05. We performed this analysis across several ontologies, including gene ontology (GO), KEGG, and disease ontology (DO). To summarize the enriched GO terms, we employed the orsum package for up- and down-regulated terms separately [46]. The analysis was performed by considering all the enriched terms of each tissue at the same time, allowing for an easier comparison between enrichment results.
Machine learning on histological imagesWe used the histological images of lung, thyroid, pancreas, and esophagus mucosa available in the GTEx Histological Image Viewer (https://gtexportal.org/home/histologyPage) [33], with the aim of building tissue-specific classifiers of smokers and never smokers. For each donor, we downloaded the respective whole slide images (WSI) and divided them into tiles of size 512 × 512 × 3 using PyHist [47] with 2 × for the downsample parameter. We kept the tiles with at least 85% tissue content.
We divided the dataset into train and test with the test set being composed of a subset of donors (n = 85) common between the 4 tissues. This step ensures a more robust model comparison. In each tissue, we used 80% of the train subjects for model training, while we kept the remaining 20% for model validation. For modeling we used the pre-trained convolutional neural network (CNN) Xception [48] changing only the top layers by adding a Max Pooling [49], a Dropout [49], and a sigmoid activation [49] to get the probability of each subject being a smoker. All models were compiled utilizing a weighted binary cross entropy as loss function, and an Adam algorithm with learning rate of \(1\times ^\) as an optimizer for the gradient descent. For each tissue we adjusted different class weights in order to mitigate the impact generated by the imbalanced distribution of cases in the final results (see GitHub). The best model for each tissue was selected based on AUC.
Thyroid follicle detectionWe developed a custom CellProfiler [50] pipeline to detect and measure thyroid follicles. The first step consists of the computational application of the Ponceau-Fuchsin (PF) stain to the original Hematoxylin and Eosin (H&E) stained image tiles. The PF stain hue absorption helps isolate colloid-containing follicles when using adaptive thresholding in grayscale. We further manually adjusted IdentifyObject module parameters in order to identify the separate follicles. On identified follicles, we used the ObjectMeasurement module to calculate area, shape, and size statistics on all thyroid tiles.
Cell-type deconvolutionIn order to perform cell-type deconvolution of the GTEx bulk dataset, we first obtained the processed single-cell dataset from GEO GSE173896 [51]. This dataset integrated 12 individuals (6 smokers, 3 never smokers and 3 ex-smokers) and > 57,000 cells classified into 30 discrete cell types.
Before deconvolution, we processed the single-cell dataset by removing rare cell types (cell types with less than 50 cells across the dataset), and only protein-coding and lincRNA genes in common with the bulk dataset were kept. Next, we used the cleanup.genes function from the BayesPrism R package [52] to further exclude ribosomal, mitochondrial, and sex-chromosome genes, as per package instructions, as well as genes expressed in less than 50 cells. We merged subcell types into unified cellular populations, and identified marker genes using the get.exp.stat function. Marker genes with a log fold change > 0.1 and p-values < 0.01 were selected. Finally, cellular deconvolution was performed based on the expression of these genes.
To compare the macrophage populations between smokers’ groups, while controlling for other sources of variation, we fitted a beta regression model on the macrophage proportions in each individual:
$$CelltypeProportion\sim HardyScale+IschemicTime+Ancestry+Sex+Age+BMI+atelectasis+emphysema+fibrosis+pneumonia+Smoking$$
We evaluated the statistical significance of smoking using the glht function from the package multcomp [53]. We considered the difference significant at a nominal p-value < 0.05.
Differential methylation analysisWe downloaded normalized beta counts of the 754,054 CpGs from the Infinium MethylationEPIC array generated in [54]. We correlated smoking status with their provided PEERs and found significant correlations with PEER3 and PEER4 (Additional file 3: Fig. S4a). Hence, we only corrected for PEER1 and PEER2. We used limma [27] to run linear models on M values [55] and corrected for the following set of covariates to be as similar as possible to the previous analysis:
$$M\;values\sim HardyScale+Ischemic+PEER\mathit1\mathit+PEER\mathit2\mathit+Clinical\mathit\;traits\mathit+Ancestry\mathit+Sex\mathit+Age\mathit+BMI\mathit+Smoking$$
To perform functional enrichment with the EPIC array, we need to take into account that some genes contain more probes than others. For this goal, we used the function gometh from the missMethyl package [56] to get GO:BP terms using as background all 754,054 positions studied from the array.
The overlap between smoking and the other demographic traits was followed in the same way as in the “ Differential expression analysis” section. We also tested interactions but did not find any significant results.
To validate the DMPs between smokers and never smokers, we compared our results to those obtained with independent datasets using arrays. For lung, we used the results in Table 1 obtained using the 450 K array from [57], we downloaded the EPIC array data available in Table S2 from [58], and the 450 K cancer results in Table S1 from [59]. For blood, we used the data on the 450 K from [60] available in Table S2, the data on EPIC in Table S1 from [61], and the data on http://mimeth.pasteur.fr/ from [62]. For the adipose tissue, we downloaded Table S1 from [17] on the 450 K array. We performed two-tailed Fisher’s exact tests to check if the DMPs we identified overlapped with the positions observed in previous studies, with the same direction of effect. The background was either the set of studied positions from the EPIC array or the subset of positions available in the EPIC array that were also probed in the 450 K array.
We tried to identify interaction effects between smoking status and genotype on DNA methylation following similar steps to the “ Differential expression analysis” section, but we did not find significant results after Benjamini–Hochberg correction with FDR < 0.05.
Annotation of DMPsThe DMPs were classified depending on their location at promoters, enhancers, gene bodies, or intergenic, based on the annotations provided in the EPIC v1.0 array manifest b5. Positions annotated as “Promoter_Associated” under “Regulatory_Feature_Group” or under “TSS200” or “TSS1500” under “UCSC_RefGene_group” were assigned as promoters. From the rest, the ones with any value in “Phantom5_Enhancers” were assigned as enhancers. From the rest, the positions including “Body,” “1 stExon,” “ExonBnd,” “3UTR,” or “5UTR” under “UCSC_RefGene_group” were annotated as gene bodies. The rest were annotated as intergenic. We also classified the DMPs into CpG islands, CpG shore, CpG shelf, or open sea, using the variable “Relation_to_UCSC_CpG_Island” from the EPIC manifest. To study the enrichment of smoking-DMPs or smoking-age-DMPs in the different genomic locations, we performed Fisher’s exact test for each region separately for hypomethylated and hypermethylated DMPs and adjusted for multiple testing using Benjamini–Hochberg correction.
To annotate the chromatin states around each array probe, we used the 18 chromatin states inferred with ChromHMM [63, 64] for the lung sample BSS01190 generated by the ROADMAP Epigenomics consortium [63] and analyzed for EpiMap [65]. TssFlnkD, TssFlnk, and TssFlnkU are reported together under “Flanking TSS,” EnhA1 and EnhA2 under “Active enhancer,” and EnhG1 and EnhG2 under “Genic enhancer.” For the blood analysis, we used the PBMC sample BSS01419.
The data was downloaded from https://personal.broadinstitute.org/cboix/epimap/ChromHMM/observed_aux_18_hg19/CALLS/ [66]. To perform enrichment of DMPs in the different chromatin states, we performed Fisher’s exact test similarly to the analysis on the different genomic locations.
To study the enrichment of transcription factor binding sites around DMPs, we downloaded the processed CHIP-seq data on transcription factors for the human v19 lung available in ChipAtlas [67]. We performed Fisher’s exact test for each transcription factor separately for hypomethylated and hypermethylated DMPs, and adjusted for multiple testing using the Benjamini–Hochberg correction. The transcription factors that are part of the Polycomb repressive complex available in the CHIP-seq data were EZH12, SUZ12, YY1, KDM2B, REST, PCGF2, CBX2, CBX8, RNF2, and RYBP.
To study the enrichment in causal CpG, we downloaded the causal CpG reported by Ying et al. [68]. For the set of CpG associated with each trait, we computed the significance of the overlap with smoking-CpG and aging-CpG, using Fisher’s exact test. To test the direction of the effect, we classified each causal CpG as protecting or damaging, according to Ying et al., by checking the direction of the product of the Mendelian randomization effect (b) and age effect (B). Sites with b*B > 0 were defined as protective, whereas sites with b*B < 0 were defined as damaging. We conducted an overlap analysis with the smoking-DMPs and aging-DMPs using the Fisher test. Hyper- and hypomethylated positions were tested independently, and p-values were adjusted for multiple testing.
Correlation between DNA methylation and gene expressionThe CpG-gene pairs were retrieved first from the EPIC v1.0 manifest b5. We considered a probe to be part of a pair for every gene annotated under “UCSC_RefGene_Name.” For the probes annotated as a promoter or enhancer that did not have any assigned genes in the manifest, we annotated the closest gene using the function matchGenes from bumphunter [69]. Note that we will be using the terminology “CpG-associated gene” for the following definitions: either there is a known previous association reported by Illumina in the EPIC manifest (98% of CpG-gene pairs) or the gene is the closest to a promoter or enhancer (2%). We computed Pearson’s correlation for the DMP-DEG pairs and corrected for multiple testing using the Benjamini–Hochberg method. We considered significant correlations with an FDR < 0.05. To compute the percentage of DEGs that significantly correlate with DMPs, we considered DEGs associated with at least one probe in the array.
Multi-Omics Factor Analysis (MOFA)Gene expression and DNA methylation data derived from the EPIC array from the lung tissue of smokers and never smokers were integrated using MOFA [70]. Before the analysis, methylation features were separated in promoters, enhancers, gene body and intergenic features based on the annotations provided in the EPIC v1.0 manifest, as previously described. Each set of methylation features were treated as distinct data modalities. Before data integration, gene expression count data was normalized using variance stabilizing transformation, using the vst function from DESeq2 [71] package, while methylation data was transformed into M-values as previously described. For each data modality, we selected the top 5,000 most variable features. Data was scaled prior to model training. We allowed MOFA to select 12 latent factors, with factors explaining less than 0.01 variance being pruned.
After model training, the latent factors were correlated (Spearman rho) with demographic and technical traits (sex, age, BMI, smoking status, and PEER factors). We considered a latent factor significantly correlated with a trait at an FDR < 0.05, after multiple test corrections across all demographic trait–latent factor combinations.
Enrichment analysis of MOFA factorsMOFA infers a low-dimensional representation of the data by identifying latent factors that capture the primary sources of variability across multiple modalities. This process generates a weight matrix, where each feature (i.e., gene or CpG) is assigned a weight for each latent factor, reflecting its contribution to the underlying factor. We leveraged this weight matrix across data modalities and performed gene set enrichment analysis for GO terms using the run_enrichement function from the MOFA package (Argelat et al.). We performed the analysis for positive and negative weights independently. To test for functional enrichment in CpGs, we used their associated gene in the “UCSC_RefGene_Name” column from the EPIC v1.0 manifest b5. We corrected for multiple testing in each analysis and considered terms significant at an FDR < 0.05. Finally, enriched terms across all analyses were summarized using the orsum package [
Comments (0)