Unsupervised Feature Selection to Identify Important ICD-10 and ATC Codes for Machine Learning on a Cohort of Patients With Coronary Heart Disease: Retrospective Study


Introduction

Machine learning is increasingly being used in health care to analyze patient data and provide insights on improving health outcomes and the quality of care []. With the rise of electronic health data (EHD) and the large volume of data recorded per patient in hospitals, there are substantial opportunities to train machine learning models for a variety of applications, such as disease prediction and diagnosis, outcome prediction, and treatment planning [,]. EHD are a valuable source of information on a patient, containing details on their demographics, hospital visits, medical diagnoses, physiological measurements, and treatments received []. However, despite the opportunities offered by these large data sets, there are challenges in terms of data quality, privacy, and the complexity of medical conditions []. From a machine learning perspective, EHD can include many irrelevant and redundant features, whose direct use can lead to the "curse of dimensionality": the high dimensionality of the data makes it more difficult to extract meaningful patterns and relationships []. Therefore, it is important to apply appropriate techniques for dimensionality reduction and feature engineering to address this challenge and improve the effectiveness of predictive models built from EHD.

Feature selection is one of the critical aspects of machine learning. It involves selecting a subset of relevant features that are the most useful for predicting a target variable. In the case of medical data, these features could include patient demographics, medical history, laboratory test results, and diagnosis codes []. Feature selection is essential because it can improve the accuracy and performance of machine learning models by reducing the number of irrelevant or redundant features and avoiding overfitting []. Unsupervised feature selection is used when there is no target variable available to guide the selection of features. Unlike supervised feature selection, which chooses the features that best predict a given target variable, unsupervised feature selection methods rely on the intrinsic structure of the data to identify the most important features. This helps the selected features remain unbiased and perform well when no labeled data are available. It can also reduce the risk of overfitting to a particular target variable and ensure robustness to new target variables []. This is an important advantage in health care, where collecting labeled data is usually difficult and the same data are often used to predict multiple target variables.

Generally, there are 3 main categories of feature selection methods: filter, wrapper, and embedded methods. Filter methods use statistical measures such as variance to rank individual features within a data set and select the features that maximize the desired criteria. However, they usually lack the ability to consider interactions between features []. Wrapper methods, on the other hand, select features that optimize an objective function for a clustering algorithm. These methods are therefore generally specific to particular clustering algorithms and may not be suitable for use with other algorithms. Wrapper methods can detect potential relationships between features, but this often comes at increased computational cost []. Embedded methods also take feature relationships into account but generally do so more efficiently by incorporating feature selection into the learning phase of another algorithm. Lasso regularization is one of the best-known embedded methods and can be applied to a variety of machine learning models [].

The International Classification of Diseases, Tenth Revision (ICD-10) is a method of classifying diseases that was created by the World Health Organization and is used internationally []. It categorizes diseases based on their underlying cause, characteristics, symptoms, and location in the body and uses codes to represent each disease. The ICD-10 system organizes thousands of codes in a hierarchical structure that includes chapters, sections, categories, and expansion codes. Within this structure, section codes and their corresponding chapter codes can be thought of as child-parent relationships, with each ICD-10 code serving as a node in the classification system. The same relationship applies to categories and sections, as well as expansion codes and categories. The high number of codes in this system is one of the major challenges of using these codes in machine learning applications []. It is worth noting that Canada has added or changed some codes in the lower levels according to its health care system requirements (ICD-10, Canada [ICD-10-CA]) [].

Similar to International Classification of Diseases (ICD) codes, the Anatomical Therapeutic Chemical (ATC) classification system, developed by the World Health Organization Collaborating Centre for Drug Statistics Methodology, is an international tool for the active and systematic categorization of active pharmaceutical ingredients []. ATC codes are also structured hierarchically and are assigned based on the organ or system they impact, as well as their therapeutic, pharmacological, and chemical properties. This hierarchical system comprises 5 distinct levels, with the lower levels providing detailed information about the pharmacological subgroup and chemical substance, and the highest level representing the anatomical main group. As in the ICD-10, the ATC’s hierarchy introduces child-parent relationships at each level.

In this research, we used 3 administrative databases comprising ICD-10 and ATC codes pertaining to patients with coronary artery disease (CAD). These databases, relevant to acute care, ambulatory care, and pharmacy facilities, were used to select the most insightful codes characterizing this cohort.


Methods

Data Set and Preprocessing

The Alberta Provincial Project for Outcome Assessment in Coronary Heart Disease (APPROACH) registry [] is one of the most comprehensive data repositories of CAD management in the world, matching unique disease phenotypes with rich clinical information and relevant outcomes for patients in Alberta, Canada, who have undergone diagnostic cardiac catheterization or revascularization procedures. Our cohort’s patients were selected from the APPROACH registry. These patients underwent diagnostic angiography between January 2009 and March 2019 at 1 of the following 3 hospitals in Alberta: Foothills Medical Centre, University of Alberta Hospital, and Royal Alexandra Hospital. We excluded patients with ST elevation myocardial infarction from the study to focus on nonemergency CAD.

Discharge Abstract Database (DAD), National Ambulatory Care Reporting System (NACRS), and Pharmaceutical Information Network (PIN) data for the abovementioned patients were extracted from Alberta provincial health records. The DAD contains summary hospitalization information from all acute care facilities in Alberta. The NACRS includes all visits to ambulatory care facilities (ie, emergency department, urgent care, and day surgery visits) in the province as well as some nonabstracted data from other specialty clinics. The PIN is based on a system that collects all prescription medicine dispensations from pharmacies all over Alberta.

In the DAD and NACRS, for each patient, we aggregated all ICD-10-CA codes of hospital admissions or physician visits every 3 months following the first admission date (all codes in that period were treated as the codes of 1 record). This ensures that chronic diseases are captured more comprehensively in fewer records and reduces the effect of noisy records. We followed a similar procedure for ATC codes in the PIN data set, aggregating the codes every 6 months, since most medication prescription refills did not extend beyond 6 months. We one-hot encoded the ICD-10-CA and ATC codes and their parent nodes for each record. For example, if the ICD-10-CA code "I251" was present, "I25," "I20-I25," and "Chapter IX" were also encoded in the one-hot table. Similarly, if the ATC code "C07AB02" was present, "C07AB," "C07A," "C07," and "C" were also encoded. We denote the number of unique ICD-10-CA or ATC codes in a data set by NAll.
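To make the parent-expansion step concrete, the following minimal Python sketch one-hot encodes ATC codes together with their ancestors. The ATC hierarchy can be recovered from fixed-width code prefixes; the ICD-10-CA hierarchy (category, block such as "I20-I25," and chapter) is not a pure string prefix and would require a lookup table, which is omitted here. The example records and helper names are illustrative, not the actual implementation.

```python
import pandas as pd

# ATC parents are fixed-width prefixes of the full code: levels 1-4 use
# 1, 3, 4, and 5 characters (eg, C07AB02 -> C, C07, C07A, C07AB).
def atc_ancestors(code: str) -> list[str]:
    return [code[:n] for n in (1, 3, 4, 5) if n < len(code)]

def expand_record(codes: set[str]) -> set[str]:
    """One aggregated record = all codes in the window plus all of their ancestors."""
    expanded = set(codes)
    for code in codes:
        expanded.update(atc_ancestors(code))
    return expanded

# Two hypothetical 6-month aggregation windows for one patient.
records = [{"C07AB02", "C10AA05"}, {"C07AB02"}]
expanded = [expand_record(r) for r in records]

# One-hot table: rows = aggregated records, columns = all unique codes (NAll).
all_codes = sorted(set().union(*expanded))
onehot = pd.DataFrame([{c: int(c in r) for c in all_codes} for r in expanded])
print(onehot)
```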

To validate the performance of the selected features in a real clinical problem, we pulled the mortality data of the patients enrolled in the cohort from the Vital Statistics Database and matched them with the aggregated records to determine 90-day mortality following the end of the last procedure.

Feature Selection

The following unsupervised algorithms were used for feature selection:

Concrete autoencoder (CAE) []: This method uses the continuous relaxation of discrete random (concrete) variables [] and the Gumbel-Softmax reparameterization trick to construct a special neural network layer that transforms discrete random variables into continuous ones, allowing efficient computation and optimization with gradient-based methods. The reparameterization trick allows the use of a softmax function in this layer, which is differentiable, unlike the argmax function. This characteristic is useful for designing an autoencoder in which features are selected in the concrete layer (the encoder) through the softmax operation, and a common neural network (the decoder) reconstructs the original feature space from the selected features. During training, a temperature parameter is gradually decreased, allowing the concrete selector layer to try different features in the initial epochs and to behave more like an argmax function in the final epochs to keep the best features. After training, an argmax over the weights identifies the features passed to the neurons of the encoder layer. One major problem with this method is that it may converge to a solution in which some neurons select duplicate features (ie, fewer than the desired number of distinct features are selected).

Autoencoder-inspired unsupervised feature selection (AEFS) []: This method combines autoencoder and group lasso tasks by applying an L1,2 regularization to the weights of the autoencoder. The autoencoder maps the inputs to a latent space and then reconstructs the inputs from that space, while the L1,2 regularization drives the weights toward 0 so that a smaller number of features is selected. The neural network structure of the autoencoder enables the model to capture both linear and nonlinear behavior of the data. After training, the features with the highest weight values in the first layer can be selected as the most informative. The authors reported promising results in computer vision tasks.

Principal feature analysis (PFA) []: This method selects features based on principal component analysis (PCA). The most important features are found by applying k-means clustering to the components of the PCA and selecting the feature dominating each cluster (ie, the one closest to the cluster mean). This algorithm was primarily designed for computer vision.

Unsupervised feature selection for multicluster data (MCFS) []: This approach prioritizes preserving the multicluster structure of the data. The algorithm constructs a nearest neighbor graph and solves a sparse eigen-problem to find the top eigenvectors corresponding to the smallest eigenvalues. An L1-regularized least squares problem is then optimized to find the linear weights between the features and the eigenvectors. The MCFS score is defined as the maximum weight of each feature across the different clusters, and the features with the highest scores are selected.

Laplacian score (LS) []: The LS algorithm uses a nearest neighbor graph to capture the local structure of the data in an affinity matrix. For each feature, its adjusted variation is calculated by removing the feature's mean, normalized by a degree matrix, which is derived from the row sums of the affinity matrix. The Laplacian matrix, essential for this calculation, is formed by subtracting the affinity matrix from the degree matrix. The significance of each feature is then assessed by the LS, the ratio of the feature's ability to preserve local information (captured by the alignment of its adjusted variation with the Laplacian matrix) to its overall variance (measured by its alignment with the degree matrix). The lower the LS, the more relevant the feature for representing the intrinsic geometry of the data set (see the sketch after this list).
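As an illustration of the last method, the following sketch computes Laplacian scores with binary 0/1 weights on a 5-nearest-neighbor graph (ie, the large-t limit used in this study), following the standard formulation in which the graph is built over the records. It is a small-scale illustration, not the optimized implementation used for our data sets.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_scores(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Laplacian score for each column of X (shape: n_records x n_features)."""
    # Symmetrized k-nearest-neighbor affinity matrix with binary 0/1 weights.
    W = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    W = W.maximum(W.T).toarray()
    d = W.sum(axis=1)                    # degree vector (diagonal of D)
    L = np.diag(d) - W                   # graph Laplacian: L = D - W
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r].astype(float)
        f_tilde = f - (f @ d) / d.sum()  # remove the D-weighted mean
        local_variation = f_tilde @ L @ f_tilde
        weighted_variance = f_tilde @ (d * f_tilde)
        scores[r] = local_variation / weighted_variance if weighted_variance > 0 else np.inf
    return scores  # lower score = feature better preserves local structure

# Keep the NBest most relevant features (smallest scores):
# selected = np.argsort(laplacian_scores(X))[:100]
```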

We applied the LS, AEFS, PFA, MCFS, and CAE algorithms to a 67% training data set (split at the patient level) of one-hot encoded features to select the best 100 features (NBest=100; chosen based on preliminary experimentation), with the following specifications.

For the AEFS method, we used a single hidden-layer autoencoder and optimized the loss function as described in Han et al [], with α=0.001 as the trade-off parameter between the reconstruction loss and the regularization term and β=0.1 as the penalty parameter for the weight decay regularization. The choice of these parameters was based on preliminary experimentation on a small subset of the data, exploring the α and β values in .

For the PFA method, we used incremental PCA instead of the standard PCA in the original paper [] due to the high computational cost, with a batch size of 2×NAll. We decomposed the data into NAll/2 components and then applied k-means clustering to find NBest clusters. We also tried other values as the number of PCA components in the preliminary experiments.
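A compact sketch of this PFA variant, using scikit-learn's IncrementalPCA and k-means, is shown below; the function name and defaults are illustrative, not the study implementation.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import KMeans

def principal_feature_analysis(X: np.ndarray, n_select: int = 100) -> list[int]:
    """PFA sketch: cluster the rows of the PCA loading matrix and keep, from each
    cluster, the feature closest to the cluster centroid."""
    n_all = X.shape[1]
    ipca = IncrementalPCA(n_components=n_all // 2, batch_size=2 * n_all)
    ipca.fit(X)
    # Row i of A is the representation of feature i in the principal subspace.
    A = ipca.components_.T
    km = KMeans(n_clusters=n_select, n_init=10).fit(A)
    selected = []
    for c in range(n_select):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(A[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return sorted(selected)
```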

To use the LS and MCFS methods for feature selection, we used the Euclidean distances between features to construct a nearest neighbor graph G based on the 5 nearest neighbors. For the LS method, we set the weights of the connected nodes of G to 1, assuming a large t in the LS formulation. We then computed the LS for each feature and selected the top-ranked features. Due to the high computational resources required by the LS and MCFS methods, we did not explore different parameters and used the settings suggested by the implementation code of these algorithms.

As the structure of the loss function allows us to prioritize some target variables, the CAE method was applied in 2 different ways: with and without adjusted feature weights. The reason for adjusting the weights is that, with many correlated features in the ICD-10-CA and ATC code data sets, the model may choose among them randomly []. Therefore, we applied the function in equation 1 as the class weights of the features in the loss function of the model:

WF = 1 / (1 + d(F))   (1)

where WF is the weight for feature F and d(F) is the depth of feature F as a node of the ICD-10-CA or ATC tree. This weight adjustment encourages the model to give more importance to the features at the top of the tree, which generalize better in clinical settings. In the rest of the paper, this variant of the CAE model will be referred to as the CAE with weight adjustment (CAEWW) and the regular CAE model will be referred to as the CAE with no weight adjustment (CAENW).
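A minimal sketch of equation 1 as a weighting function follows; the 0-based depth convention (chapters or anatomical main groups at depth 0) is our assumption for illustration.

```python
import numpy as np

def feature_weights(depths: np.ndarray) -> np.ndarray:
    """Equation 1: WF = 1 / (1 + d(F)); shallower (more general) codes get larger weights."""
    return 1.0 / (1.0 + depths)

# Example: depths of [Chapter IX, I20-I25, I25, I251] under a 0-based convention.
print(feature_weights(np.array([0, 1, 2, 3])))  # [1.0, 0.5, 0.333..., 0.25]
```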

We defined NBest neurons in the concrete selector layer and used a learning rate of 0.001, a batch size of 64, and 1000 epochs. We also controlled the learning of the concrete selector layer by the temperature parameter that started from 20 and decreased to 0.01 exponentially (this annealing strategy was suggested by Abid et al [] for better convergence). The decoder of the CAE was a feed-forward neural network with 2 hidden layers with 64 neurons and used a sigmoid activation function for the output layer and a leaky rectified linear unit activation function for the other layers. The learning rate, number of neurons, and the layers were determined based on preliminary experiments for the fastest convergence of the autoencoder.
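The following PyTorch sketch shows the core of a concrete selector layer with Gumbel-Softmax sampling and the exponential temperature annealing described above. It is a simplified illustration (the decoder network and training loop are omitted), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcreteSelector(nn.Module):
    """Concrete selector layer (after Abid et al): each of the n_select neurons
    holds a relaxed one-hot distribution over the d input features."""
    def __init__(self, in_features: int, n_select: int):
        super().__init__()
        self.logits = nn.Parameter(torch.randn(n_select, in_features) * 0.01)
        self.temperature = 20.0  # annealed toward 0.01 during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Gumbel noise makes the softmax a sample from a concrete distribution.
            gumbel = -torch.log(-torch.log(torch.rand_like(self.logits) + 1e-20) + 1e-20)
            weights = F.softmax((self.logits + gumbel) / self.temperature, dim=-1)
        else:
            # At evaluation time, behave like argmax (hard feature selection).
            weights = F.one_hot(self.logits.argmax(dim=-1), x.shape[-1]).float()
        return x @ weights.t()  # shape: (batch, n_select)

    def selected_features(self) -> torch.Tensor:
        return self.logits.argmax(dim=-1)  # indices of the chosen input features

# Exponential annealing from T0=20 to TB=0.01 over E epochs:
# selector.temperature = 20.0 * (0.01 / 20.0) ** (epoch / (E - 1))
```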

Evaluation of Selected Features: Reconstruction of Initial Feature Space

To evaluate the effectiveness of the selected features, we trained a simple feed-forward neural network model using the chosen features to reconstruct the original feature space for each data set separately. The neural network consisted of 2 hidden layers, each with 64 neurons and a leaky rectified linear unit activation function with a 10% dropout rate, and a sigmoid activation function in the output layer. We trained the model using the same training set used in the feature selection step and evaluated its performance on the remaining 33% test set using binary cross entropy. We also calculated the accuracy of each feature selection method to determine which method produced the most accurate results. One challenge in comparing models with a large number of targets is that the accuracy values are inflated: most of the targets are heavily imbalanced (ie, mostly 0s), so the models can predict them easily. To circumvent this issue, we used a 2-tailed t test analysis and compared the per-class accuracy values with the accuracy of a baseline model that simply outputs the mode of the training data for each class, regardless of the input.
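A sketch of this baseline comparison is shown below; we assume a paired t test across the per-code accuracies, which the description above implies but does not state explicitly.

```python
import numpy as np
from scipy import stats

def per_target_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Accuracy computed separately for each reconstruction target (one per code)."""
    return (y_true == y_pred).mean(axis=0)

def mode_baseline_predictions(y_train: np.ndarray, n_test: int) -> np.ndarray:
    """Always predict each target's most frequent value (mode) in the training set."""
    modes = (y_train.mean(axis=0) >= 0.5).astype(int)
    return np.tile(modes, (n_test, 1))

# Paired, 2-tailed t test over per-target accuracies: model vs mode baseline.
# acc_model = per_target_accuracy(y_test, model_predictions)
# acc_base = per_target_accuracy(y_test, mode_baseline_predictions(y_train, len(y_test)))
# t_stat, p_value = stats.ttest_rel(acc_model, acc_base)
```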

Evaluation of Selected Features: Prediction of 90-Day Mortality

To demonstrate the utility of unsupervised feature selection methods in a supervised setting, we conducted a case study in which we used the selected features from each method to predict 90-day mortality following the end of the last procedure for each data set separately. Since our data sets were highly imbalanced, with only ~6%, ~2%, and ~1% of the aggregated records in the DAD, NACRS, and PIN data sets, respectively, leading to 90-day mortality, we upsampled the minority class using random sampling to balance the training sets. We then trained extreme gradient boosting (XGBoost) models on the training sets with 5-fold cross-validation to tune the hyperparameters of each model. We used the best models to predict the binary outcome variables on the test sets and measured their performance. XGBoost was selected for its efficiency with sparse data, which was crucial for our data sets, and its regularization features help prevent overfitting []. Additionally, its ability to provide interpretable models through tree-based Shapley values aligns with our objective to not only predict mortality but also understand the contributing factors []. XGBoost's scalability across multiple processors and its speed (for both training [] and Shapley analysis []) are also beneficial for processing large volumes of data and for complex model tuning. After training the mortality prediction models for each method and data set, we calculated the tree-based Shapley values corresponding to the features. This allowed us to rank the importance of each feature and explain their roles in predicting mortality.
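The following sketch outlines this pipeline (random upsampling, 5-fold cross-validated tuning, and tree-based Shapley values) on placeholder data; the hyperparameter grid and data shapes are illustrative, not those used in the study.

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.utils import resample

# Placeholder data standing in for the selected one-hot features and mortality labels.
X, y = make_classification(n_samples=600, n_features=20, weights=[0.94], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=0)

# Upsample the minority (mortality) class by random sampling with replacement.
pos, neg = np.where(y_train == 1)[0], np.where(y_train == 0)[0]
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=0)
idx = np.concatenate([neg, pos_up])
X_bal, y_bal = X_train[idx], y_train[idx]

# 5-fold cross-validated hyperparameter tuning (illustrative grid).
search = GridSearchCV(
    xgb.XGBClassifier(eval_metric="logloss"),
    param_grid={"max_depth": [3, 6], "n_estimators": [100, 300], "learning_rate": [0.05, 0.1]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_bal, y_bal)

# Tree-based Shapley values: rank features by mean absolute SHAP value.
explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer.shap_values(X_test)
importance = np.abs(shap_values).mean(axis=0)
print(importance.argsort()[::-1][:20])  # indices of the 20 most important features
```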

We have made the implementation code for the methods discussed available at our GitHub repository [].

Ethical Considerations

This study received ethics approval from the Conjoint Health Research Ethics Board at the University of Calgary (REB20-1879). Informed consent was waived due to the retrospective nature of the data and the large number of patients involved, making it impractical to seek consent from each patient. All data were deidentified. No compensation was provided to the participants as the study did not involve direct participant interaction.


Results

Data Set Description

Table 1 summarizes the characteristics of the patients in the cohort at the time of their initial catheterization. The total numbers of patients with at least 1 record in the respective data sets, as well as the time ranges for each data set, are provided in Table 2. The aggregation procedure described in the Methods section reduced the number of records to the values listed in the "Aggregated records" row; Table 2 also includes the total number of codes (unique ICD-10-CA or ATC codes and their parent codes) in each data set, along with the average number of codes per record. The percentages of the 20 most common ICD-10-CA and ATC codes within each processed data set are illustrated in . Within the data set, there were 9942 cases corresponding to 90-day mortality, resulting in a 20% mortality rate in the cohort. The final aggregated data for each data set were split into 67% training sets and 33% test sets at the patient level.

Table 1. Key characteristics of the patients with CAD enrolled in the cohort (N=51,506).

Total population, n (%): 51,506 (100)
Sex, n (%)
  Female: 12,875 (25)
  Male: 38,631 (75)
Age (years), mean (SD): 66.09 (11.41)
BMI (kg/m2), mean (SD): 29.51 (7.45)
  Missing data, n (%): 8449 (16.4)
CAD type, n (%)
  Non-ST elevation myocardial infarction: 24,119 (46.83)
  Unstable angina: 10,671 (20.72)
  Stable angina: 9832 (19.09)
  Missing data: 6884 (13.37)
Canadian Cardiovascular Society angina grade, n (%)
  I (strenuous): 1309 (2.54)
  II (slight limit): 4688 (9.1)
  III (marked limit): 2581 (5.01)
  IVa (hospitalized with acute coronary syndrome): 21,117 (41)
  IVb: 7513 (14.59)
  IVc: 1627 (3.16)
  Atypical: 698 (1.36)
  Other or missing data: 11,973 (23.25)
Diabetes, n (%)
  No diabetes: 37,544 (72.89)
  Type II: 12,067 (23.43)
  Type I: 806 (1.56)
  Other: 1089 (2.11)
Dyslipidemia, n (%): 32,967 (64.01)
Heart failure, n (%): 3689 (7.16)
Atrial fibrillation or flutter, n (%): 1220 (2.37)
Hypertension, n (%): 32,264 (62.64)
Angina, n (%): 2559 (4.97)
Family history of CAD, n (%): 15,209 (29.53)
Smoking, n (%)
  Never: 25,822 (50.13)
  Current: 11,196 (21.74)
  Past: 14,488 (28.13)
Chronic lung disease, n (%): 5318 (10.33)
Cerebrovascular disease, n (%): 2040 (3.96)
Psychiatric history, n (%): 1097 (2.13)
Venous insufficiency, n (%): 476 (0.92)
Alcohol consumption, n (%): 599 (1.16)
Extent of CAD, n (%)
  3 VDs: 247 (0.48)
  3 VDs (one >75%): 7765 (15.08)
  3 VDs (>75% proximal LAD): 5704 (11.07)
  3 VDs (proximal LAD): 3318 (6.44)
  2 VDs: 5392 (10.47)
  2 VDs (>75% LAD): 569 (1.1)
  2 VDs (both >75%): 5215 (10.13)
  2 VDs (>75% proximal LAD): 2819 (5.47)
  1 VD (>75% proximal LAD): 2299 (4.46)
  1 VD (>75%): 8504 (16.51)
  1 VD (50%-75%): 4032 (7.83)
  Severe left main disease: 3058 (5.94)
  Left main disease: 2584 (5.02)

CAD: coronary artery disease.

VD: vessel disease.

LAD: left anterior descending.

Table 2. Summary statistics of the DAD, NACRS, and PIN data sets.

Summary statistic: DAD; NACRS; PIN
Patients with at least 1 record, n: 49,075; 50,628; 49,052
Records, n: 273,910; 3,974,403; 28,807,136
Aggregated records, n: 166,083; 173,507; 997,997
Unique ICD-10-CA or ATC codes and their parent codes, n: 9651; 7803; 2315
Codes per aggregated record, mean (SD): 24.90 (16.55); 15.27 (12.55); 33.31 (18.95)
Time range: 2004-2022; 2010-2022; 2004-2022

DAD: Discharge Abstract Database.

NACRS: National Ambulatory Care Reporting System.

PIN: Pharmaceutical Information Network.

ICD-10-CA: International Classification of Diseases, Tenth Revision, Canada.

ATC: Anatomical Therapeutic Chemical.

Performances of the Feature Selection Methods

Table 3 shows the accuracies and binary cross entropies of the models based on the selected features from each method. Table 4 shows the accuracy, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) metrics of the XGBoost models in predicting 90-day mortality.

Table 3. Average accuracy and binary cross entropy (BCE) loss of different sets of selected features in reconstructing the original feature space in a neural network structure. Values are reported as accuracy, mean (95% CI) / BCE, mean (95% CI) for each data set.

CAEWW: DAD 0.9992 (0.9992-0.9993) / 0.0121 (0.0121-0.0121); NACRS 0.9994 (0.9994-0.9995) / 0.0091 (0.0091-0.0091); PIN 0.9972 (0.9969-0.9975) / 0.0432 (0.0432-0.0432)
CAENW: DAD 0.9992 (0.9991-0.9993) / 0.0121 (0.0121-0.0121); NACRS 0.9994 (0.9993-0.9994) / 0.0094 (0.0094-0.0094); PIN 0.9972 (0.9969-0.9974) / 0.0438 (0.0438-0.0438)
AEFS: DAD 0.9976 (0.9972-0.9980) / 0.0370 (0.0370-0.0370); NACRS 0.9982 (0.9979-0.9985) / 0.0274 (0.0274-0.0274); PIN 0.9884 (0.9867-0.9901) / 0.1794 (0.1794-0.1794)
MCFS: DAD 0.9991 (0.9990-0.9991) / 0.0145 (0.0145-0.0145); NACRS 0.9992 (0.9992-0.9993) / 0.0117 (0.0117-0.0117); PIN 0.9956 (0.9951-0.9962) / 0.0677 (0.0677-0.0677)
PFA: DAD 0.9975 (0.9971-0.9979) / 0.0382 (0.0382-0.0382); NACRS 0.9981 (0.9978-0.9985) / 0.0286 (0.0286-0.0286); PIN 0.9871 (0.9852-0.9891) / 0.1982 (0.1982-0.1982)
LS: DAD 0.9989 (0.9988-0.9990) / 0.0165 (0.0165-0.0165); NACRS 0.9991 (0.9990-0.9992) / 0.0136 (0.0136-0.0136); PIN 0.9945 (0.9938-0.9952) / 0.0850 (0.0850-0.0850)
Mode of the training set (baseline model): DAD 0.9975 (0.9971-0.9979) / 0.0384 (0.0322-0.0447); NACRS 0.9981 (0.9978-0.9984) / 0.0294 (0.0245-0.0342); PIN 0.9870 (0.9850-0.9889) / 0.2012 (0.1712-0.2312)

DAD: Discharge Abstract Database.

NACRS: National Ambulatory Care Reporting System.

PIN: Pharmaceutical Information Network.

CAEWW: concrete autoencoder with weight adjustment.

Significantly different from the baseline model that outputs the mode of each class (P<.05). The P values are presented in Table S1 of .

CAENW: concrete autoencoder with no weight adjustment.

AEFS: autoencoder-inspired unsupervised feature selection.

MCFS: unsupervised feature selection for multicluster data.

PFA: principal feature analysis.

LS: Laplacian score.

Table 4. Performance of the extreme gradient boosting (XGBoost) model in predicting 90-day mortality using different sets of selected features. Values are reported as accuracy / F1-score / AUC-ROC for each data set.

CAEWW: DAD 0.86 / 0.37 / 0.87; NACRS 0.85 / 0.15 / 0.75; PIN 0.84 / 0.10 / 0.82
CAENW: DAD 0.86 / 0.36 / 0.87; NACRS 0.86 / 0.15 / 0.75; PIN 0.83 / 0.10 / 0.82
AEFS: DAD 0.88 / 0.21 / 0.61; NACRS 0.90 / 0.09 / 0.56; PIN 0.85 / 0.08 / 0.69
MCFS: DAD 0.86 / 0.37 / 0.89; NACRS 0.84 / 0.13 / 0.74; PIN 0.81 / 0.09 / 0.84
PFA: DAD 0.92 / 0.05 / 0.50; NACRS 0.97 / 0.02 / 0.50; PIN 0.93 / 0.05 / 0.54
LS: DAD 0.77 / 0.26 / 0.81; NACRS 0.80 / 0.11 / 0.73; PIN 0.76 / 0.08 / 0.82

DAD: Discharge Abstract Database.

NACRS: National Ambulatory Care Reporting System.

PIN: Pharmaceutical Information Network.

AUC-ROC: area under the receiver operating characteristic curve.

CAEWW: concrete autoencoder with weight adjustment.

CAENW: concrete autoencoder with no weight adjustment.

Significantly different from the AUC-ROC of the model trained on CAEWW features in their corresponding data set (P<.05) using the DeLong test []. The P values are presented in Table S2 in .

AEFS: autoencoder-inspired unsupervised feature selection.

MCFS: unsupervised feature selection for multicluster data.

PFA: principal feature analysis.

LS: Laplacian score.

Both tables indicate that the CAE methods generally selected superior features compared to the other algorithms. Adjusting the weights within the CAE improved the performance of feature space reconstruction slightly. In terms of predicting 90-day mortality, the CAE methods again performed better than the other methods, as evidenced by the AUC-ROC. This superior performance was statistically significant in most instances (P<.05), according to the DeLong test for AUC-ROC. Furthermore, the McNemar test revealed a significant difference between the overall performance of the mortality prediction models trained on the features of the CAEWW method and those based on the other methods (P<.05). The P values of the McNemar and DeLong tests can be found in Table S2 of .

Figure 1 shows the log-scale histograms of the original feature space reconstruction accuracy for each ICD-10-CA or ATC code for the different feature selection methods. It shows that CAEWW and CAENW were the best methods in terms of reconstructing the majority of the features with high accuracy. The other methods, despite having high average accuracy, performed poorly in reconstructing some of the features.

Figure 1. Log-scale histograms of the initial feature space reconstruction accuracy in each International Classification of Diseases (ICD) code for different feature selection methods and different data sets: (A) concrete autoencoder with weight adjustment (CAEWW), (B) concrete autoencoder with no weight adjustment (CAENW), (C) autoencoder-inspired unsupervised feature selection (AEFS), (D) unsupervised feature selection for multicluster data (MCFS), (E) principal feature analysis (PFA), and (F) Laplacian score (LS). DAD: Discharge Abstract Database; NACRS: National Ambulatory Care Reporting System; PIN: Pharmaceutical Information Network.

Characteristics of the Selected Features

We also calculated the average depths of the codes selected by each method and compared them (using a 2-tailed t test analysis) against the CAEWW method, which is intended to select more general codes (ie, smaller depths). The CAEWW method selected codes with average depths of 1.38, 1.42, and 1.99 for the DAD, NACRS, and PIN data sets, respectively. Although CAEWW's code depths were significantly lower than the depths of the codes selected by the other methods (all P<.05), there was no significant difference between the CAEWW and CAENW methods (average depths of 1.48, 1.45, and 2.03; all P>.05). The P values can be found in Table S3 of . Figure 2 illustrates the difference in average code depth among the different methods.

We used the average of the mean absolute Shapley values from each mortality prediction model as an index of the importance of the features selected by each method. To compare the CAEWW method with the other methods across the 3 different data sets, we conducted a 2-tailed t test analysis. The CAEWW method did not show any significant difference from the CAENW method across all data sets (all P>.05). However, the CAEWW method did yield significantly higher mean absolute Shapley values than the AEFS and PFA methods (all P<.001). The AEFS and PFA methods had the lowest Shapley values of all the methods, indicating that they selected lower-quality features for this task. The P values are available in Table S4 in . Figure 3 illustrates the aforementioned differences in mean absolute Shapley values. Figure 4 shows the Shapley plots of the 20 most important features selected by the CAEWW method across the different data sets. The corresponding Shapley plots for the other methods can be found in Figures S1-S5 of . Additionally, Tables S1-S18 of include all chosen features, detailed descriptions, and the average absolute Shapley values across all data sets and methods.

Figure 2. Average depths of the selected codes by each method in the ICD-10-CA or ATC tree. Methods with average depths significantly (P<.05) larger than the CAEWW method in their corresponding data set are marked with asterisks (*). AEFS: autoencoder-inspired unsupervised feature selection; ATC: Anatomical Therapeutic Chemical; CAENW: concrete autoencoder with no weight adjustment; CAEWW: concrete autoencoder with weight adjustment; DAD: Discharge Abstract Database; ICD-10-CA: International Classification of Diseases, Tenth Revision, Canada; LS: Laplacian score; MCFS: unsupervised feature selection for multicluster data; NACRS: National Ambulatory Care Reporting System; PFA: principal feature analysis; PIN: Pharmaceutical Information Network.

Figure 3. Average of mean absolute Shapley values of features in each mortality prediction model. Methods with average values significantly (P<.05) smaller than the CAEWW method in their corresponding data set are marked with asterisks (*). AEFS: autoencoder-inspired unsupervised feature selection; CAENW: concrete autoencoder with no weight adjustment; CAEWW: concrete autoencoder with weight adjustment; DAD: Discharge Abstract Database; LS: Laplacian score; MCFS: unsupervised feature selection for multicluster data; NACRS: National Ambulatory Care Reporting System; PFA: principal feature analysis; PIN: Pharmaceutical Information Network.

Figure 4. SHAP values of the features selected by the CAEWW method across different data sets (20 most important features): (A) Discharge Abstract Database (DAD), (B) National Ambulatory Care Reporting System (NACRS), and (C) Pharmaceutical Information Network (PIN). CAEWW: concrete autoencoder with weight adjustment; SHAP: Shapley additive explanations.
Discussion

Principal Findings

The high dimensionality of ICD and ATC code databases necessitates dimensionality reduction before the data can be fed into machine learning models. Due to interpretability concerns in the health domain, selecting from the original features, rather than transforming them into new features, is an essential step in reducing dimensionality. In this study, we demonstrated that the CAE methods performed best at selecting the most informative ICD and ATC codes in an unsupervised setting. Using a clinical outcome as a case study, we also demonstrated that the ICD and ATC code features selected by the CAE methods predicted the outcome variable more accurately than those selected by the other methods in the study, even though they were derived in an unsupervised setting in the absence of the target variable. This indicates that the selected features can be considered unbiased toward a specific target variable and characterize the underlying phenomenon appropriately. We also showed that the AEFS and PFA methods did not select high-quality features in our data set and were not suitable for either the task of reconstructing the feature space or that of predicting 90-day mortality. The LS and MCFS methods, however, performed better on both tasks (slightly below the CAE methods). Furthermore, the features selected by the CAE methods (especially CAEWW) were generally higher-level codes (ie, lower depth in the hierarchical structure), which supports more generalizable solutions.

It is worth mentioning that our methodology code is publicly shared, allowing other researchers to use the desired methods for selecting the most informative features within cohorts with large ICD, ATC, or other hierarchical-coded health databases [].

Computational Cost

The MCFS, PFA, and LS methods involved multiple special matrix operations that made them computationally expensive. Given our large-scale, high-dimensional data set, it was not possible to run these algorithms on a personal computer, and we had to optimize the operations for an advanced computing cluster with 40 Intel Xeon Gold 6342 2.80 GHz CPUs and 2048 GB RAM. The MCFS and LS feature selection experiments shared some operations and together took over 2 days to complete. The PFA method needed less than a day for the entire feature selection experiments. The AEFS and CAE methods, however, had the advantage of using GPUs and optimized deep learning libraries for training the neural networks and were faster: each of these algorithms took less than 4 hours on an Nvidia A100 80 GB GPU.

Selected Features

The Shapley analysis of the 20 most important features selected in each data set using the CAEWW method for predicting mortality revealed the multidimensional capabilities of this method in identifying relevant information. In the DAD and NACRS data sets, it selected disease codes relevant to mortality among patients with CAD. In both data sets, diseases related to cardiovascular conditions, hypertensive and circulatory disorders, metabolic disorders, renal failure [], and cancer [] were selected, all of which are important factors in the outcomes of patients with CAD. Furthermore, the DAD-based features included accidents, arthropathies, and hospitalization-specific conditions, whereas the NACRS data set yielded codes for falls []; digestive disorders []; and the rehabilitation, management, or complexity of diseases. In the PIN data set, direct interventions for CAD and related risk factors were mainly selected, including high-ceiling diuretics, statins, ACE inhibitors, angiotensin II receptor blockers (plain and combinations), direct factor Xa inhibitors, vasodilators, antianemic preparations, antithrombotic agents, and other lipid-modifying agents, addressing heart failure, cholesterol management, blood pressure control, anticoagulation, anemia, and blood flow. Drugs related to diseases or conditions accompanying CAD were also selected: gastrointestinal issues (pantoprazole and general drugs for acid-related disorders [,] and drugs for constipation []), pain management (opioids, other analgesics, and antipyretics) [], inflammatory conditions and immune responses (anti-inflammatory and antirheumatic products and corticosteroids for systemic use) [,], mental and behavioral health (antipsychotics) [], respiratory conditions (adrenergic inhalants []), and urological issues [].

Previous studies have typically selected codes as machine learning model features based on expert opinion, the presence of high-level codes (eg, categories or chapters), or a combination of both [,]. To the best of our knowledge, only 1 study [] has attempted to offer a sophisticated feature selection method for ICD code data sets, using tree-lasso regularization, but it operated in a supervised setting that required an outcome variable. Our study provides a general tool for health researchers to select the most informative ICD and ATC codes without biasing the study toward a specific outcome variable. We also introduced a unique target weight adjustment function for the CAE model that guides it to select higher levels of the ICD hierarchy compared with the model without adjustment.

Limitations and Future Work

One of the limitations of this study was the inability of the CAE method to select an exact number of desired features. Since the neurons in the concrete selector layer work independently, duplicate features may be selected, so the number of final selected features can be fewer than the desired number. Although this indicates that the decoder model is still capable of reconstructing the initial feature space with a smaller number of features, some researchers may prefer to obtain exactly the number of features they desire for their models. One previous study [] used a special regularization term in the training step to discourage the model from selecting duplicate features. This method could be investigated for ICD and ATC codes in the future.

The aggregation of codes should be viewed as a trade-off in this study. We needed to select a reasonable aggregation period that covers both long-term and short-term diseases. A shorter period could skew the results by including multiple correlated records from the same patient. Conversely, longer periods could weigh short-term diseases equally with long-term ones, and the codes of the patients with fewer records (eg, recent patients in the cohort) would have a lower chance of selection.

Another limitation was that we only used 3 data sets of a specific disease cohort to choose the features. Therefore, the selected features in this study may not generalize to other patient cohorts or diseases. Furthermore, we selected the 100 best features, but other data sets or patient cohorts may require a different number of features. Future studies may investigate the impact of the number of features on the results. Moreover, our hyperparameter analysis was conducted within a constrained scope due to limited computational resources. Future studies could further explore the impact of a broader range of hyperparameter values. We anticipate that the CAEs hold potential for this area due to their flexible neural network structure and optimized algorithms. A similar limitation also applies to the mortality prediction case study, where we only trained XGBoost and did not explore other model types.

Conclusions

In this study, we investigated 5 different methods for selecting the best features in ICD and ATC code data sets in an unsupervised setting. We demonstrated that the CAE method selects features that better represent the whole data set and can be useful in further machine learning studies. We also introduced a weight adjustment to the CAE method for ICD and ATC code data sets that can improve the generalizability and interpretability of the resulting models, given that it prioritizes selecting high-level definitions of diseases.

The CAEWW method outperformed all other methods in reconstructing the initial feature space across all data sets. We validated the selected features through a supervised learning task, predicting 90-day mortality after discharge using 3 distinct data sets. Features selected via the CAEWW method demonstrated significantly improved performance on this task, as evidenced by the DeLong and McNemar tests. Given the advantages of the CAE method, we recommend its use in the feature selection phase of EHD analysis with ICD or ATC codes.

This study was supported by the Libin Cardiovascular Institute PhD Graduate Scholarship, the Alberta Innovates Graduate Scholarship, and a Project Grant from the Canadian Institutes of Health Research (PJT 178027).

PG led the conceptualization of the study, designed the research methodology, executed the study, performed the data analysis, and drafted the original manuscript. JL provided the data set essential for the research, contributed to study supervision, and critically reviewed and revised the manuscript. Both authors read and approved the final manuscript.

JL is the chief technology officer, a co-founder, and a major shareholder of Symbiotic AI, Inc. The authors have no further conflicts of interest to declare.

Edited by Christian Lovis; submitted 19.09.23; peer-reviewed by Tarek Abd El-Hafeez, Tongnian Wang; final revised version received 06.06.24; accepted 08.06.24; published 26.07.24.

©Peyman Ghasemi, Joon Lee. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 26.7.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
