Multi-classification of breast cancer pathology images based on a two-stage hybrid network

Multiple classification results: 8-class and 4-class

Image-level classification results of the first-level network ResRFSAM obtained through voting strategies

To evaluate the performance of the first-level convolutional network ResRFSAM, three voting strategies were applied on the BreakHis (40×, 100×, 200×, 400×) and ICIAR2018 test subsets: probability summation (Sum), majority voting (Maj), and maximum probability (Max). The resulting image-level classification accuracies are shown in Table 1. For the eight-class breast cancer sub-type problem on the BreakHis (40×, 100×, 200×, 400×) datasets, the average image-level accuracies obtained by ResRFSAM at the four magnifications are 85.82%, 93.67%, 94.36%, and 87.93%, respectively. For the four-class problem on the ICIAR2018 (200×) dataset, the average image-level accuracy obtained by voting is 86.25%. Figure 2(a) shows how training and validation accuracy evolved while the first-level network was trained on the different datasets. The network converged after about 20 iterations; note that these curves show the prediction accuracy of ResRFSAM on image patches, not the image-level accuracy obtained by voting.
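The three voting strategies can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes each image yields a list of per-patch class-probability vectors, and all function names are hypothetical.

```python
# Image-level voting over per-patch class-probability vectors (illustrative).

def vote_sum(patch_probs):
    """Probability summation (Sum): sum per-class probabilities over all patches."""
    n_classes = len(patch_probs[0])
    totals = [sum(p[c] for p in patch_probs) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: totals[c])

def vote_majority(patch_probs):
    """Majority voting (Maj): each patch casts one vote for its argmax class."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in patch_probs]
    return max(set(votes), key=votes.count)

def vote_max(patch_probs):
    """Maximum probability (Max): the single most confident patch decides."""
    best = max(patch_probs, key=max)
    return max(range(len(best)), key=lambda c: best[c])

# Toy example: 3 patches, 4 classes.
probs = [[0.1, 0.6, 0.2, 0.1],
         [0.2, 0.5, 0.2, 0.1],
         [0.7, 0.1, 0.1, 0.1]]
```

Note that the strategies can disagree: here Sum and Maj favor class 1, while Max follows the single most confident patch, which votes for class 0.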

Table 1 Multi-classification accuracy of the ResRFSAM network and multi-classification results of the ensemble model ResRFSAM-LSTM

Image-level classification results of the ensemble model ResRFSAM-LSTM

The multi-classification results of the ensemble model on the BreakHis (40×, 100×, 200×, 400×) and ICIAR2018 test subsets are shown in Table 1. For the eight-class breast cancer sub-type problem on the BreakHis dataset, the ensemble model achieved test accuracies of 93.67%, 97.08%, 98.01%, and 94.73% at the four magnifications, respectively. For the four-class problem on the ICIAR2018 (200×) dataset, the ensemble model achieved an accuracy of 93.75%, with precision, recall, and F1-score all at 92.5%. Figure 2(a) also shows the training- and validation-accuracy curves of the second-level network on the different datasets; this network likewise converged after about 20 iterations. As before, the accuracy in the figure is computed over the original images together with their augmented deformations, rather than from the final predictions on the original images alone.
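The image-level metrics reported above can be computed as sketched below. This assumes macro averaging over classes (one common convention; the paper does not state which averaging it uses), and the label lists are toy values.

```python
# Accuracy and macro-averaged precision/recall/F1 from image-level labels
# (illustrative sketch; averaging convention is an assumption).

def macro_metrics(y_true, y_pred, n_classes):
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls = [], []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / n_classes
    recall = sum(recalls) / n_classes
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, precision, recall, f1

# Toy example with 2 classes and 4 images.
acc, prec, rec, f1 = macro_metrics([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```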

Fig. 2

Accuracy curves and classification confusion matrices of the model. a Accuracy of the training and validation sets during network training. b Multi-classification confusion matrices of the ensemble model on different datasets

Furthermore, to better understand the classification performance of the ensemble model, confusion matrices were plotted for the five test subsets, as shown in Fig. 2(b): the 8-class tasks at the four magnifications of the BreakHis dataset and the 4-class task of the ICIAR2018 test set. In the BreakHis dataset, the highly similar morphologies of papillary carcinoma (PC) and mucinous carcinoma (MC) lead to confusion between these classes and degrade model performance. Similarly, in the ICIAR2018 dataset, the high similarity between the benign (B) and normal (N) categories causes confusion and also lowers performance. The ROC curves of the model on each test subset are shown in Fig. 3(a).
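Confusion matrices like those in Fig. 2(b) are built by tallying true-versus-predicted class pairs; a minimal sketch with toy labels (not the paper's data) follows.

```python
# Build a confusion matrix from image-level labels (illustrative sketch).

def confusion_matrix(y_true, y_pred, n_classes):
    # rows = true class, columns = predicted class
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# Confusion between two similar classes (e.g. PC vs MC, or B vs N) shows up
# as large off-diagonal entries in the corresponding rows.
cm = confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], n_classes=2)
```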

Comparison of first-level network voting results with ensemble model classification results

A bar chart was used to compare the image-level accuracy obtained by voting on the ResRFSAM network with the accuracy predicted by the ResRFSAM-LSTM ensemble model on each dataset. As shown in Fig. 3(b), for both multi-classification tasks of breast cancer pathological sub-types, the ensemble model was significantly more accurate than first-level voting. This is consistent with clinical diagnostic experience: the differences between benign and malignant sub-types of breast cancer are subtle and breast tumors are heterogeneous, so local information alone cannot ensure an accurate diagnosis; the organizational and spatial structure of the whole image must also be considered. These comparisons partially demonstrate the rationality and effectiveness of the proposed model design. They indicate that the second-level network, building on the local patch features extracted by the first level, captures features more critical for classification: higher-dimensional features learned through further iterative training, and the contextual and spatial relationships between patches of the same image captured by the bidirectional LSTM. Together, these features encode global information about the pathological image.
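The idea that the second stage sees context the patch-level stage cannot is illustrated below. Patch feature vectors from one image are ordered as a sequence, and a bidirectional pass summarizes each position using both its left and right neighbors. A simple running mean stands in for the paper's bidirectional LSTM cells, and scalar "features" stand in for real feature vectors; all names here are hypothetical.

```python
# Bidirectional context over a patch sequence (running mean as a toy stand-in
# for the paper's bidirectional LSTM; features are scalars for simplicity).

def running_means(seq):
    """Left-to-right cumulative means: position i summarizes patches 0..i."""
    out, total = [], 0.0
    for i, x in enumerate(seq, start=1):
        total += x
        out.append(total / i)
    return out

def bidirectional_context(patch_feats):
    fwd = running_means(patch_feats)                # forward summary
    bwd = running_means(patch_feats[::-1])[::-1]    # backward summary
    # Each position pairs its forward and backward summaries, analogous to
    # concatenating the two directions' hidden states in a bidirectional LSTM.
    return list(zip(fwd, bwd))

ctx = bidirectional_context([1.0, 2.0, 3.0])
```

Unlike per-patch voting, each position's representation here depends on the whole sequence, which is the property the second-level network exploits.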

Fig. 3

ROC curves and accuracy on different datasets. a ROC curves on different datasets. b Comparison of classification accuracy between first-level network voting and the ensemble model on different datasets

Additionally, in the 8-class task, both the first-level network and the ensemble model achieved their best results on the BreakHis 200× test set, with slightly worse results at the other magnifications. This reflects the nature of the data at different magnifications: subtle features such as cell nuclei are not prominent at lower magnifications, while information about spatial structure and the distribution of surrounding tissue is limited at higher magnifications. In multi-classification tasks in particular, magnification significantly affects the diagnosis, and pathologists likewise consider tissue structure at multiple magnifications when reading pathological images.

Contrast experiments

Exploring the optimal parameter combination for feature fusion in the fully connected layers

To explore the optimal parameter combination (θ1 and θ2, shown in Fig. 1(b)) for feature fusion in the fully connected layers, we conducted comparative experiments on the ICIAR2018 dataset. The network structure of the first-level ResRFSAM was left unchanged; only θ1 and θ2 were varied across training runs and the results compared. Supplement Table 3 records the classification accuracy of the model on the test subset under the different values of θ1 and θ2, including both the first-level voting results and the ensemble-model predictions.

As shown in Supplement Table 3, when θ1 and θ2 take the values (0.5, 0.5), the average accuracy from first-level voting is 88.33%, slightly higher than for (0.4, 0.6) (86.25%); however, the corresponding ensemble-model accuracies are 91.25% and 93.75%, so the (0.4, 0.6) setting is slightly better at the ensemble level. Likewise, at (0.8, 0.2) the first-level voting accuracy is 83.3%, markedly lower than under most other combinations in the table, yet the corresponding ensemble accuracy reaches 92.5%, only slightly below the best result of 93.75% (obtained at 0.4 and 0.6).

Evidently, optimizing the first-level network alone is not enough; both levels must be optimized jointly to improve the ensemble model's classification performance. The patch-trained first-level network should not only extract fine local patch features but also extract and retain the patches' global features and spatial structure, providing the second-level bidirectional LSTM with sufficient usable features to extract the global features of the image and perform the final prediction and classification. Accordingly, all experiments were conducted with the optimal combination θ1 = 0.4 and θ2 = 0.6.
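The role of θ1 and θ2 can be sketched as a weighted combination of two fully-connected feature vectors. The exact fusion operation is defined in Fig. 1(b) of the paper; a weighted sum is assumed here as one common scheme, and the input vectors are illustrative.

```python
# Weighted fusion of two fully-connected feature vectors (assumed form;
# the paper's Fig. 1(b) defines the actual operation).

THETA1, THETA2 = 0.4, 0.6  # best combination found in Supplement Table 3

def fuse(fc_a, fc_b, t1=THETA1, t2=THETA2):
    assert len(fc_a) == len(fc_b), "fused branches must have equal width"
    return [t1 * a + t2 * b for a, b in zip(fc_a, fc_b)]

fused = fuse([1.0, 0.0], [0.0, 1.0])
```

With this form, (0.8, 0.2) would weight the first branch heavily, which matches the observation above that no single-branch-dominant setting was optimal for the ensemble.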

Exploring the impact of the first level network structure on model performance

To verify the effectiveness of the first-level deep network ResRFSAM for multi-classification tasks, we compared model performance when the first-level network was RFSAM-Net versus the ResRFSAM network, including the image-level classification accuracy obtained by first-level voting and the ensemble-model classification accuracy.

Table 2 Comparison of classification accuracy of models using the RFSAM-Net or ResRFSAM network as the first-level network on the BreakHis (40×, 100×, 200×, 400×) dataset

As shown in Table 2, image-level classification accuracy on the BreakHis dataset was compared for the two network architectures. By voting, RFSAM-Net achieved its highest image-level accuracies of 85.82%, 91.48%, 93.03%, and 88.64% on the 40×, 100×, 200×, and 400× subsets, respectively, while the ResRFSAM network achieved 87.59%, 94.16%, 97.01%, and 91.97%. ResRFSAM therefore outperformed RFSAM-Net at every magnification on the BreakHis dataset. For the corresponding ensemble models, RFSAM-LSTM and ResRFSAM-LSTM, the latter's classification accuracy again exceeded the former's, as shown in Table 2. Additionally, as shown in Table 3, on the 4-class ICIAR2018 task the ResRFSAM network outperformed RFSAM-Net both in image-level accuracy obtained by voting and in ensemble-model accuracy.

Table 3 Comparison of classification accuracy of models using the RFSAM-Net or ResRFSAM network as the first-level network on the ICIAR2018 (200×) dataset

Although RFSAM-Net performs worse than the ResRFSAM network in the multi-classification of breast cancer pathological images, it achieved better performance in the binary classification task. Moreover, as shown in Supplement Table 4, the parameter count and computational complexity of the ResRFSAM network are much higher than those of RFSAM-Net. Because pathological image structures are complex and the differences between sub-types are subtle, a lightweight model cannot extract sufficiently high-dimensional features or focus on the fine distinctions between sub-type images; benign and malignant sub-types are then easily confused, particularly mucinous carcinoma versus papillary carcinoma and the structurally similar benign and normal categories. Compared with the lightweight RFSAM-Net, the ResRFSAM network effectively extracts higher-dimensional, complex features by combining the ResNet34 convolutional backbone with RFSAM convolutional modules, and fully exploits these features through feature fusion in the fully connected layers, capturing subtle inter-sub-type differences such as cell-nucleus morphology and tissue spatial structure. It is therefore better suited to accurate sub-type classification of breast pathological images.

Comparison with other existing methods

To verify the effectiveness of the proposed ResRFSAM-LSTM model in the multi-classification of breast cancer pathological images, we compared its classification accuracy with that of other existing models on the BreakHis and ICIAR2018 datasets. The comparison covers both the image-level accuracy obtained by the first-level ResRFSAM network with the proposed voting strategy and the classification accuracy of the ensemble model.

As shown in Table 4, the proposed model and other existing models are compared on the classification accuracy of the 8 sub-types in the BreakHis dataset. The image-level accuracy obtained by first-level ResRFSAM voting is somewhat lower, but still exceeds the accuracy reported by the IDSNet model on the 100×, 200×, and 400× subsets, which to some extent confirms the effectiveness of the ResRFSAM network for patch feature extraction. As for the proposed ensemble model (ResRFSAM-LSTM), it performs better than the other existing models on the BreakHis 200× dataset, achieving the highest classification accuracy (98.01%). Its accuracy on the 100× dataset (97.08%) is slightly below the existing best (97.42%), and its accuracies on the 40× and 400× datasets are 93.67% and 94.73%, respectively, still better than most models.

Table 4 Comparison of classification accuracy between the model and other existing models for eight categories of BreakHis datasets

As shown in Table 5, the four-sub-type classification accuracy on the ICIAR2018 dataset was compared between this paper's model and other existing models. The image-level accuracy obtained by first-level ResRFSAM voting (83.75%) was superior only to the reported accuracy of 77.8% in the reference and had no advantage over the other deep network models. Note, however, that after training, the first-level ResRFSAM is mainly responsible for patch feature extraction rather than predictive classification; thus patch selection was performed in the training phase, but no additional patch screening was applied during testing. The images in the ICIAR2018 dataset have higher resolution, and patches extracted by a sliding window do not necessarily contain relevant information such as cell nuclei; some patches contain only irrelevant components such as cytoplasm, which interferes with the voting results. The ensemble model achieved a classification accuracy of 93.75%, superior to the best existing models.

Table 5 Comparison of classification accuracy between the model and other existing models for four categories of the ICIAR2018 dataset

Ablation experiments

We explored the impact of the three fully connected layers (3FC) and the RFSAM module on the performance of the proposed first-level ResRFSAM network through ablation experiments. Table 6 shows the classification results of the network under the different configurations, i.e., with and without 3FC and RFSAM: “ResNet34” is the original ResNet34 structure without 3FC or RFSAM; “ResNet34 + RFSAM” adds only the RFSAM module; “ResNet34 + 3FC” adds only the 3FC structure; and “ResNet34 + RFSAM + 3FC” is the full first-level model ResRFSAM of this paper, which uses both on top of the original ResNet34. The experiments were carried out on the ICIAR2018 dataset, with the second-level network of the ensemble model kept fixed as a two-layer bidirectional LSTM.

Table 6 Comparison of classification results on the ICIAR2018 dataset using models without the three-layer full connection (3FC) or RFSAM

As shown in Table 6, although the ResNet34 + 3FC structure yields only slightly better image-level voting accuracy than plain ResNet34 (average 78.33% versus 77.08%), the corresponding ensemble-model accuracy improves more clearly, from 80.0% to 82.5%, a gain of 2.5 percentage points. These results indicate that adding the 3FC structure and fusing the features extracted from the fully connected layers enhances the representational ability of the ensemble model, optimizes the network's weight distribution, and improves performance.

Furthermore, we compared the effect of the RFSAM module within the ResNet34 network. As shown in Table 6, relative to the model without RFSAM, the voting accuracy and the ensemble-model accuracy improved by about 6 and 7 percentage points, respectively: ResNet34 + RFSAM versus plain ResNet34 reached average voting accuracies of 83.75% and 77.08%, with corresponding ensemble accuracies of 87.5% and 80%. Whether measured by image-level voting accuracy or by the ensemble model's results, the model with the RFSAM module performs significantly better, further demonstrating the module's contribution to network performance.

The model whose first-level network is ResNet34 + RFSAM + 3FC (i.e., ResRFSAM) significantly outperforms the other models in Table 6: both its voting accuracy and the corresponding ensemble-model accuracy are the best, at 83.75% and 93.75%, respectively. This indicates that fully-connected-layer feature fusion and the RFSAM module together are more conducive to enhancing the model's feature extraction ability. When training on patches, this design not only sharpens the first-level network's extraction of subtle local features but also extracts and retains the patches' global features and spatial structure, providing better conditions for the second-level bidirectional LSTM to extract the global features of the image and predict its classification.
