Modality redundancy for MRI-based glioblastoma segmentation

Data and preprocessing

Data from the original training set of the BraTS21 challenge [2, 7, 8] was used, comprising 1251 GBM subjects, each with four MRI scans (T1, T1CE, T2 and FLAIR) and expert-annotated ground-truth segmentations for the CE, NEC and ED regions. The data set was further split at random into a training (70%) and test (30%) set. Images in the BraTS data set are co-registered, resampled to an isotropic voxel spacing of 1 mm, and skull-stripped [24,25,26]. Additionally, image intensities were normalized using Z-scoring.
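For illustration, per-channel Z-scoring over non-background voxels can be performed with MONAI's NormalizeIntensity transform; the sketch below assumes a four-channel NumPy array at the native BraTS resolution and is not necessarily the exact preprocessing pipeline used here.

```python
import numpy as np
from monai.transforms import NormalizeIntensity

# nonzero=True estimates mean/std over non-background voxels only,
# which matters for skull-stripped images with large zero regions;
# channel_wise=True normalizes each modality independently.
z_score = NormalizeIntensity(nonzero=True, channel_wise=True)

# Illustrative input: four co-registered modalities (T1, T1CE, T2, FLAIR).
image = np.random.rand(4, 240, 240, 155).astype(np.float32)
normalized = z_score(image)
```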

Model training

Two 3D model architectures were included for comparison. The first is the full-resolution model as defined in the nnU-Net framework [27], considered the state of the art among CNNs for segmentation. Since transformers have recently gained popularity for segmentation tasks, a SwinUNETR [28] model was included as well. Both architectures were trained in a similar fashion on the training set for 1000 epochs. The patch and batch size were (128, 128, 128) and 4, respectively. A combined cross-entropy Dice loss was used as loss function. An initial learning rate of \(1\times 10^{-4}\) with a cosine annealing schedule was combined with an Adam optimizer and a weight decay of \(1\times 10^{-5}\). A dropout rate of 0.2 was used. Data augmentation included random flipping along all three axes, as well as random intensity scaling and shifting.
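A minimal sketch of this training setup is given below for the SwinUNETR case, using MONAI and PyTorch; the hyperparameters follow the text, while the loss configuration (one sigmoid output channel per label) is an illustrative assumption.

```python
import torch
from monai.networks.nets import SwinUNETR
from monai.losses import DiceCELoss

# SwinUNETR with four input modalities and three output labels,
# matching the (128, 128, 128) patch size and 0.2 dropout rate.
model = SwinUNETR(
    img_size=(128, 128, 128),
    in_channels=4,
    out_channels=3,
    drop_rate=0.2,
)

# Combined cross-entropy Dice loss; sigmoid=True assumes one binary
# channel per label (an assumption, not stated in the text).
loss_fn = DiceCELoss(sigmoid=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
```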

GBM segmentation is generally performed using a four-channel input (T1, T1CE, T2 and FLAIR). In this study, models were trained with varying input channels, covering all 15 non-empty combinations of the four input modalities. Combined with the two model architectures, this gave rise to a total of 30 segmentation models (Fig. 1).
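The count of 30 follows from enumerating all non-empty subsets of the four modalities (15) for each of the two architectures, as the sketch below illustrates.

```python
from itertools import combinations

MODALITIES = ("T1", "T1CE", "T2", "FLAIR")
ARCHITECTURES = ("nnU-Net", "SwinUNETR")

# All non-empty modality subsets: C(4,1) + C(4,2) + C(4,3) + C(4,4) = 15.
configs = [
    combo
    for r in range(1, len(MODALITIES) + 1)
    for combo in combinations(MODALITIES, r)
]

models = [(arch, combo) for arch in ARCHITECTURES for combo in configs]
print(len(configs), len(models))  # 15 30
```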

The resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation—Flanders (FWO) and the Flemish Government. All models were trained on an Nvidia A100 Ampere GPU with 40 GB RAM together with a 16-core AMD EPYC 7282 CPU. MONAI version 1.2.0 [29] was used for model implementations. The detailed implementation for training and all trained models developed in this work are made publicly available.

Model evaluation

Segmentation accuracy

Each model was evaluated in terms of segmentation performance on the test set through calculation of the Dice similarity coefficient between the predicted and ground-truth segmentation, according to formula 1.

$$\mathrm{Dice} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}\,,$$

(1)

where TP is the number of true-positive voxels, FP the number of false-positive voxels and FN the number of false-negative voxels.

The prediction consists of three labels: necrosis (NEC), contrast-enhancing tumor (CE) and edema (ED). Consequently, Dice scores were calculated over three regions: enhancing tumor (ET = CE), tumor core (TC = CE + NEC) and whole tumor (WT = CE + NEC + ED). An overall Dice score was obtained by averaging the Dice scores over these three regions.
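A sketch of this evaluation is shown below; the integer label encoding (1 = NEC, 2 = CE, 3 = ED) is a hypothetical choice for illustration.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks (formula 1)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return 2.0 * tp / (2.0 * tp + fp + fn)

# Hypothetical label encoding: 1 = NEC, 2 = CE, 3 = ED.
REGIONS = {"ET": [2], "TC": [1, 2], "WT": [1, 2, 3]}

def overall_dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Dice over the ET, TC and WT regions."""
    scores = [
        dice(np.isin(pred, labels), np.isin(gt, labels))
        for labels in REGIONS.values()
    ]
    return float(np.mean(scores))
```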

Subsequently, the segmentation accuracies of all configurations were compared to that of the full-input configuration. From this, we define the minimal-input configuration as the configuration with the fewest modalities that yields performance similar to the full-input configuration. Analogously, we define the optimal performance configuration as the configuration with the highest performance, favoring fewer modalities in case of a tie.

Segmentation uncertainty

The use of Monte Carlo dropout (MCDO) [30] allows the comparison of epistemic uncertainty between configurations, i.e., the uncertainty of the model itself, which can originate from its architecture and the representativeness of the training data set [31]. Since, for each of the tested architectures, our models differ only in the modalities used as training input, comparing these uncertainties makes it possible to assess how the presence or absence of particular input modalities influences the model's certainty about its prediction.

The MCDO method consists of using dropout during inference and producing multiple segmentation outputs for the same input. For each input, 30 samples were generated with a dropout rate of 0.3 for nnU-Net and 0.5 for SwinUNETR, as recommended in the literature [30, 32]. For each voxel, the standard deviation over the predicted samples serves as a measure of uncertainty, from which uncertainty maps can be generated.
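A sketch of this sampling procedure in PyTorch is given below; keeping the dropout layers in training mode while the rest of the network stays in eval mode is the standard way to realize MCDO, and the sigmoid activation is an illustrative assumption.

```python
import torch

def enable_dropout(model: torch.nn.Module) -> None:
    # Re-activate only the dropout layers; batch norm etc. stay in eval mode.
    for module in model.modules():
        if module.__class__.__name__.lower().startswith("dropout"):
            module.train()

@torch.no_grad()
def mcdo_samples(model, image, n_samples=30):
    model.eval()
    enable_dropout(model)
    # Stack n_samples stochastic forward passes: (n_samples, B, C, H, W, D).
    return torch.stack([torch.sigmoid(model(image)) for _ in range(n_samples)])

# Voxel-wise standard deviation over the samples yields an uncertainty map:
# uncertainty_map = mcdo_samples(model, image).std(dim=0)
```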

Table 1 Top-performing configurations (according to mean Dice) per number of input modalities

Table 2 Segmentation accuracy in comparison with the full-input counterpart

Table 3 Minimal-input (MI) and optimal performance (OP) configurations (according to mean Dice)

Following [33], a subject-level uncertainty score for a model's prediction was obtained by summing the standard deviations of all voxels and dividing by the tumor volume to compensate for varying tumor sizes:

$$\mathrm{Uncertainty} = \frac{1}{V}\sum _{i}\sqrt{\frac{1}{N}\sum _{n=1}^{N}\left(\hat{y}_{i,n} - \bar{y}_{i}\right)^{2}}\,,$$

(2)

where \(\hat{y}_{i,n}\) is the prediction for voxel i in prediction sample n, \(\bar{y}_{i}\) is the mean prediction for that voxel, N is the total number of samples and V is the tumor volume (i.e., the number of voxels containing tumor).
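A direct implementation of Eq. (2) could look as follows, assuming the MCDO samples are stacked along the first axis and a binary tumor mask defines the volume V (both hypothetical inputs).

```python
import numpy as np

def subject_uncertainty(samples: np.ndarray, tumor_mask: np.ndarray) -> float:
    """Subject-level uncertainty score (Eq. 2).

    samples: array of shape (N, ...) holding N MCDO prediction samples.
    tumor_mask: boolean array marking tumor voxels; its sum is V.
    """
    # Per-voxel standard deviation; ddof=0 matches the 1/N in Eq. (2).
    voxel_std = samples.std(axis=0)
    return float(voxel_std.sum() / tumor_mask.sum())
```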

Comparison of the configurations to the full-input configuration in terms of uncertainty then allows the determination of the minimal-input configuration.

Performance ranking

To evaluate the different model configurations simultaneously in terms of segmentation accuracy and uncertainty, ranks were assigned to each model configuration per test subject. The higher the overall Dice score, the lower (i.e., the better) the rank for segmentation accuracy; the lower the uncertainty score, the lower (i.e., the better) the rank for uncertainty. Afterward, for each model, the ranks were averaged over all subjects.
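A sketch of this ranking scheme, assuming per-subject score matrices of shape (n_subjects, n_configs) as hypothetical inputs:

```python
import numpy as np
from scipy.stats import rankdata

def mean_ranks(dice_scores: np.ndarray, uncertainty_scores: np.ndarray):
    # Rank per subject (row); rank 1 is best, ties get average ranks.
    # Negating the Dice scores maps higher Dice to a lower (better) rank.
    dice_ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, dice_scores)
    unc_ranks = np.apply_along_axis(rankdata, 1, uncertainty_scores)
    # Average the per-subject ranks over all subjects, per configuration.
    return dice_ranks.mean(axis=0), unc_ranks.mean(axis=0)
```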

Statistical analysis

To assess statistical significance, a Wilcoxon signed-rank test with \(\alpha =0.05\) was conducted when comparing models.
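With SciPy, such a paired comparison can be sketched as follows; the per-subject score arrays are hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired per-subject Dice scores for two configurations.
rng = np.random.default_rng(0)
dice_full = rng.uniform(0.70, 0.95, size=100)
dice_reduced = dice_full - rng.normal(0.01, 0.02, size=100)

stat, p_value = wilcoxon(dice_full, dice_reduced)
print(f"p = {p_value:.4f}; significant at alpha = 0.05: {p_value < 0.05}")
```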
