Automated Grading of Vesicoureteral Reflux (VUR) Using a Dual-Stream CNN Model with Deep Supervision

Dataset Characteristics

We retrospectively collected 1529 pediatric cases, each associated with one voiding cystourethrogram (VCUG) image. The dataset was divided into two subsets: 1229 images for training and 300 images for testing. The detailed label distribution is presented in Table 1. The ground truth for VUR grading was established according to the International Reflux Society classification criteria [12], with two expert pediatric urologists determining the final labels for the dataset.

Table 1 Dataset distribution

Data Processing

All VCUG images were initially cropped to focus on the bladder and ureteral regions, resulting in a final image resolution of 768 × 768 pixels. To enhance image contrast, histogram equalization was applied individually to the red, green, and blue channels of each image. To increase the diversity of the training set and improve model robustness, data augmentation was employed through random flipping with a probability of 50%. This straightforward yet effective method improved the model’s generalization by introducing variations in image orientation.
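As a rough illustration, the preprocessing described above (per-channel histogram equalization followed by a 50% random flip) could be sketched as follows; the function names and NumPy-based implementation are our own, not the authors’:

```python
import numpy as np

def equalize_channel(channel):
    # Histogram equalization for one 8-bit channel (assumes the channel
    # is not constant, so the CDF spans more than one value).
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255),
                  0, 255).astype(np.uint8)
    return lut[channel]

def preprocess(image, rng):
    # image: H x W x 3 uint8 array, already cropped to the bladder and
    # ureteral regions. Equalize each RGB channel independently, then
    # apply a random horizontal flip with probability 0.5.
    out = np.stack([equalize_channel(image[..., c]) for c in range(3)],
                   axis=-1)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    return out
```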

Methods

Our approach consisted of a multi-stage process to comprehensively assess VUR and its clinical implications. The first stage involved a binary classification task, distinguishing between the absence of VUR (grade 0) and the presence of VUR. This step provided an initial indication of whether the patient exhibited VUR.

In the second stage, a more refined three-way classification was performed. This classification grouped VUR grades into three categories: no reflux (grade 0), mild to moderate reflux (grades 1–3), and severe reflux (grades 4–5), where the latter often indicates the need for surgical intervention. This refined classification aimed to better guide clinical decision-making by identifying patients who may benefit from surgical treatment.
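The two-stage label grouping described above can be written down directly; the mapping follows from the text, though the function names are ours:

```python
def stage1_label(grade):
    # Stage 1 (binary): 0 = no VUR (grade 0), 1 = VUR present (grades 1-5).
    return 0 if grade == 0 else 1

def stage2_label(grade):
    # Stage 2 (three-way): 0 = no reflux (grade 0),
    # 1 = mild to moderate reflux (grades 1-3),
    # 2 = severe reflux (grades 4-5, often surgical candidates).
    if grade == 0:
        return 0
    return 1 if grade <= 3 else 2
```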

Both classification tasks were addressed using our proposed multi-head model, which independently analyzed the left and right urinary tracts. This multi-head architecture enhanced the model’s ability to capture subtle differences in VUR patterns and ensured accurate grading.

Model Details

As shown in Fig. 1, the proposed model features multi-head processing of the left and right bladder images using a dual-stream architecture built on a modified ResNet-50 backbone [13], combined with deep supervision to improve feature learning.

Fig. 1 Structure of proposed model

The base of our model consists of the first three blocks of a pre-trained ResNet-50 [13], which processes the input images and extracts feature maps through several convolutional and pooling layers. To separately process the left and right bladder images, we duplicated the fourth block of ResNet-50, creating two parallel processing streams. One stream is dedicated to extracting features from the left bladder images, while the other focuses on the right bladder images, allowing the model to learn distinct features from each side.

To further enhance feature learning, the model incorporates a deep supervision mechanism. Before processing the final block, feature maps are extracted from the third block of ResNet-50 [13]. These intermediate features are flattened and passed through a fully connected layer, which provides auxiliary outputs. The deep supervision mechanism enables the model to learn more discriminative features early in the process, potentially improving final classification accuracy.

After the parallel processing streams, we apply an average pooling layer to reduce spatial dimensions and aggregate the feature maps. The pooled features from each stream are then flattened and passed through a multi-layer perceptron (MLP); each stream has its own MLP to map the high-dimensional features to the desired number of output labels.

The loss function used in the model combines the primary classification loss with the deep supervision loss [14]. The primary classification loss is computed using a focal loss [15], applied to the predicted grades for the left and right bladder images to address class imbalance and emphasize harder-to-classify examples during training, as shown in the following equation:

$$\mathcal{L}_{\text{cls}}=-\sum_{c}\alpha_{c}\,y_{\text{true},c}\left(1-p_{\text{pred},c}\right)^{\gamma }\log \left(p_{\text{pred},c}\right)$$

(1)

where $y_{\text{true},c}$ represents the true label for class $c$, $p_{\text{pred},c}$ is the predicted probability for class $c$, $\alpha_{c}$ is a balancing factor for the positive or negative class $c$, and $\gamma$ is a tunable focusing parameter that controls the down-weighting of well-classified examples.
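Equation (1) corresponds, up to the per-class weights, to the standard focal loss. A minimal PyTorch sketch, using a scalar balancing factor rather than per-class $\alpha_{c}$ and the common defaults $\alpha = 0.25$, $\gamma = 2$ (not values reported by the authors), could look like:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Multi-class focal loss: down-weights well-classified examples.
    # Scalar alpha stands in for the per-class alpha_c of Eq. (1).
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log p_pred,c
    p_t = torch.exp(-ce)          # predicted probability of the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```

Because the $(1 - p_t)^{\gamma}$ term vanishes as $p_t \to 1$, confidently correct predictions contribute almost nothing, which is what shifts training emphasis toward harder examples.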

To incorporate the deep supervision mechanism, we compute the cross-entropy loss on the intermediate features extracted from the third block of ResNet-50. These intermediate losses, derived from the deep supervision fully connected layer, provide additional guidance to the model during training, as shown in the following equation:

$$\mathcal{L}_{\text{ds}}=-\sum_{c}y_{\text{true},c}\log \left(p_{\text{pred},c}\right)$$

(2)

where $y_{\text{true},c}$ represents the true label for class $c$ and $p_{\text{pred},c}$ is the predicted probability for class $c$.

The total loss is the weighted sum of the primary classification loss and the deep supervision loss, ensuring a balanced contribution from both components. The total loss function is defined as follows:

$$\mathcal{L}_{\text{total}}=\alpha \mathcal{L}_{\text{cls}}+\left(1-\alpha \right)\mathcal{L}_{\text{ds}}$$

(3)

where $\alpha$ is a weighting factor (distinct from the per-class balancing factor $\alpha_{c}$ in Eq. (1)) that balances the contributions of the classification loss and the deep supervision loss.
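Putting Eqs. (1)–(3) together, the total training loss can be sketched as below. The weighting $\alpha = 0.7$ and focusing parameter $\gamma = 2$ are illustrative assumptions, since the paper's values are not quoted here, and the focal term again uses a scalar balancing factor:

```python
import torch
import torch.nn.functional as F

def total_loss(left_logits, right_logits, aux_logits,
               y_left, y_right, y_aux, alpha=0.7, gamma=2.0):
    # L_total = alpha * L_cls + (1 - alpha) * L_ds, where L_cls is the
    # focal loss summed over the left/right heads and L_ds is the
    # cross-entropy on the deep-supervision head.
    def focal(logits, y):
        ce = F.cross_entropy(logits, y, reduction="none")
        return (((1.0 - torch.exp(-ce)) ** gamma) * ce).mean()
    l_cls = focal(left_logits, y_left) + focal(right_logits, y_right)
    l_ds = F.cross_entropy(aux_logits, y_aux)
    return alpha * l_cls + (1.0 - alpha) * l_ds
```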

Implementation Details

Training was conducted using the AdamW optimizer with an initial learning rate of 0.00002 and a weight decay of 0.01 over 24 epochs. The batch size was set to 16 per device, and mixed-precision training (fp16) was employed to enhance computational efficiency. The data loader utilized four worker threads to expedite data loading. Checkpoints were saved at the end of each epoch, retaining only the most recent one to conserve storage space.
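The reported optimizer settings translate directly into PyTorch; the stand-in model below is only there to give the optimizer parameters, and the `GradScaler` enables fp16 mixed precision when a GPU is available:

```python
import torch

# Settings from the paper: AdamW, lr = 2e-5, weight decay = 0.01,
# 24 epochs, batch size 16 per device, fp16 mixed precision.
num_epochs = 24
batch_size = 16

model = torch.nn.Linear(2048, 3)  # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5,
                              weight_decay=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
```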

All code was implemented in Python and PyTorch. The experiments were performed on a workstation equipped with four NVIDIA TITAN RTX GPUs (24 GB GPU memory each), 256 GB of RAM, and an Intel Xeon Gold 6248 CPU at 2.50 GHz, running Ubuntu 16.04.
