An overview of the proposed method is shown in Fig. 2. The chest X-ray image is given to the RAS immediately after acquisition, and the RAS outputs an assessment of whether retaking is required. The assessment result is provided to the operator as a second opinion, after which the operator makes the final decision.
Fig. 2 Overview of proposed flow
For the CNN to assess whether retaking is necessary, it must be trained using input data and corresponding labels. In this study, the input data were chest X-ray images, and the labels indicated whether retaking was required. However, a chest X-ray image alone cannot be used to determine whether inspiration is sufficient (retaking unnecessary) or insufficient (retaking required). Therefore, we generated input images and labels from dynamic digital radiography (DDR) and used them to train and validate a CNN [11, 12]. In addition, we verified the effectiveness of the proposed method using actual chest X-ray images. The details of this process are provided below.
2.2 Dataset

2.2.1 Dynamic digital radiographs (DDR)

This study included dynamic digital radiographs of 80 cases examined at Shinshu University Hospital between October 14, 2020, and April 18, 2022. Of these, 18 cases remained after excluding cases imaged after surgery (such as lung resection) and cases in which drains were inserted. DDR was conducted using an Aero DR fine (KONICA MINOLTA, Tokyo, Japan) flat panel detector and a UD150B-40 (SHIMADZU CORPORATION, Kyoto, Japan) X-ray system. The images were automatically analyzed using the workstation KINOSIS (KONICA MINOLTA, Tokyo, Japan). The matrix size was 1062 × 1062 pixels.
Figure 3 shows the flow of preparing the training data from the DDR. During DDR acquisition, the patient was instructed by an automatic voice to breathe in the following order: maximum inspiration → maximum expiration → maximum inspiration. Images were taken continuously at 15 frames per second, yielding 300 frames per examination. The acquired images were provided to the workstation, which automatically detected the upper and lower edges of the lung field in each frame and output the position data of these two points. The relative distance between the two points was calculated using Eq. (1), and the calculated value was defined as the inspiration rate for each frame. In addition, the images were converted to PNG format with a matrix size of 224 × 224 pixels by bicubic interpolation.
$$\mathrm{Inspiration\ rate}\left( i \right) = \frac{D\left( i \right) - D_{\mathrm{min}}}{D_{\mathrm{max}} - D_{\mathrm{min}}} \times 100$$
(1)
where \(D(i)\) is the distance \(D\) between the upper and lower ends of the lung field in frame \(i\), \(D_{\mathrm{max}}\) is the maximum value of \(D\) during the examination, and \(D_{\mathrm{min}}\) is the minimum value.
Fig. 3 The flow of training data preparation
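As a concrete illustration of Eq. (1), the following minimal Python sketch computes the per-frame inspiration rate from a sequence of lung-field distances. The input array is a hypothetical stand-in for the workstation's output; the function and variable names are ours.

```python
import numpy as np

def inspiration_rate(distances):
    """Per-frame inspiration rate (%) per Eq. (1).

    distances: 1-D array of D(i), the upper-to-lower lung-field distance
    in each frame (hypothetical input; in the study it comes from the
    KINOSIS workstation).
    """
    d = np.asarray(distances, dtype=float)
    d_min, d_max = d.min(), d.max()
    return (d - d_min) / (d_max - d_min) * 100.0

# Example: a toy breathing cycle (max inspiration -> expiration -> inspiration)
frames = np.concatenate([np.linspace(30, 20, 150), np.linspace(20, 30, 150)])
rates = inspiration_rate(frames)   # rates[0] == 100.0, rates[149] == 0.0
```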
Visual evaluation was performed to set the retaking threshold based on the inspiration rate. For this evaluation, we used a RadiForce RX360 (EIZO Corporation, Ishikawa, Japan), a 3-megapixel (1536 × 2048 pixels) liquid-crystal display (LCD). The LCD was calibrated based on the grayscale standard display function described in Digital Imaging and Communications in Medicine Part 14 [13], with a recommended luminance of 500 cd/m². The illuminance of the observation environment complied with JIS Z 9110 [14]. The observers were 10 radiological technologists (10.4 ± 6.6 years of clinical experience). Informed consent was obtained from all observers to participate in the study and to disclose the results of their visual evaluations. It was also agreed that the evaluation results would be withdrawn if a participant expressed a desire to leave the study. The observers who participated in this study routinely took chest X-ray images. We deliberately did not provide prior training so that the visual assessments would reflect clinical judgment as accurately as possible. In total, 108 images were used: six per case, at inspiratory rates of 0, 20, 40, 60, 80, and 100%. Regarding visual evaluation methods for creating training data, some reports reflect the results of one or two observers [15,16,17]. Other studies incorporated the evaluation of a third observer only when the results of two observers diverged [18], reached a final decision through consensus when the results of two observers contradicted each other [19], or used the majority of the evaluation results from three observers as the ground truth [4]. Based on these studies, we adopted a two-phase visual evaluation method to generate reliable training data.
In phase 1, all images were randomly sorted and displayed individually. Each radiological technologist conducted the evaluation independently, choosing between two options: “Inspiration is sufficient” or “Inspiration is insufficient.” The observation time was set at approximately 3 s per image to reflect the clinical workflow. Re-evaluating an image once it had been rated was prohibited, and each evaluation was based only on the single displayed image. The evaluation results were analyzed for each case and observer, and a threshold was obtained for each individual observer. An example is shown in Fig. 4. If the evaluation results switched at a certain inspiratory rate, that rate was set as the threshold (Observer 1). If the same evaluation result was obtained for all inspiratory rates, the threshold was 0% or 100% (Observer 2 or 3). If the evaluation results were mixed, they were excluded from the analysis (Observer 4). In addition, the mean and standard deviation (S.D.) of the individual observers' thresholds were calculated. If the change between images was small despite the different inspiratory rates, the evaluation results were likely to be “invalid.”
Fig. 4 Analysis method of visual evaluation results
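The per-observer analysis in Fig. 4 can be summarized as the following minimal sketch of our reading of the rule. The rating encoding (1 = “sufficient”, 0 = “insufficient”) and the convention of taking the first rate rated sufficient as the threshold are assumptions for illustration.

```python
def observer_threshold(ratings):
    """Derive one observer's threshold from ratings at the six rates.

    ratings: dict mapping inspiratory rate (%) to 1 ("sufficient") or
    0 ("insufficient"), e.g. {0: 0, 20: 0, 40: 1, 60: 1, 80: 1, 100: 1}.
    Returns the threshold in %, or None if the pattern is mixed ("invalid").
    """
    rates = sorted(ratings)                   # [0, 20, 40, 60, 80, 100]
    values = [ratings[r] for r in rates]
    if all(v == 1 for v in values):
        return 0                              # uniformly "sufficient" -> 0%
    if all(v == 0 for v in values):
        return 100                            # uniformly "insufficient" -> 100%
    if values == sorted(values):              # one clean 0 -> 1 switch
        return rates[values.index(1)]         # first "sufficient" rate (assumed)
    return None                               # mixed pattern -> invalid
```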
In phase 2, the consensus threshold was determined through discussion among the 10 radiological technologists, who reviewed the phase-1 results and the images of each case. The phase-1 results displayed were the mean and S.D. of the individual observers' thresholds and the number of invalid results. The images presented were the same as those in phase 1 (inspiratory rates of 0, 20, 40, 60, 80, and 100%). For comparison at a glance, the images were arranged in descending order of inspiratory rate for each case. The observers were the same as those in phase 1, and the time for discussion was unrestricted. The consensus threshold was selected from inspiratory rates of 0, 10, 30, 50, 70, 90, and 100%. Table 1 presents the results. The results of each observer in Table 1 (individual observer thresholds) are reference data, and the threshold in the far-right column (consensus threshold) was used as the retaking threshold for training.
Table 1 Results of visual evaluation

Frames with an inspiratory rate higher than the retaking threshold were labeled as a complete examination (“Complete”), whereas frames with an inspiratory rate equal to or lower than the retaking threshold were labeled as requiring retaking (“Retake”). The 300 frames obtained from one DDR examination were added to the CNN training data as 300 chest X-ray images.
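Labeling the 300 frames then reduces to a comparison against the case's consensus threshold; a minimal sketch (function and variable names are ours):

```python
def label_frames(inspiration_rates, retake_threshold):
    """Label each frame per the rule above: strictly above the retaking
    threshold -> "Complete"; at or below it -> "Retake".
    """
    return ["Complete" if r > retake_threshold else "Retake"
            for r in inspiration_rates]

labels = label_frames([100.0, 62.5, 30.0], retake_threshold=50)
# -> ['Complete', 'Complete', 'Retake']
```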
In addition, a questionnaire was conducted to clarify the areas that the observers paid attention to when checking the state of inhalation on chest X-ray images. The questionnaire was administered immediately after the visual assessment, and the response time was unrestricted. The options were: the position of the upper end of the diaphragm and the costophrenic angle, radiolucency within the lung fields, expansion of the thorax, overlap of the diaphragm and cardiac shadow, the cardiac shadow, the position of the clavicle, and others. Participants were allowed to select all applicable options. The results of the questionnaire are shown in Table 2.
Table 2 Results of the questionnaire regarding the areas to pay attention to when checking the inspiratory state

2.2.2 Chest X-ray images

The CNN was trained using DDR, whereas our proposed method targets actual chest X-ray images. Therefore, we conducted an additional validation to confirm whether the proposed method is useful for actual chest X-ray images. This study involved 95 chest X-ray cases in which patients were requested to have two images taken, one in the maximum inspiration state and one in the maximum expiration state, at Shinshu University Hospital between October 14, 2020, and December 1, 2023. Patients whose images were taken in a position other than standing and those who overlapped with the chest DDR cases were excluded. Consequently, 48 patients were included (96 images). These images were obtained using three different X-ray systems: 12 cases with a DigitalDiagnost C90 (Philips Healthcare, Cleveland, OH, USA); 35 cases with a UD150B-40 (SHIMADZU CORPORATION, Kyoto, Japan) X-ray generator and a CXDI-401C (CANON Medical Systems, Otawara, Japan) flat panel detector; and one case with a UD150B-40 (SHIMADZU CORPORATION, Kyoto, Japan) X-ray generator and an Aero DR fine (KONICA MINOLTA, Tokyo, Japan) flat panel detector. The maximum matrix size was 3320 × 3408 pixels.
Markers in the images were removed: the target area was specified, and the pixel values were replaced with zero using ImageJ [20]. Examples are shown in Fig. 5. Furthermore, the image format was changed to PNG, and the matrix size was converted to 224 × 224 pixels using bicubic interpolation.
Fig. 5 Example of removing the image marker
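Although we used ImageJ, the same preprocessing (zeroing a specified marker region, then bicubic resizing to 224 × 224 and saving as PNG) can be sketched in Python with Pillow and NumPy. The file names and marker coordinates below are placeholders.

```python
import numpy as np
from PIL import Image

# Load a chest X-ray (placeholder path) as an 8-bit grayscale array.
img = np.array(Image.open("chest_xray.png").convert("L"))

x0, y0, x1, y1 = 30, 30, 120, 80     # hypothetical marker bounding box
img[y0:y1, x0:x1] = 0                # replace marker pixels with zero

# Resize to 224 x 224 with bicubic interpolation and save as PNG.
out = Image.fromarray(img).resize((224, 224), resample=Image.BICUBIC)
out.save("chest_xray_224.png")
```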
2.3 Image classification

As mentioned previously, we used a CNN to assess whether retaking was necessary. Transfer learning was introduced to train the CNN. Transfer learning is a technique that uses weights learned for one task as the starting point for another task and has the advantage of training efficiently even with small datasets [21]. In this method, the weights trained on ImageNet [22] were used as the initial values, and the fully connected layers and beyond were additionally trained using the DDR data. Because an X-ray image is a one-channel grayscale image, whereas the CNN expects a color image as input, we assigned the same grayscale image to each of the three channels of a color image and fed it into the CNN. When a chest X-ray image is provided to the trained CNN, it outputs whether or not retaking is necessary.
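Feeding the one-channel image into a three-channel CNN input amounts to replicating the grayscale plane; a minimal NumPy sketch (names are ours):

```python
import numpy as np

def to_three_channels(gray):
    """Replicate a (H, W) grayscale image into a (H, W, 3) array so it
    matches the color-image input expected by the CNN."""
    return np.repeat(gray[..., np.newaxis], 3, axis=-1)

rgb = to_three_channels(np.zeros((224, 224), dtype=np.uint8))  # shape (224, 224, 3)
```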
2.4 Evaluation method

2.4.1 Cross-validation using DDR

To confirm the effectiveness of the proposed method, we conducted a six-fold cross-validation using DDR. Of the 18 dynamic digital radiographs, 13 were used as training data, two as validation data, and three as test data. The validation was conducted six times, rotating the cases so that each case served as test data once. As described in Sect. 2.2.1, one DDR examination consists of 300 frames, and each frame was used as a normal chest X-ray image for verification. In addition, the training data were augmented using image enlargement to account for differences in body size. Two enlargement ratios (× 1.045 and × 1.090) were applied to the “Complete” images to triple the amount of data, and three enlargement ratios (× 1.03, × 1.06, and × 1.09) were applied to the “Retake” images to quadruple it. Consequently, there were approximately 13,500 training images per fold, with an improved balance between the “Complete” and “Retake” categories.
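A minimal sketch of the enlargement-based augmentation, assuming a center crop after scaling so the output stays 224 × 224 (the exact cropping convention is not stated in the text and is our assumption; the path is a placeholder):

```python
from PIL import Image

def enlarge(img, ratio):
    """Scale a 224 x 224 image by `ratio`, then center-crop back to 224 x 224."""
    w, h = img.size
    big = img.resize((round(w * ratio), round(h * ratio)), resample=Image.BICUBIC)
    left = (big.width - w) // 2
    top = (big.height - h) // 2
    return big.crop((left, top, left + w, top + h))

complete_ratios = [1.045, 1.090]      # "Complete": original + 2 -> x3 data
retake_ratios = [1.03, 1.06, 1.09]    # "Retake": original + 3 -> x4 data

img = Image.open("chest_xray_224.png")
augmented = [img] + [enlarge(img, r) for r in complete_ratios]
```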
Seven well-known CNN architectures, namely VGG16, VGG19 [23], InceptionV3 [24], ResNet50 [25], DenseNet121, DenseNet169, and DenseNet201 [26], were adopted to compare processing accuracy and determine the most suitable model for the proposed method. To adapt each model to the target task, the fully connected layers were modified to assess whether retaking was necessary. The fully connected head had a three-layer structure with 2048, 512, and 1 units, and the sigmoid function was used in the final layer. The VGG16-based architecture is shown in Fig. 6. The batch size was set to 128, and Adam (learning rate (Lr) = 1e−4, 1e−5, and 1e−6) was used as the optimizer [27]. Binary cross-entropy was used as the loss function. Early stopping was adopted to avoid overfitting, with the maximum number of epochs set to 100: the validation loss was monitored, and training was stopped if no improvement was observed within three epochs. The hardware used for the calculations was an Intel Core i9-12900 CPU (Intel, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA), with TensorFlow (Google, Mountain View, CA, USA) and Keras software. Furthermore, heat maps were generated using Grad-CAM [28] to visualize the areas of the image on which the CNN focused for its assessment.
Fig. 6 CNN architecture based on VGG16
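A minimal Keras sketch of the VGG16-based configuration in Fig. 6, under our reading of the text: ImageNet weights as initial values, a frozen convolutional base (only the fully connected layers and beyond are trained), and a new 2048-512-1 head with a sigmoid output. The ReLU activations in the hidden layers are an assumption, as the text does not state them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # train only the new head

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(2048, activation="relu"),   # activation assumed
    layers.Dense(512, activation="relu"),    # activation assumed
    layers.Dense(1, activation="sigmoid"),   # "Retake" vs "Complete"
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # Lr in {1e-4, 1e-5, 1e-6}
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Early stopping: monitor validation loss, patience of three epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
# model.fit(train_x, train_y, validation_data=(val_x, val_y),
#           batch_size=128, epochs=100, callbacks=[early_stop])
```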
2.4.2 Verification using actual chest X-ray images

Verification was performed using actual chest X-ray images to confirm the usefulness of the proposed method. All DDR cases were used for training and were randomly divided into 15 cases (4500 images) for training and three cases (900 images) for validation. The test data consisted of the 96 actual chest X-ray images. The system was retrained using a larger dataset than that used in the cross-validation so that it could process the actual chest X-ray images.
The three architectures that demonstrated the highest accuracies in the verification described in Sect. 2.4.1 were adopted as the CNN models. The hardware, software, and parameters were the same as those described in the previous section. Heat maps were generated using Grad-CAM in the same manner.
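For the heat maps, a minimal Grad-CAM sketch following the original formulation [28]. It assumes a flat (non-nested) Keras model whose last convolutional layer can be looked up by name; "block5_conv3" is VGG16's last convolutional layer.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name="block5_conv3"):
    """Heat map of the regions that drove the model's sigmoid output."""
    grad_model = tf.keras.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, pred = grad_model(image[np.newaxis, ...])
        score = pred[:, 0]                           # single sigmoid unit
    grads = tape.gradient(score, conv_out)           # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))     # global-average-pooled grads
    cam = tf.reduce_sum(conv_out * weights[:, tf.newaxis, tf.newaxis, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                         # keep positive contributions
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # normalized to [0, 1]
```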