Figure 1 shows the trocar placement for a laparoscopic cholecystectomy. The camera trocar is usually placed in the median umbilical region. A trocar in the lower right abdomen and a second trocar placed paramedian on the right side in the epigastrium serve as the surgeon's working trocars and facilitate triangulation during the procedure. A palpation probe is introduced through an additional trocar positioned along the right subcostal medioclavicular line. This placement allows a lever effect against the patient's thorax, effectively keeping the liver lobe out of the field of view.
In this paper, we aim to detect these trocars by means of a camera placed in the surgical lamp and to localize their positions as a first step. In addition, we aim to determine whether they are currently occupied by an instrument, enhancing context awareness in the OR. This task is challenging due to the rotations and translations of the lamp camera relative to the patient caused by the surgeons, as well as the dark lighting conditions during laparoscopic interventions; we therefore use a second camera that captures the external surgical scene from a different point of view.
Fig. 1 Trocar placement during laparoscopic cholecystectomies. The common positions of the trocars are shown in red. We assign each trocar a unique identifier. The size of the markers correlates with the diameter of the trocars. Trocars with ID1 and ID2 are usually 10 mm trocars, while trocars with ID3 and ID4 are 5 mm trocars
Data set
We created a data set for training the extra-abdominal trocar and instrument detection. For this, we recorded four cholecystectomies with two time-synchronized extra-abdominal 2D cameras at the University Hospital rechts der Isar. The first camera (C1) recorded close-up images of the surgical area, while the second camera (C2) recorded the external surgical scene, including the surgeons, from a distant point of view. Figure 2 shows an example image of each camera view. Both cameras provided a resolution of \(1920\times 1080\) pixels at 30 frames per second (fps). The videos of C1 were labeled at 6 fps together with medical experts for trocar object detection and annotated with binary information about the occupancy state of each trocar. Due to the variable lighting conditions in laparoscopic interventions and the poor visibility of many trocars under dark ambient conditions, we uniformly increased the brightness of the frames of C1 as an image preprocessing step, which facilitated the annotation of these frames.
Fig. 2 Example of time-synchronized frames of each camera. The image of C1 is shown on the left, and the one of C2 is on the right. Note that the frame of C1 has undergone image enhancement
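The paper does not specify how the uniform brightness increase was implemented. A minimal sketch, assuming an additive offset applied with OpenCV (the offset value is an illustrative choice):

```python
import cv2

def brighten_frame(frame, beta=60):
    """Uniformly increase frame brightness by an additive offset.

    The offset (beta) is an illustrative assumption; the paper only
    states that brightness was increased uniformly.
    """
    # convertScaleAbs computes alpha * frame + beta and clips to [0, 255]
    return cv2.convertScaleAbs(frame, alpha=1.0, beta=beta)
```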
Network architecture
Our proposed model consists of three stages. A YOLOv8 [13] model detects the individual trocars in the images provided by camera 1. Using a centroid tracker, we assign an ID to each trocar and keep track of their positions as they move through successive frames of the videos. The detected trocars are cropped with adaptive padding and subsequently encoded using ResNet18 [14]. The features of the cropped trocars (\(X_1\), \(X_2\), \(X_3\) and \(X_4\)) are concatenated with the encoded features of camera 1 (\(X_{C1}\)) and camera 2 (\(X_{C2}\)) into a joint feature vector. A temporal model classifies the occupancy state (empty, occupied or not visible) of the four trocars at each time t on the basis of this feature vector. The full network architecture is illustrated in Fig. 3. Its image processing components are briefly presented below.
Fig. 3 Overview of the network architecture. Trocars are detected using YOLOv8 based on images from camera 1. A centroid tracker assigns a unique identifier to each trocar; the detections are then cropped with adaptive padding and encoded using ResNet18. The features of the cropped trocars (\(X_1\), \(X_2\), \(X_3\) and \(X_4\)) are concatenated with the encoded features of camera 1 (\(X_{C1}\)) and camera 2 (\(X_{C2}\)). This combined feature vector is fed into the temporal model, which provides the occupancy state of each trocar
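To make the feature-fusion stage concrete, the following is a minimal PyTorch sketch. It assumes a shared ResNet18 encoder with 512-dimensional output features; whether the encoders share weights is not stated in the paper, and all tensor shapes are illustrative:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FrameEncoder(nn.Module):
    """Encodes an image into a 512-dim feature vector with ResNet18."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Drop the classification layer, keep global average pooling
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, x):                    # x: (B, 3, H, W)
        return self.features(x).flatten(1)   # (B, 512)

# Joint feature vector for one time step: four trocar crops
# (X1..X4) plus the full frames of C1 and C2 (dummy tensors here).
encoder = FrameEncoder()
crops = [torch.randn(1, 3, 224, 224) for _ in range(4)]
frame_c1 = torch.randn(1, 3, 224, 224)
frame_c2 = torch.randn(1, 3, 224, 224)
feats = [encoder(x) for x in crops + [frame_c1, frame_c2]]
joint = torch.cat(feats, dim=1)              # (1, 6 * 512)
```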
Trocar detection
For the detection of the trocars and their 2D coordinates in terms of bounding boxes, we use a YOLOv8 model. Based on the 2D coordinates and the bounding box size of each trocar, the images are cropped with adaptive padding to the area of interest.
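The exact padding scheme is not described; a plausible sketch, assuming a margin proportional to the bounding box size that is clipped to the image bounds:

```python
def crop_with_adaptive_padding(image, box, pad_frac=0.2):
    """Crop a detected trocar with padding proportional to box size.

    pad_frac is an illustrative assumption; the paper does not state
    the exact padding scheme. box: (x1, y1, x2, y2) in pixels.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    pad_x = int((x2 - x1) * pad_frac)
    pad_y = int((y2 - y1) * pad_frac)
    # Clip the padded box to the image bounds
    x1 = max(0, x1 - pad_x)
    y1 = max(0, y1 - pad_y)
    x2 = min(w, x2 + pad_x)
    y2 = min(h, y2 + pad_y)
    return image[y1:y2, x1:x2]
```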
Trocar tracking
Based on the 2D coordinates, we compute the centroid of each detected trocar. A centroid tracker compares the centroids of the trocars in the current image with those of previous images and assigns an ID to each trocar based on the Euclidean distance. When assigning identifiers, the tracker also takes bounding box areas into account, making it more robust against camera shifts. We use the prior knowledge that the number of trocars to be tracked during the entire period is limited to a maximum of four. By establishing this identifier assignment constraint, we simplify the object tracking process and increase the accuracy of the tracking algorithm.
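A minimal sketch of such a tracker, assuming a greedy nearest-centroid assignment; the weighting of the area term is an illustrative assumption, as the paper does not specify how distance and area are combined:

```python
import numpy as np

class CentroidTracker:
    """Minimal centroid tracker with at most four trocar IDs.

    Assignment cost combines centroid distance with a bounding box
    area penalty; the weighting here is an illustrative assumption.
    """
    MAX_IDS = 4  # prior knowledge: at most four trocars

    def __init__(self, area_weight=0.01):
        self.tracks = {}  # id -> (centroid, area)
        self.area_weight = area_weight

    def update(self, boxes):
        """boxes: list of (x1, y1, x2, y2). Returns {id: box}."""
        assigned = {}
        for box in boxes:
            x1, y1, x2, y2 = box
            c = np.array([(x1 + x2) / 2, (y1 + y2) / 2])
            a = (x2 - x1) * (y2 - y1)
            # Cost: Euclidean distance plus an area-difference penalty
            costs = {
                tid: np.linalg.norm(c - tc) + self.area_weight * abs(a - ta)
                for tid, (tc, ta) in self.tracks.items()
                if tid not in assigned
            }
            if costs:
                tid = min(costs, key=costs.get)
            elif len(self.tracks) < self.MAX_IDS:
                tid = len(self.tracks) + 1  # new ID, capped at four
            else:
                continue  # constraint: never track more than four
            self.tracks[tid] = (c, a)
            assigned[tid] = box
        return assigned
```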
Occupancy state recognition
The temporal model handles sequences of trocar crops and sequences of full images from C1 and C2 by first encoding each image separately via a ResNet18. The resulting feature vectors are concatenated and fed into the temporal classifier for further processing. The network captures temporal relationships within the concatenated feature vectors to provide a contextual understanding of the input sequences. A classification head takes the network outputs and predicts the occupancy state of each trocar. We compare different models for temporal refinement, such as an LSTM [15], a gated recurrent unit (GRU) [16] and a multistage temporal convolutional network (MS-TCN) [17].
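As a minimal sketch of one temporal variant, the following shows a GRU classifier over the joint feature vectors, using the hidden size and sequence length reported in the training section; the layout of the classification head is an assumption:

```python
import torch
import torch.nn as nn

class TemporalOccupancyClassifier(nn.Module):
    """GRU variant of the temporal model (LSTM/MS-TCN are alternatives).

    Input: sequences of joint feature vectors (4 crops + 2 full frames,
    512 dims each -> 3072). Output: one of three occupancy states
    (empty, occupied, not visible) for each of the four trocars.
    """
    def __init__(self, feat_dim=6 * 512, hidden=128,
                 n_trocars=4, n_states=3):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        # Single head producing logits for all trocars at once (assumed)
        self.head = nn.Linear(hidden, n_trocars * n_states)
        self.n_trocars, self.n_states = n_trocars, n_states

    def forward(self, x):                        # x: (B, T, feat_dim)
        out, _ = self.gru(x)
        logits = self.head(out[:, -1])           # state at time t
        return logits.view(-1, self.n_trocars, self.n_states)

# Example: batch of 32 sequences of length 12 -> logits of shape (32, 4, 3)
# model = TemporalOccupancyClassifier()
# logits = model(torch.randn(32, 12, 6 * 512))
```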
Experimental setup
Model training
We trained the YOLOv8 model for trocar detection using the standard YOLOv8 loss function for 30 epochs. We used the stochastic gradient descent optimizer, a learning rate of 1e-3 and a batch size of 16. During training, the images were resized to \(640 \times 640\) pixels, and augmentations such as color space adjustment, affine transformations and horizontal and vertical flips were enabled.
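With the ultralytics API, this training configuration might look roughly as follows; the dataset config trocars.yaml, the model size and the augmentation magnitudes are placeholders, not values reported in the paper:

```python
from ultralytics import YOLO

# Train YOLOv8 for trocar detection with the stated hyperparameters.
model = YOLO("yolov8n.pt")  # model size not stated in the paper
model.train(
    data="trocars.yaml",    # placeholder dataset config
    epochs=30,
    imgsz=640,
    batch=16,
    optimizer="SGD",
    lr0=1e-3,
    fliplr=0.5,             # horizontal flip augmentation
    flipud=0.5,             # vertical flip augmentation
    degrees=10.0,           # affine rotation (illustrative magnitude)
)
```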
The temporal models (LSTM, GRU and MS-TCN) were trained to recognize the occupancy state of the trocars, i.e., whether a laparoscopic instrument is inserted or not. The models were trained with a batch size of 32, a hidden layer size of 128 and a sequence length of 12. We used the multiclass focal loss [18] to handle the imbalance in the class distribution of empty (10.72%), occupied (65.35%) and not visible (23.93%) trocars. All models were implemented in PyTorch and trained on an NVIDIA RTX A6000.
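A compact PyTorch sketch of the multiclass focal loss [18]; the focusing parameter gamma and the optional class weights are illustrative, as the paper does not report the exact values:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, weight=None):
    """Multiclass focal loss for imbalanced occupancy states.

    logits: (N, C) raw class scores; targets: (N,) class indices.
    gamma and the class weights are illustrative assumptions.
    """
    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, targets, weight=weight, reduction="none")
    p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    # Down-weight easy examples by the modulating factor (1 - p_t)^gamma
    return ((1 - p_t) ** gamma * ce).mean()
```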
Evaluation metrics
To assess the performance of our network architecture, we report common classification metrics, namely precision, recall and F1 score, for both the trocar detection and the recognition of the occupancy state. The metrics of the YOLOv8 network already give an indication of the performance of the temporal models, since the latter rely on the trocar detections to construct the feature vector.
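These per-class metrics can be computed, e.g., with scikit-learn; the labels and predictions below are placeholders:

```python
from sklearn.metrics import precision_recall_fscore_support

# Per-class precision, recall and F1 for the three occupancy states
y_true = [0, 1, 1, 2, 1, 0]   # 0=empty, 1=occupied, 2=not visible
y_pred = [0, 1, 2, 2, 1, 1]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0
)
```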