We start by explaining the general idea behind NCA training and inference and continue by outlining the eNCApsulate models with all of their tweaks. Finally, we elaborate on the experimental design used in this study to train NCA models on the PC and transfer them to the ESP32 microcontroller platform (Fig. 1).
Neural cellular automata
NCAs are an emerging family of neural models that is gaining traction in medical image processing. Their working principle combines the ideas of two bioinspired systems, bringing together neural networks and cellular automata. NCAs operate on an image grid with an extended channel dimension, where a common local rule is applied to each cell in an iterative fashion. Figure 2 illustrates the operation that is performed on each cell in each time step: First, the Moore neighborhood of each cell is aggregated by applying \(3 \times 3\) image filters. We use three filter banks: one consisting of identity filters, which only yield the current cell's state, and two learned filter banks that aggregate the cell's neighborhood using a different filter matrix in each channel. The scalar result for each channel is stored in a vector that is passed to a multilayer perceptron (MLP), representing the common rule applied to every cell. The final result is added to the original image buffer once all new cell states have been computed. However, only 80% of cells are stochastically updated in each time step, relaxing the otherwise simultaneous cell updates on the entire grid. For a more detailed overview of the NCA architecture and its key ideas, we point the interested reader to the comprehensive "Growing NCA" paper by Mordvintsev et al. [16].
Fig. 2 eNCApsulate architecture for lightweight segmentation or depth estimation. (1) The channels of the input RGB image are augmented to match the input + hidden + output channel dimension C. (2) The input image is then processed by a learned bank of \(3\times 3\) filters, and for each pixel, the concatenated result (3) is fed as input to the NCA MLP network (4). The NCA MLP computes the image update for each cell. The result is an update vector (5) that is added to the input image buffer with a chance of 50% (stochastic cell update)
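To make the update rule concrete, the following is a minimal PyTorch sketch of a single NCA time step as described above and illustrated in Fig. 2. The channel count, hidden layer width, and fire rate are illustrative defaults rather than the exact eNCApsulate configuration, and the class and parameter names are ours.

```python
import torch
import torch.nn as nn

class NCAStep(nn.Module):
    """One NCA time step: perception (3x3 filters) -> per-cell MLP -> stochastic update."""

    def __init__(self, channels=18, hidden=64, fire_rate=0.5):
        super().__init__()
        self.fire_rate = fire_rate
        # Perception: two learned 3x3 filter banks, one filter per channel (depthwise);
        # the identity filter bank is realized by concatenating the state itself below.
        self.learned_filters = nn.Conv2d(channels, 2 * channels, kernel_size=3,
                                         padding=1, groups=channels, bias=False)
        # Common local rule: a small per-cell MLP, expressed as 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(3 * channels, hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, state):
        # 1) Aggregate the Moore neighborhood of every cell (identity + learned filters).
        perception = torch.cat([state, self.learned_filters(state)], dim=1)
        # 2) The MLP computes an update vector for each cell.
        update = self.mlp(perception)
        # 3) Stochastic cell update: only a random subset of cells applies its update.
        b, _, h, w = state.shape
        mask = (torch.rand(b, 1, h, w, device=state.device) < self.fire_rate).float()
        return state + update * mask
```

Inference then amounts to applying this step iteratively (typically on the order of 100 time steps) and reading the prediction from the last channel of the state.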
NCA model architecture and training
We train two models, namely eNCApsulateS for segmentation and eNCApsulateD for depth estimation. eNCApsulateS operates on an 18-channel image, whereas eNCApsulateD uses 22 channels in total. In both models, the first three channels are fixed, as these are the RGB channels necessary to store the image data. In the last channel, the NCA produces the segmentation or depth map output, respectively. All channels in between are hidden channels that the NCA model uses to retain information between individual time steps. The hidden channels and the output channel are initialized with noise, as we found that this increases the robustness of the training.
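As a small sketch of this channel layout, the state can be seeded as follows; the standard-normal noise and the helper name are assumptions, since the exact noise distribution is not specified above.

```python
import torch

def init_state(rgb, total_channels=18):
    """Build the NCA state: RGB in the first three channels, noise everywhere else.

    rgb: tensor of shape (B, 3, H, W). total_channels is 18 for eNCApsulateS and
    22 for eNCApsulateD; the noise distribution (standard normal) is an assumption.
    """
    b, _, h, w = rgb.shape
    state = torch.randn(b, total_channels, h, w, device=rgb.device)
    state[:, :3] = rgb   # fixed RGB input channels
    return state         # the last channel will hold the segmentation / depth output
```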
eNCApsulateD is trained on a subset of the KID2 [17] dataset, which is passed through Depth Anything V2 [18] to obtain pseudo-ground-truth depth maps. The resulting depth maps are curated automatically: since we cannot fully trust the foundation model, we remove image samples whose generated depth maps appear flat. To determine whether a depth map is flat, we compute its normalized gradient magnitude and accept the map only if this value exceeds a threshold of 1.1. After applying this strategy, 727 annotated samples remained, which were used to train the depth estimator.
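A sketch of this curation check is given below. Only the 1.1 threshold is taken from the text; interpreting the normalized gradient magnitude as the summed gradient magnitude of a depth map rescaled to [0, 1] is our assumption.

```python
import torch

def accept_depth_map(depth, threshold=1.1):
    """Accept a pseudo-ground-truth depth map only if it is not flat (sketch).

    depth: 2D tensor (H, W). A perfectly flat map yields a score of 0 and is rejected.
    """
    d = depth.float()
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)   # rescale depth to [0, 1] (assumption)
    gy, gx = torch.gradient(d)                        # finite-difference gradients
    score = torch.sqrt(gx ** 2 + gy ** 2).sum()       # aggregate gradient magnitude (assumption)
    return bool(score > threshold)
```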
eNCApsulateD is trained with a combination of three losses: mean squared error (MSE), structural similarity (SSIM) loss, and an image gradient loss, weighted with \(\lambda_{\text{MSE}} = 1.0\), \(\lambda_{\text{SSIM}} = 1.0\), and \(\lambda_{\text{grad}} = 0.1\), respectively. During training, we make use of batch duplication, as this has improved training stability in prior work [9]. Minibatches have size 8 (duplicated: 16) and consist of cropped capsule endoscopic samples, which are downsampled to \(64 \times 64\) patches. This resizing cuts down the otherwise high VRAM requirements during training.
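A compact sketch of this loss combination is shown below; the weights follow the text, while the SSIM implementation is passed in as a callable because the text does not specify which one is used, and the gradient loss variant (L1 on finite differences) is an assumption.

```python
import torch.nn.functional as F

def gradient_loss(pred, target):
    """L1 distance between horizontal and vertical image gradients (one common variant)."""
    dx = lambda t: t[..., :, 1:] - t[..., :, :-1]
    dy = lambda t: t[..., 1:, :] - t[..., :-1, :]
    return (dx(pred) - dx(target)).abs().mean() + (dy(pred) - dy(target)).abs().mean()

def depth_loss(pred, target, ssim_fn, w_mse=1.0, w_ssim=1.0, w_grad=0.1):
    """Weighted combination of MSE, SSIM loss, and image gradient loss.

    ssim_fn(pred, target) is expected to return a similarity score in [0, 1].
    """
    return (w_mse * F.mse_loss(pred, target)
            + w_ssim * (1.0 - ssim_fn(pred, target))
            + w_grad * gradient_loss(pred, target))
```

Batch duplication simply repeats every minibatch along the batch dimension before the forward pass, e.g. `x = torch.cat([x, x], dim=0)`.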
For the qualitative evaluation of eNCApsulateD, we select a subset of the KvasirCapsule dataset [10] containing challenging benchmark images, which we sort into five categories of five sample images each: blood, bubbles, complex folds, debris, and foreign body.
Fig. 3 Accuracy of different lightweight segmentation models (blue) vs. eNCApsulateS (green), and their model size in kilobytes, on a logarithmic scale
Table 1 Comparison of different small-scale segmentation models (backbones for U-Net) and eNCApsulateS, evaluated on a held-out test set, which is a subset of the KID2 dataset. Results are computed by an ensemble of models trained on the 5-fold split.
Fig. 4 Qualitative segmentation results for eNCApsulateS, compared to other lightweight CNN-based segmentation models
Fig. 5 Visual comparison of different monocular depth estimation approaches on a part of the benchmark dataset (subset of KvasirCapsule). eNCApsulate was trained on the KID2 dataset, whereas the other models are foundation models
Fig. 6 3D projections of generated RGBD images with baselines and eNCApsulateD, for the five categories of our benchmark set
All models are implemented in Python with PyTorch and trained on a PC equipped with an NVIDIA GeForce RTX 3090 GPU.
Porting NCAs to microcontrollers
Once the NCA model is trained and properly tested on PC hardware, the next step is to port eNCApsulate to the ESP32-S3 microcontroller. As shown in Figure 1, we use a tiny variant of the ESP32-S3 microcontroller in our experimental setup. Although the ESP32-S3 is not a dedicated hardware accelerator for neural networks, it offers several features that allow us to run NCAs efficiently. Most importantly, it provides single instruction, multiple data (SIMD) instructions for instruction-level parallelism. These are especially useful for the matrix operations needed in the forward pass of the NCA's MLP. We make use of SIMD instructions wherever applicable, reducing the average runtime of a single inference from 9 s to 3 s. The ESP32-S3 also features a proper floating point unit (FPU), which significantly accelerates the floating point instructions required for the depthwise convolution.
Since we cannot rely on code optimizations under the hood of a framework like PyTorch, the entire inference loop is implemented from scratch in ANSI C. A major difference in our NCA implementation on the microcontroller lies in the order in which the inference steps are executed. The stochastic cell update is evaluated first, as it is a 50:50 condition that decides whether the remaining per-cell code runs at all. After that, we compute the filter operations of the depthwise convolution; however, we do not store the results of these filters in separate buffer matrices. Instead, we use only two buffers: the actual image buffer and an update buffer, which is added to the image after all cell updates have been computed. We will provide our implementation upon acceptance of this paper, so that these optimization steps can be traced easily.
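To make the ordering explicit, the following Python sketch mirrors the control flow of the per-cell loop described above; the actual implementation is in ANSI C, and `perceive` and `mlp_forward` are placeholder callables standing in for the depthwise convolution and the per-cell MLP.

```python
import random

def nca_timestep(image, update, perceive, mlp_forward, fire_rate=0.5):
    """One NCA time step using only two buffers (`image` and `update`),
    laid out as C x H x W nested lists, in the same order as the C loop."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    for y in range(height):
        for x in range(width):
            # 1) Stochastic cell update comes first: if the cell does not fire,
            #    no filter or MLP work is spent on it at all.
            if random.random() >= fire_rate:
                for c in range(channels):
                    update[c][y][x] = 0.0   # clear stale values from the previous step
                continue
            # 2) Depthwise 3x3 filters over the Moore neighborhood of (x, y);
            #    the result is a small per-cell vector, not a separate buffer matrix.
            perception = perceive(image, x, y)
            # 3) The per-cell MLP produces the update vector for this cell.
            delta = mlp_forward(perception)
            for c in range(channels):
                update[c][y][x] = delta[c]
    # 4) Only after all cells have been visited is the update added to the image buffer.
    for c in range(channels):
        for y in range(height):
            for x in range(width):
                image[c][y][x] += update[c][y][x]
```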
Accelerating inference on the ESP32-S3
Since inference in NCA models is an iterative process, they trade model size for runtime. Although this property allows us to bring the model to lightweight architectures, it comes at the price of a rather slow inference process. Typically, an NCA needs around 100 time steps (forward passes) to converge, which is negligible for inference on the GPU, but costly on the microcontroller, where each time step takes roughly 65 ms.
We therefore extend eNCApsulateS with a temporal regularization scheme that skips unnecessary time steps and thus reduces the average inference time. In particular, we interrupt the inference process early if no significant change in the hidden channels is observed. Once a minimum number of steps (10) has been reached, we take the absolute difference between the hidden channel tensors of two consecutive time steps, sum it over all cells, and normalize it by the number of entries. If this normalized difference falls below a threshold (we use 0.1), a cooldown counter is decremented from 5 towards 0; otherwise, the counter is reset to 5. Once the counter reaches 0, inference is stopped at the current time step. A full description of this algorithm can be found in our public supplemental material.
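The stopping criterion can be sketched in PyTorch as follows (the microcontroller version follows the same logic in C); the step counts and thresholds match the text, while `hidden_slice` and the function interface are assumptions based on the channel layout described earlier.

```python
def run_with_early_stopping(state, step_fn, hidden_slice=slice(3, -1),
                            max_steps=100, min_steps=10,
                            change_threshold=0.1, cooldown_steps=5):
    """Iterate the NCA, stopping early once the hidden channels have settled.

    state: (B, C, H, W) tensor; step_fn advances the state by one NCA time step.
    """
    cooldown = cooldown_steps
    prev_hidden = state[:, hidden_slice].clone()
    for t in range(max_steps):
        state = step_fn(state)
        hidden = state[:, hidden_slice]
        if t + 1 >= min_steps:
            # Total absolute change of the hidden channels, normalized by the number of entries.
            change = (hidden - prev_hidden).abs().sum() / hidden.numel()
            if change < change_threshold:
                cooldown -= 1
                if cooldown == 0:
                    break                     # hidden state has settled: stop inference
            else:
                cooldown = cooldown_steps     # significant change: reset the cooldown
        prev_hidden = hidden.clone()
    return state
```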