Introduction

Extracting quantitative information from image data is a major step in many fields of research. Prior to the last decade, state-of-the-art algorithms typically focused on highly specific use cases, such as tracking spherical particles1 or identifying astronomical light sources2. These algorithms were typically task specific—aiming to identify predefined features—as opposed to machine learning algorithms, which are more adaptive. In fact, reviews as late as 2015 did not even mention machine learning (ML)3, and progress is still being made in this ___domain today4. Since the introduction of AlexNet5 in 2012, the capability of ML methods in this arena has advanced at a breathtaking pace, fueled largely by the success of convolutional neural networks (CNNs)6. This class of techniques allows a more general approach to the quantification of image data, including addressing more nuanced and harder-to-formulate questions, because it requires only correct examples as training data. More specifically, the task of segmenting an image—identifying the pixels that comprise one or more objects or regions of interest—has become a large focus7, as it allows researchers to rapidly and deeply analyze complex data. While state-of-the-art benchmarks in this ___domain (e.g. ML Commons) require enormous computation and are thus out of reach for even a skilled individual user, software tools like Keras8, an Application Programming Interface (API) for Python, greatly simplify the process of creating smaller, custom neural-network solutions, in principle in just a few lines of code. However, in practice the process is rarely that simple, and for those unfamiliar with deep neural networks, many pieces of the process become daunting: optimizing the many user-defined “hyper-parameters” of the algorithm, picking the right network, cleaning the data, and possibly learning a new programming language can each require substantial additional effort.

As a result, a large and recent body of work has focused on methods and software packages for simplifying this process. The majority targets biological research, specifically the tracking of cells from microscopy data9,10,11,12,13,14,15,16, but similar works tackle goals ranging from identifying and tracking 2D materials like graphene17 to segmenting other medical or biological imaging data18,19,20,21, images of flora and fauna22, scanning electron microscopy images for materials science23,24, astronomical data25,26, particle physics data27, and more. Typically these works compete for the highest accuracy on benchmark data sets11 or for ease of use in pre-specified domains (very often biological data)9,10. While many of these methods are likely applicable to tasks outside of their intended application, e.g.15,21, few are explicitly designed for general use, and they often require preexisting image analysis software such as ImageJ and present a menu of options that can be intimidating to those unfamiliar with machine learning methods. At the opposite end of the spectrum, packages designed for ‘zero-shot’ (no user-input) segmentation have become increasingly powerful28, but they lack the malleability needed in many custom research scenarios.

Here we introduce an easy-to-use segmentation solution aimed at a broad array of research applications, named “Bellybutton.” Bellybutton uses a 15-layer convolutional neural network that can be trained on as little as one image (or a sub-sampling of one image) with user-defined segmentation, and it can account for variations in size, lighting, rotation, focus, or shape of the desired segmentation regions, as is common in research applications. The algorithm operates on a pixel-by-pixel basis, determining whether each pixel is inside or outside of a segmentation (‘innies’ or ‘outies,’ hence the name Bellybutton). The algorithm can analyze input images of varying shape and size, and it automatically performs a variety of data augmentations, including flipping and rotating images, normalizing brightness across images, and evenly sampling innies and outies. Bellybutton requires no coding knowledge, and it can be trained and run on a laptop. We detail its performance and flexibility through several use cases: segmenting bubbles imaged with poor lighting and focus, segmenting semi-transparent, tightly packed particles with intricate birefringence patterns, and tracking a thin, clear lattice of material as it fractures over time. Each of these data sets is available online, along with a guide for Bellybutton’s use on new data sets.

Method

Bellybutton operates on a pixel-by-pixel basis, scanning images and using the neighborhood around a given point in an image to determine whether a pixel is inside or outside of a segment, as well as how far it is from that segment’s edge. It uses a deep convolutional neural network (CNN), whose structure is shown schematically in Fig. 1A. The CNN consists of \(3\times 3\) convolutional layers, \(2\times 2\) max pooling layers, and skip connections inspired by ResNet29, and it ends with four dense layers feeding into two outputs—a classification of pixel type (inside or outside a region), and a distance-from-region-edge scalar value, which is used to separate distinct regions in contact. The scalar value is trained to vary from 0 (for all outside pixels) to a maximum value set by the user (typically 10), allowing the system to localize region edges while easily satisfying this output where it is unimportant, for example in the center of a 100-pixel-wide region. We use a binary cross-entropy loss for the classification output and a mean absolute error loss for the scalar distance output, with equal weighting between the two. Because the distance map is ultimately used to separate contacting regions, accuracy at low distances is most important, making mean absolute error a superior choice to the more standard mean squared error. Bellybutton is built on Tensorflow30 and uses the Adam optimizer for training with a learning rate of 0.001.
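
For illustration, a minimal Keras sketch of such a two-headed network is shown below. The filter counts, dense-layer widths, and exact skip-connection placement are illustrative assumptions rather than the precise Bellybutton architecture; only the overall pattern (paired \(3\times 3\) convolutions, \(2\times 2\) pooling, four dense layers, two outputs, and the losses and optimizer named above) follows the text.

```python
# A minimal sketch of a two-headed, Bellybutton-like network in Keras. Filter
# counts, dense-layer widths, and skip-connection placement are illustrative
# assumptions; only the overall pattern (3x3 convs, 2x2 pooling, four dense
# layers, two outputs, BCE + MAE losses, Adam at 1e-3) follows the text.
import tensorflow as tf
from tensorflow.keras import layers, Model

n_scales = 4  # e.g. 1, 3, 9, and 27x windows, each resized to 25x25 pixels
inputs = layers.Input(shape=(25, 25, n_scales))

x = inputs
for filters in (24, 48, 96):                              # three conv-conv-pool blocks
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    skip = layers.Conv2D(filters, 1, padding="same")(x)   # 1x1 conv to match channels for the skip
    x = layers.Add()([y, skip])
    x = layers.MaxPooling2D(2)(x)                         # spatial size 25 -> 12 -> 6 -> 3

x = layers.Flatten()(x)
for units in (128, 64, 32, 16):                           # four dense layers (widths assumed)
    x = layers.Dense(units, activation="relu")(x)

classification = layers.Dense(1, activation="sigmoid", name="innie")(x)  # inside/outside a region
distance = layers.Dense(1, activation="relu", name="dist")(x)            # capped distance to region edge

model = Model(inputs, [classification, distance])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss={"innie": "binary_crossentropy", "dist": "mean_absolute_error"},
    loss_weights={"innie": 1.0, "dist": 1.0},
)
```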

The chosen network architecture strikes a balance between being small enough to train rapidly from scratch on a laptop and large enough to generate valid segmentations of nontrivial problems. The choice of a CNN has been the standard for segmentation problems6,12,14,18,20,22,23,24,25,26, as it gives the network natural access to spatial information. The decreasing layer size is also standard, and gives the network sufficient flexibility to hierarchically analyze spatial patterns without superfluous parameters. The network itself takes multiple differently sized subsets of an image as input, each centered on the pixel in question and down-sampled to \(25\times 25\) pixels. This sampling process is performed automatically during training and prediction, and gives the network the ability to analyze multiple length scales while keeping the input size minimal. A typical example using 1, 3, 9, and 27x scales is shown in Fig. 1A, B.
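
As a concrete illustration of this multi-scale sampling, a minimal sketch is given below; the function name, the reflective edge padding, and the use of scikit-image’s resize for down-sampling are assumptions for illustration, not the package’s exact choices.

```python
# Illustrative sketch of multi-scale sampling: windows of 1, 3, 9, and 27x the
# base size are cut around the target pixel, each resized to 25x25, and stacked
# along the channel axis. Padding mode and interpolation are assumed choices.
import numpy as np
from skimage.transform import resize

def multiscale_patch(image, row, col, base=25, scales=(1, 3, 9, 27)):
    channels = []
    for s in scales:
        half = (base * s) // 2
        padded = np.pad(image, half, mode="reflect")      # keep pixels near the image edge usable
        window = padded[row:row + 2 * half + 1, col:col + 2 * half + 1]
        channels.append(resize(window, (base, base)))     # down-sample to 25x25
    return np.stack(channels, axis=-1)                    # shape (25, 25, len(scales))
```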

For training, a user may provide individually labeled segmentation maps; that is, every pixel in a particular segment contains the same number, unique to that segment. Alternatively, if no segments are in contact, a user-provided binary mask is sufficient. In either case, Bellybutton generates two labels for each pixel: a binary classification label that corresponds to ‘innie’ or ‘outie’, which does not distinguish between uniquely labeled regions, and a scalar label, the distance (in pixels) to the nearest edge of a region. It is these two labels that the CNN is trained to reproduce. Optionally, the user may exclude regions of an image using a binary Area of Interest (AOI) mask, as indicated by the excluded gray area in Fig. 1C.
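
For the simple case of a binary mask with no touching segments, these two per-pixel labels can be generated with SciPy’s Euclidean distance transform, as in the following illustrative sketch (the cap of 10 pixels mirrors the typical user setting mentioned above; this is not the package’s internal code):

```python
# Sketch of the two per-pixel training labels for a binary mask with no
# touching segments: a 0/1 innie-outie label and a capped distance to the
# nearest region edge. The 10-pixel cap is the typical user setting.
import numpy as np
from scipy import ndimage

def make_labels(mask, max_dist=10):
    innie = (mask > 0).astype(np.float32)          # 1 inside a region, 0 outside
    dist = ndimage.distance_transform_edt(innie)   # pixels to the nearest outside pixel
    dist = np.minimum(dist, max_dist)              # cap so deep interiors are easy to satisfy
    return innie, dist.astype(np.float32)
```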

To avoid prolonged training, the user may choose to train using only a fraction of the available training data. We find that near-optimal results are often reached without using all available pixels (see Fig. 2E). Furthermore, rotated and flipped images are (optionally) used in training to prevent overfitting. Once trained, Bellybutton produces a classification score ranging from 0 (outside) to 1 (inside a region) for each pixel (trained on the binary label), shown in Fig. 1D. This score is thresholded to produce a binary innie-vs-outie map. Finally, the scalar distance output, shown in Fig. 1E, is used to watershed the ‘innie’-classified pixels into distinct regions to produce a segmented map, as in Fig. 1F. The data used in this figure, images of aqueous foams in microgravity, come from Ref. 31, which was the first work to utilize Bellybutton.
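
A minimal sketch of this post-processing step, using scikit-image’s watershed and peak finding with assumed threshold and seed-spacing values, might read:

```python
# Post-processing sketch: threshold the class score, then watershed the
# predicted distance map to split touching regions. The 0.5 threshold and the
# peak-finding parameters are illustrative choices.
import numpy as np
from skimage.segmentation import watershed
from skimage.feature import peak_local_max

def segment(class_score, dist_map, threshold=0.5):
    innie = class_score > threshold                            # binary innie/outie map
    peaks = peak_local_max(dist_map, labels=innie.astype(int), min_distance=3)
    markers = np.zeros(dist_map.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)     # one seed per candidate region
    return watershed(-dist_map, markers=markers, mask=innie)   # labeled segmentation map
```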

Figure 1

The Bellybutton Method (A) Architecture of the 15-layer convolutional neural network. Multiple scales of an experimental image, each reduced to \(25\times 25\) pixels, are simultaneously taken as a single input. The network consists of two \(3\times 3\) convolutional layers followed by a \(2\times 2\) max pooling layer. This pattern is repeated twice more, each time with skip connections as shown. The final \(2\times 2\times 96\) layer is flattened, fed through four dense layers, and produces two output scalars, one signifying the class of the pixel (inside or outside of a region), the other the distance to the nearest region edge. (B) An example experimental image, overlaid with the chosen input scales 1, 3, 9, and 27x. (C) User-defined mask, in this case binary as no segments are in contact. The user may also define an Area of Interest (AOI), which in this example removes the edges of the image (gray) from training. (D) Class probability output after training. The network generates a prediction score on a pixel-by-pixel basis. (E) Distance map to the outside of a particle (scalar output of the network). Training values are capped at a user-specified value, in this case 10 pixels, so much of the image appears binary. The zoomed-in region highlights the gray-scale output near the edges of the bubbles. (F) The final segmentation is produced by watershedding the binarized classification probability (D) using the distance map (E). (D) and (E) are also saved if desired.

Example uses

Bellybutton is effective for a variety of purposes. Here we use the example of segmenting a 3D-printed photoelastic material in the shape of a granular packing. This material is illuminated between crossed polarizers such that it develops a birefringence pattern when under mechanical stress. This lighting is useful experimentally, but it complicates the tracking process; previous experiments using photoelastic granular disks have required two sets of images, one with regular lighting to track particles and a second with the birefringence pattern to analyze forces32. Bellybutton was trained on two of the four sections of each of three images of this system, under low, medium, and high stress, and tested on the remaining two sections of each image, shaded purple in Fig. 2A. While remaining roughly the same shape, the particles present a wide variety of patterns as the stress changes. Furthermore, a variety of confounding factors make this segmentation more difficult: a substantial portion of the image (the left and right edges) is out of focus; the camera is close enough to the sample that only particles in the center are imaged head-on, leading to different viewing angles for particles near the edges of the system; and particles near the left and right edges are tilted sufficiently that their edges are exposed to the camera.

The input scales used are shown in Fig. 2B, overlaid on zoomed-in data. Segmentation is successful, with the majority of errors concentrated at the bottom of the leftmost image, where contrast and focus are worst. Typical regions are successfully segmented, as seen by comparing Fig. 2C, D, taken from the test set.

For quantitative analysis of these results, we utilize the SEG score from Ref. 11, which compares each true region with the identified region of highest overlap. We find this metric to be the most indicative of performance by eye, although many others are commonly used7,11. For each true region \(R_i\), a ‘Jaccard index’ is calculated with the Bellybutton-generated region \(B_i\) of highest overlap, by dividing the area of their intersection by the area of their union. True regions whose intersection with that region covers less than half of their area are given a score of 0. The reported SEG score is the average of all such scores for a given dataset, with a perfect score being 1. A detailed explanation of the calculation can be found at celltrackingchallenge.net. Bellybutton reliably exceeded a SEG score of 0.9 on the test set for this data.
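
A minimal sketch of this calculation, assuming integer-labeled maps with 0 as background, is given below; it follows the description above rather than the reference implementation at celltrackingchallenge.net.

```python
# Sketch of the SEG score as described in the text: each true region is matched
# to the predicted label it overlaps most, scored by Jaccard index, and scored
# 0 unless that overlap covers at least half the true region's area.
import numpy as np

def seg_score(true_labels, pred_labels):
    scores = []
    for r in np.unique(true_labels):
        if r == 0:                                   # skip background
            continue
        region = true_labels == r
        ids, counts = np.unique(pred_labels[region], return_counts=True)
        keep = ids != 0                              # ignore overlap with background
        if not keep.any():
            scores.append(0.0)
            continue
        best = ids[keep][np.argmax(counts[keep])]    # generated region of highest overlap
        inter = counts[keep].max()
        if inter < 0.5 * region.sum():               # must cover at least half the true region
            scores.append(0.0)
            continue
        union = region.sum() + (pred_labels == best).sum() - inter
        scores.append(inter / union)
    return float(np.mean(scores)) if scores else 0.0
```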

In the highlighted example the entire training set was used, and the network was trained for \(E=2\) epochs (each training data point was shown to the network twice). For practical use, however, it may not be necessary to use even this much data (half of each of three images), as shown in Fig. 2E. A sub-sampling option is provided as a parameter in the Bellybutton package, named ‘fraction.’ This value indicates the fraction, in the range (0, 1], of available training pixels that the algorithm will use to train the neural network. For values below 1, individual pixels are chosen randomly, but at a rate that ensures innies and outies are equally represented. (This can also be modified easily via the parameters of the algorithm, to instead represent innies and outies in the ratio in which they appear in the images.) We find that accuracy for a variety of problems depends on the quantity \(EF = T/M\) being sufficiently high, where E is the number of epochs in training, M is the size of the total training set, F is the fraction of the training set that is used, and \(T=EFM\) is the total number of training steps. This dependence is shown by the data collapse in Fig. 2E. As a result, smaller data fractions F can be used to quickly assess the tractability of a problem. In this example, even tiny fractions of the training data still yield passable results, including a total training set corresponding to only 0.5% of each training image (F = 0.01, as half of each image is reserved for testing), as seen by the modest dependence of SEG on data fraction in Fig. 2F. Further, the small variance in test results for F = 0.01 suggests that for this dataset our algorithm is robust to variation in training data and thus to the various sources of noise that plague these images: lighting, focus, particle-size variation, etc. However, for optimal results, a larger fraction of the data must be used, to give the network access to a wider variety of examples. Overall, more data is typically better, but we often find that \(F\ge 0.1\) gives reasonable results for systems with many repeated particles, like the one shown in Fig. 2. An important caveat is that these training data should be taken from a sufficiently varied set of images and locations within those images to encompass the range of the desired data set.
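
For illustration, such class-balanced sub-sampling might be implemented as in the sketch below; the function and variable names are hypothetical and not the package’s internals. Note that in this scheme, F = 0.01 with E = 3 and F = 0.03 with E = 1 correspond to the same training budget \(T = EFM\).

```python
# Illustrative class-balanced sub-sampling of training pixels: pick a fraction
# F of pixels at random while keeping innies and outies equally represented.
# Names are hypothetical, not the Bellybutton package's internals.
import numpy as np

def subsample_pixels(innie_label, fraction, seed=0):
    rng = np.random.default_rng(seed)
    inside = np.argwhere(innie_label == 1)
    outside = np.argwhere(innie_label == 0)
    n_each = int(fraction * innie_label.size) // 2   # equal innie/outie representation
    pick_in = inside[rng.choice(len(inside), size=min(n_each, len(inside)), replace=False)]
    pick_out = outside[rng.choice(len(outside), size=min(n_each, len(outside)), replace=False)]
    return np.vstack([pick_in, pick_out])            # (row, col) pairs used for training
```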

Bellybutton is also useful for structure-finding. In the following example, a lattice of laser-cut acrylic (polymethyl methacrylate, or PMMA) is slowly fractured while lit between crossed polarizers to reveal changes in internal stress. These changes to the material’s structure affect its brightness, as shown in Fig. 3A, E, making it difficult to track algorithmically. Using just three training images with human-generated masks (Fig. 3B), Bellybutton is capable of tracking the fracturing structure on unseen test images (Fig. 3C, D) through time, despite lighting and focus changes, as shown for a zoomed-in portion in Fig. 3E, F. While all images in this example are broadly similar, the important information about the system—the lattice structure—changes significantly, indicated only by subtle shifts in edge locations and lighting (e.g. the breakage at the bottom of the images in Fig. 3E, F between t=300 and t=400). This makes hard-coded algorithmic detection difficult and extensive human labeling time-consuming. We note that our package includes options for saving a binarized innie-vs-outie output or a scalar distance-to-edge output, the latter of which is shown in Fig. 3F. This option can be helpful for skeletonizing a structure and for suppressing noise and error.
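
For example, one might threshold the saved distance-map output and skeletonize the result with scikit-image to recover the lattice network; the 1-pixel threshold below is an assumed, illustrative value.

```python
# One possible use of the saved distance-map output for structure finding:
# threshold it and skeletonize the result. The threshold is illustrative.
from skimage.morphology import skeletonize

def lattice_skeleton(dist_map, threshold=1.0):
    structure = dist_map > threshold     # keep pixels confidently inside the lattice
    return skeletonize(structure)        # one-pixel-wide skeleton of the network
```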

Figure 2

3D Printed Photoelastic Disks (A) Images of a 3D-printed photoelastic material in the shape of a granular packing under three stress states (high, medium, low). Each was divided into four sections, two of which (gray) were used for training, and two (purple) for evaluation. A single network was trained using all six training regions and tested on all six test regions. (B) Zoom-in on the orange-framed region in (A). Note the variety of lighting patterns on each disk. Teal and blue superimposed squares are the image scales fed into the network for this task. (C) User-generated masks for these zoomed-in regions (which are part of the test set). (D) Final segmentation output for the zoomed-in region. Note that the colors serve only to differentiate regions; there is no attempt to match the colors between (C) and (D). (E) SEG score for the test set as a function of Epochs times Training Fraction EF. Training fraction F is denoted by color, and is the portion of the training data used in training the network, with each data point shown to the network once per epoch. SEG score is an indicator of segmentation quality, and is calculated by dividing the intersection of generated regions and their corresponding true regions by their union, averaged over all true regions (see text for further explanation). (F) SEG score for all runs with \(EF\ge 3\) as a function of data fraction F. Note the diminishing returns on this task for high F.

Conclusion: how and when to use Bellybutton

In summary, to use Bellybutton a user supplies images and labels (masks). Bellybutton then automatically converts these into a format digestible by its CNN, including augmenting the data to aid training (e.g. adding rotated and flipped versions) and balancing classes, and trains the network using user-defined parameters. The algorithm then produces pixel-level predictions for each image, which are (automatically) spatially assembled and converted into segmented maps through a watershedding algorithm. Users may elect to have the algorithm save the CNN outputs themselves, the ‘innie’-vs-‘outie’ classification and/or the distance map, as is helpful in some cases (such as Fig. 3).
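
Tying the earlier sketches together, a hypothetical per-image pass mirroring this workflow might read as follows; it reuses the multiscale_patch and segment helpers sketched in the Method section and is not the actual Bellybutton code or its command-line interface.

```python
# Hypothetical end-to-end pass over one image, mirroring the workflow described
# in the text. Assumes the multiscale_patch and segment helpers sketched above
# and a trained two-output Keras model.
import numpy as np

def predict_image(model, image, base=25, scales=(1, 3, 9, 27)):
    rows, cols = image.shape
    class_score = np.zeros((rows, cols), dtype=np.float32)
    dist_map = np.zeros((rows, cols), dtype=np.float32)
    for r in range(rows):                                  # predict one row of pixels at a time
        patches = np.stack([multiscale_patch(image, r, c, base, scales)
                            for c in range(cols)])
        p_class, p_dist = model.predict(patches, verbose=0)
        class_score[r] = p_class[:, 0]
        dist_map[r] = p_dist[:, 0]
    return segment(class_score, dist_map)                  # watershed into labeled regions
```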

Figure 3

Tracking a changing structure with Bellybutton (A) Training images of a fracturing lattice. Image contrast and brightness have been enhanced, and the top 2/3 of each image is shown. Note that these are the only training images, but that we have spread them out in time to encompass a wide range of situations. (B) Binary mask for the third training image with superimposed area of interest (gray). (C) Example test image and (D) accompanying Bellybutton-generated distance map output. The orange square denotes the ___location of the zoomed regions in (E) and (F). (E) Zoomed-in (enhanced) images with (F) the corresponding Bellybutton-generated distance maps for many time steps.

We have tried to make Bellybutton as accessible as possible. It is downloadable as a Python package that can be installed with a single command, and utilizing Bellybutton requires no coding. Instructions for use, details on how to customize training and hyper-parameters, and much more can be found at pypi.org/project/bellybuttonseg. For Python-savvy users, the code itself and a Jupyter Notebook version are also available at github.com/sdillavou/bellybuttonseg. Starting a project is as simple as running a single command, after which Bellybutton creates a folder structure into which the user adds images, masks, and areas of interest. Adjusting the parameters of training and testing is done by editing an automatically generated text file. Furthermore, we have provided the three data sets used in the figures of this work as example projects that can be downloaded in one command, set up, and run on a laptop. Deploying one of these example projects takes under a minute, plus training time (computer dependent).

While just three examples of Bellybutton’s potential uses are shown here, its flexibility should make it useful in a wide variety of situations. For example, regions are not limited to single particles; masks might specify the two connected regions of a dimer, or a disk and a mark on its surface indicating its rotational position, as separate regions, allowing both to be segmented simultaneously. The same approach could be applied to a cell and its nucleus, an insect and its head or feet, or a particle and its previous position, allowing velocity to be approximated from single images. Regions can be used to identify particle classes as well: training masks that segment only particles of a given shape, size, or orientation will prompt Bellybutton to do the same. A broad rule of thumb is that if a region is easily identifiable by eye, it is a good candidate for Bellybutton. This class of image segmentation problems is both frustrating and common in research, and we believe that giving users an easy-to-use yet flexible method like Bellybutton will save countless hours in the lab.