Lazy Connected Components

As you may or may not know, I am currently contributing to a fine piece of software named lazyflow, a Python framework for data flow. Its most basic elements are so-called Operators, which are abstractions for various image processing steps, like computing an edge image or thresholding. These operators are chained together to form a Graph, which represents the whole image processing pipeline.

This seems straightforward, but bear with me: the interesting stuff comes with the lazy part. In our business we are not satisfied with merely analyzing some Facebook pics; the call for big data is heeded in bio-imaging as well. Consider a random insect's brain, which could measure about 6mm^3. If we scan this brain, slice by slice, with voxels about 500nm^3 in size, we get a cube with a side length of approximately 212000 pixels. Even if we were using just 8-bit grayscale data, we would have 8.5 Petabytes at our hands, a few more Blu-ray discs than I can fit into my cupboard (350000 more, to be precise). The world's most powerful supercomputer has 1 Petabyte of memory, for comparison, so we need to chop that huge volume into tiny pieces and apply our algorithms to each of them.

That's where lazyflow comes into play. With the chained operators you just need to request a specific ROI from the end of the chain, and only the data required for that ROI will be processed in all steps. Some operators, like image filters, need a halo around the actual volume to calculate their results precisely, and thus the ROI tends to grow as it is passed on towards the very first operator in the graph.
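To illustrate the halo idea, here is a minimal sketch of how a filter might grow a requested ROI before passing it upstream. This is a hypothetical helper for illustration, not the actual lazyflow API:

```python
def expand_roi(start, stop, halo, shape):
    """Grow a ROI by `halo` pixels on each side, clipped to the volume bounds.

    `start` and `stop` are per-axis coordinates, `shape` is the shape of the
    full volume. Hypothetical helper, for illustration only.
    """
    new_start = [max(0, a - halo) for a in start]
    new_stop = [min(bound, b + halo) for b, bound in zip(stop, shape)]
    return new_start, new_stop

# a filter with a 2-pixel halo turns the ROI [10:20, 10:20] into [8:22, 8:22]
print(expand_roi([10, 10], [20, 20], 2, [100, 100]))
```

The clipping at the volume bounds is what keeps the request valid near the edges of the dataset; each operator in the chain can apply such an expansion before forwarding the request.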

A real showstopper in this graph layout would be an operator that needs not just a simple halo around the requested ROI, but the whole volume to be analyzed at once. This would keep us from analyzing the poor insect's brain as long as we have no supercomputer that can handle the volume in one go.

You might have guessed it already: this operator exists, right now, in the lazyflow framework. It's the connected components operator, aka the labeling operator, which I had the pleasure of rewriting some weeks ago. But don't blame me for it - it turns out that labeling an image is an inherently global procedure.

The process called labeling is a key component in image processing. Let's say you have an image of a cake, like the one below (source: Wikipedia user Mikelo).

[Image: cake_small]

You desperately want to track the cherries on the cake, for reasons beyond my imagination. So you train a classifier that separates cherries from non-cherries and gives you a probability map.

[Image: cherries_pred]

A reasonable next step would be to say 'OK, I will assume that all regions where the classifier's certainty is above 90% are cherries' and produce a binary cherry map. I admit it does not look that different from the prediction, but that just means the classifier is very confident about its predictions.

[Image: cherries_thresh]
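In code, the thresholding step is essentially a one-liner. Here is a toy sketch with NumPy; the tiny probability map is made up for illustration, a real one would come from the classifier:

```python
import numpy as np

# toy probability map from a hypothetical cherry classifier (values in [0, 1])
prob = np.array([[0.05, 0.95, 0.10],
                 [0.92, 0.98, 0.08],
                 [0.03, 0.07, 0.99]])

# binary cherry map: everything above 90% certainty counts as cherry
binary = (prob > 0.9).astype(np.uint8)
print(binary)
```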

Now you've got all cherries nicely segmented. But wait — we wanted to track single cherries, which means that we need to assign each cherry a unique id. That's what the labeling operator is supposed to do. In a nutshell, it identifies the connected components (regions of connected pixels that share the same gray value) and labels each component with a unique id.

[Image: cherries_labeled]
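The labeling step itself is available in several libraries. As a quick sketch (using scipy.ndimage here rather than the vigra routines lazyflow builds on, and a toy binary image standing in for the thresholded cherry map):

```python
import numpy as np
from scipy import ndimage

# toy binary map with two separate foreground objects ("cherries")
binary = np.array([[1, 1, 0, 0],
                   [1, 0, 0, 1],
                   [0, 0, 1, 1]], dtype=np.uint8)

# each connected component gets a unique id from 1..num
labeled, num = ndimage.label(binary)  # default 4-neighbourhood in 2D
print(num)  # two objects
print(labeled)
```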

To make the ids unique over the whole volume, you need to examine the whole volume; otherwise you end up with repeated ids, or with cherries that are split across subimages. Consider this worst case (source: Flickr user squamatologist):

[Image: snake]

The snake is clearly separable from the background, but the one snake-object extends over the whole volume. This is why the operator naively assumes that it always needs the whole volume to calculate the ids.

Of course there are already some ways around the memory problem. One is to analyze the volume block by block first, and merge the duplicate labels afterwards. This is done in Thorben's repo blockedarray, for which I wrote some Python wrappers and an adaptation to the vigra library. The process itself remains global, though.
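A minimal 1D sketch of that block-then-merge idea (this is not the actual blockedarray code, and it uses scipy.ndimage in place of vigra): label each block independently with a per-block label offset, then use union-find to merge labels that touch across block boundaries:

```python
import numpy as np
from scipy import ndimage

def label_blockwise(binary, block_len):
    """Label a 1D binary array block by block, then merge labels that
    touch across block boundaries. Illustrative sketch only."""
    n = len(binary)
    labels = np.zeros(n, dtype=np.int64)
    offset = 0
    # 1) label each block independently, shifting ids by a per-block offset
    for start in range(0, n, block_len):
        lab, num = ndimage.label(binary[start:start + block_len])
        lab = lab.astype(np.int64)
        lab[lab > 0] += offset
        labels[start:start + block_len] = lab
        offset += num
    # 2) union-find over labels that touch at block boundaries
    parent = list(range(offset + 1))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for start in range(block_len, n, block_len):
        a, b = int(labels[start - 1]), int(labels[start])
        if a > 0 and b > 0:
            parent[find(a)] = find(b)
    # 3) relabel with the merged ids
    return np.array([find(int(x)) if x else 0 for x in labels])

# three components survive the merge, although blockwise labeling found four
print(label_blockwise(np.array([1, 1, 0, 1, 1, 1, 0, 1]), 4))
```

The same scheme generalizes to 2D/3D blocks; the merge step then walks the block faces instead of single boundary pixels.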

For the cases where we don't have a snake wiggling through the image, but cherries that can be operated on blockwise, there is still hope. We can start with the block in question and check whether any objects touch its boundary. If so, we extend our ROI and try again. In the worst case this procedure is still global, but on average we get away with much less computation. This would enable some nice advanced uses of lazyflow, like real object detection, which generally involves large datasets and requires the labeling operator.
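Again as a 1D sketch with a made-up helper name, the grow-and-retry loop could look like this: label the requested ROI, and as long as any object touches the ROI boundary, enlarge the ROI and label again:

```python
import numpy as np
from scipy import ndimage

def label_lazily(binary, start, stop, grow=16):
    """Label only the requested 1D ROI [start:stop], growing it while any
    object touches the ROI boundary (worst case: the whole volume).
    Hypothetical sketch, not the lazyflow operator itself."""
    n = len(binary)
    while True:
        labeled, _ = ndimage.label(binary[start:stop])
        touches_left = start > 0 and labeled[0] > 0
        touches_right = stop < n and labeled[-1] > 0
        if not (touches_left or touches_right):
            return labeled, (start, stop)
        # an object crosses the ROI boundary: extend the ROI and retry
        start = max(0, start - grow)
        stop = min(n, stop + grow)

# one object at [10:30); a request for [25:60) grows until the object fits
binary = np.zeros(100, dtype=np.uint8)
binary[10:30] = 1
labeled, bounds = label_lazily(binary, 25, 60)
print(bounds)
```

For a snake-like object the loop degenerates to labeling the whole volume, but for isolated cherries it typically stops after one or two extensions.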

This problem needs to be solved soon for projects like ilastik to become really powerful, and I am looking forward to contributing my solutions. In fact, I have applied for GSoC and proposed this topic to lmonade, the organization that hosts the lazyflow project. Stay tuned for more merry tales from the world of scientific Python programming!