Yay, I got accepted into the Google Summer of Code project ‘Lazy Connected Components’! The official start of development will be on May 17, but I have lectures to visit and excercises to do, so I decided to extend the community bonding phase and do some real development. In fact I feel quite bonded to the HCI already, so I guess this is fine.

For my first steps I created a repo outside ilastik / vigra so I can do some experimenting. In the latter stages of the project my code will have to be ported from there to the respective project’s git repository though.

Here’s a little sketch of what my plans are for tackling the lazy connected components problem. I am going to explain things in 2d, but the extension to arbitrary dimensions should be clear.

Suppose we got an input image X, and a user wants a ROI of that image labeled. As I have explained in the initial post, the labels should be consistent if the user requests some other ROI in the future. This means we have to make sure that all objects (connected components) which are visible in the current ROI have to be processed in this step. Consider some horse shoe shaped object: If I label the one end red, the other end independently blue, and later I find out that red and blue are actually the same. I cannot tell the user ‘Hey, I got the labeling wrong, just think of blue as red’.

We dubbed the process of extending the region such that all objects are within the extension as ‘region growing’. The algorithm would look something like this:

Given a labeled chunk and a set of labels to finalize (initially this would be all labels in the chunk)

  • for every adjacent chunk
    • label classically
    • find objects with labels to be finalized that are on the chunk boundary
    • merge the global labels for these objects
    • repeat from the beginning with the neighbour chunk and the labels that where merged

Someone could state that this looks pretty much like some classical connected components algorithm, and indeed the only part that differs is the ‘keep track of labels’ part. If we look at the problem from a graph theory point of view (that’s where the term ‘connected components’ originated, I guess), we can interpret the image as a grid graph and compute spanning trees. One could argue that the problem is essentially the same for chunks. We can take each chunk as a node and compute spanning trees for them as well. The difference is that connectedness in terms of chunks is not transitive: If ROI A is connected to ROI B via object 1 and ROI B is connected to ROI C via object 2, A and C are in general not connected. That’s why we need to keep a list of labels that we are actually interested in, or we would most likely label the whole image every time.

There is already a good amount of python code and tests that implements this strategy in my repo. The probem is split into these parts:

  • classical labeling of chunks
  • merging of adjacent chunks
  • region growing

The original idea was that only the latter should be implemented in python, but as it turns out the merge function is a three-liner which can be handled with numpy quite efficiently. Which means that the lazyflow operator will be all python (except the UnionFind structure from vigra). For the later stages of the project, Ulli and I agreed on implementing a vigra version of the lazy connected components algorithm which will make use of the new ChunkedArrays. And it will be implemented in terms of general graphs, which is a prospect that I really enjoy!