Integration

The latest commit to https://github.com/burgerdev/gsoc2014 makes the lazy connected components operator work with input data of arbitrary dimensionality. Initially I thought that it would be a pain to change all the stuff I had carefully crafted to work with just 3d data, but it turned out not to be that bad. Most of the internals use a thingy I called ‘ChunkIndex’ to access data, save states and so on. This index triple is used as a key to arrays and dictionaries, which made it easy to just switch to a quintuple. The only thing that really needed changing was the logic behind ‘generateNeighbours’: time and channel neighbours are simply ignored, so chunks are only ever connected along the spatial axes.
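A minimal sketch of that idea, not the actual code from the repository: the function name, the (t, x, y, z, c) axis order of the index and the choice of face-adjacent neighbours in both directions are assumptions made for illustration.

```python
# Assumed layout of a 5d chunk index: (t, x, y, z, c).
SPATIAL_AXES = (1, 2, 3)


def generate_neighbours(chunk_index, chunk_grid_shape):
    """Yield the chunk indices adjacent to chunk_index along x, y and z.

    chunk_grid_shape is the number of chunks per axis (hypothetical
    parameter). The time and channel entries are copied unchanged, so
    neighbours along t and c are never produced.
    """
    for axis in SPATIAL_AXES:
        for step in (-1, 1):
            position = chunk_index[axis] + step
            # Skip positions that fall outside the chunk grid.
            if 0 <= position < chunk_grid_shape[axis]:
                neighbour = list(chunk_index)
                neighbour[axis] = position
                yield tuple(neighbour)


# The quintuples are plain tuples, so they still work as dictionary keys.
state = {index: "pending"
         for index in generate_neighbours((0, 1, 0, 2, 0), (3, 3, 3, 3, 2))}
```

Because the index stays a hashable tuple, everything that used the triple as a key keeps working with the quintuple.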

With this done, there is not much keeping me from integrating the whole thing into ilastik (lazyflow, in particular). We have not decided yet when and how lazy connected components should be used in the software. It could be set as the default, but we would have to accept performance losses with small datasets and non-sparse objects. Or it could be optional, depending on what the user wants, which would imply heavy GUI work, because the labeling operator is used almost everywhere I look, and it is not certain that ‘the user’ actually knows what they want. The sanest way would probably be to decide automatically (i.e. hard-coded) depending on the input data.

Once the operator is in ilastik, the last thing separating us from having a truly lazy thresholding applet is the applet itself. If I remember correctly, quite a few internal design decisions rely on labeling being a global operation, e.g.

def execute(self, slot, subindex, roi, result):
    # labeling is global anyways, do the whole input at once
    data = self.Input[...].wait()
    newdata = self._handleData(data)
    result[...] = newdata[roi.toSlice()]

The first thing that comes to mind is the ‘OpFilterLabels’ operator, which will most likely have to be rewritten in a lazy fashion.

[...]


More Dimensions

The most recent commits in master are finally thread-safe. At least I hope so. The snake is nicely labeled with a continuous yellow, and everything else seems to work smoothly. I had to make some sacrifices to get this to work, though. First of all, I swapped the Vigra UnionFind for a Python one, because I want it to be thread-safe, and writing a wrapper for the Vigra version seemed like overkill. The other problem I encountered involved locks and lazyflow: when I tried to use an OpCompressedCache instead of a ChunkedArray, I ended up getting deadlocks, and I could not find the reason no matter how hard I tried. These deadlocks show up when launching requests from within critical sections. I assume there must be some special functionality regarding thread management that undermines my locking policy.
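For illustration, a thread-safe union-find in pure Python can be as small as the following sketch. It is not the class from the repository: the method names, the single coarse lock and the rule that the smallest label becomes the representative are all assumptions.

```python
import threading


class UnionFind:
    """Union-find over hashable, orderable labels, guarded by one lock."""

    def __init__(self):
        self._parent = {}
        self._lock = threading.Lock()

    def _find(self, label):
        # Internal helper: the caller must already hold the lock.
        self._parent.setdefault(label, label)
        root = label
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited label directly at the root.
        while self._parent[label] != root:
            next_label = self._parent[label]
            self._parent[label] = root
            label = next_label
        return root

    def find(self, label):
        """Return the representative of the set containing label."""
        with self._lock:
            return self._find(label)

    def union(self, a, b):
        """Merge the sets of a and b; the smaller root stays the root."""
        with self._lock:
            root_a, root_b = self._find(a), self._find(b)
            low, high = min(root_a, root_b), max(root_a, root_b)
            self._parent[high] = low
            return low
```

Holding one lock for the whole of `find` and `union` is the simplest policy that keeps the parent table consistent. It serializes all callers, which is a trade-off against the speed of a native implementation.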

But enough of the past: welcome to the future. In the future we will have more of everything - especially more dimensions. The current operator only supports 3d spatial data, which is a shame. It should be able to treat 4d and even 5d data as well!

The problem with 5d support in ilastik, although ‘problem’ might be too strong a word here, is that in principle every applet and workflow supports 5d data, but you might run into trouble if your datasets are somewhat ill-formed. And I’m not even speaking of the ambiguity of some specific axis orders. We decided a while ago that we want to handle everything as 5d data internally, which was in principle a brilliant decision. You could write new operators and would not have to support anything but 5d txyzc, and ilastik would handle the rest. There is even a wrapping operator in lazyflow that turns old 3d operators into fully functional 5d ones.

But there’s a drawback to this. For some datasets, most of them having many time slices, the loading times went up to hours. And that is graph construction time, not calculation. The solution to this problem is also clear: write 5d operators to start with. The last operators I touched went something like this:

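def execute(self, slot, subindex, roi, result):
    # ntime and nchannels are the extents of the t and c axes (txyzc order)
    for c in range(nchannels):
        for t in range(ntime):
            # request one 3d volume, process it and store it in the output
            data = self.Input[t, ..., c].wait()
            modified = self.treatData(data)
            result[t, ..., c] = modified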

After a while, you memorize this pattern, and just automatically apply it everywhere. And at some point you get frustrated, because you don’t want to write double for loops any more, and procrastinate by writing blog posts.
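One way to avoid retyping the double loop is to generate the per-volume indexing keys once and iterate over them. The helper below is a hypothetical sketch (its name and signature are not part of lazyflow) and it assumes the 5d txyzc axis order mentioned above.

```python
from itertools import product


def volume_slicings(n_time, n_channels):
    """Yield one indexing key per 3d volume of a 5d txyzc array.

    Each key selects a single time slice and channel and keeps the
    full x, y and z extent.
    """
    full = slice(None)
    for t, c in product(range(n_time), range(n_channels)):
        yield (t, full, full, full, c)


# Usage with any container that accepts tuple keys, e.g. a NumPy array:
#     for key in volume_slicings(n_time, n_channels):
#         result[key] = treat(data[key])
keys = list(volume_slicings(2, 2))
```

This only hides the loop, of course. The per-volume requests are still issued one at a time.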

[...]