As I mentioned last time, an undergrad and I have been working on a faster version of the spot detector. I'm glad to say that the code is working really well, and offers something in the neighborhood of a 20-fold speedup on typical input datasets. We launch one thread per z-depth in a timeseries of a z-stack, and process them all in parallel. A typical dataset has around 35 increments in the z direction, so we can achieve some very nice parallelism. You might be asking yourself: if you launch 35 or so threads, why not get a 35-fold speedup? We can't quite achieve that because you can never parallelize everything, and whatever still has to run sequentially puts a ceiling on the overall speedup. The big sequential factor in our case is reading the file input: the hard drive (or whatever) doesn't sprout new output cables on demand, unfortunately.
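To put a rough number on that, here is a small sketch using Amdahl's law, the standard idealized model for this situation. It asks what sequential fraction would explain a 20-fold speedup on 35 threads. The function names are made up for illustration, and the model ignores thread overhead and contention, so treat the result as a ballpark figure rather than a measurement of our program.

```python
def amdahl_speedup(serial_fraction, workers):
    """Ideal speedup under Amdahl's law for a given sequential fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def implied_serial_fraction(speedup, workers):
    """Invert Amdahl's law: the sequential fraction that yields `speedup`."""
    return (workers / speedup - 1.0) / (workers - 1.0)


# Figures from the post: roughly 35 threads, roughly a 20-fold speedup.
s = implied_serial_fraction(speedup=20.0, workers=35)
print(f"implied sequential fraction: {s:.3f}")
print(f"round trip through the model: {amdahl_speedup(s, 35):.1f}x")
```

Under this model, a sequential share of only about 2% of the original runtime is enough to pull 35 threads down to a 20-fold speedup.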
Actually, it's a little worse than that. The custom library that we use in our research group was begun in the '80s, back when parallel computing was only for the very elite. Its file IO functions employ what we call "static variables" in a thin software layer that makes sure library users are using and releasing IO resources responsibly. It's sort of a "nanny," and it really does help rapid software development to have the library automatically check up on your application code. Unfortunately, the nanny cannot tolerate multiple threads. It is the conceptual equivalent of a babysitter who assumes they're supposed to watch over one kid, when in fact there are several indistinguishable clone children in the same house. Confusion would ensue.
Code like this is called "thread-unsafe," although that's a term with no strict definition. It took a surprisingly long time for us to remember that our library's file IO was not thread-safe, since (so to speak) the nanny never said "I'm confused," but instead would just blow up every once in a while. Here the human metaphor fails: the program would crash, rarely and unpredictably, and we didn't know why. Eventually we realized we were calling thread-unsafe code when we loaded the images.
There was still a little nagging worry in my mind that our program might, somehow, be calling other thread-unsafe code elsewhere in the library, but after a few weeks of stress-testing (for example, using FAR too many threads, and running the program on ALL datasets), I'm feeling more confident that we've got it solid now.
Salika is eager to try to push the spot detector even further, and now we are exploring a signal processing technique known as separable kernels, which might offer a performance improvement. This one is not a slam-dunk; the potential for performance improvement depends on one parameter choice made by the module's end user. We honestly don't know whether any speedup is possible, let alone how much. So, she is going to have to do some investigation. However, we are following the principle of optimizing the bottlenecks: the signal-processing step in question really does take a significant chunk of the overall spot-detection time. If we can speed it up by a factor of two (or ten or twenty) then you'll really notice the effect. I'll have to let you know what we find.
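For readers who haven't met separable kernels: when a 2D filter kernel can be written as the outer product of two 1D kernels, you can filter the rows with one and the columns with the other and get the same result as the full 2D filter. For a k-by-k kernel, that drops the work from about k*k multiplications per pixel to about 2*k. The sketch below checks the equivalence in plain Python on a tiny image. The [1, 2, 1] kernel, the zero padding at the edges, and all the function names are assumptions chosen for illustration, not the filter our spot detector actually uses.

```python
def filter_rows(image, k):
    """Apply 1D kernel k along each row, treating out-of-range pixels as zero."""
    r = len(k) // 2
    w = len(image[0])
    return [
        [sum(k[j] * row[x + j - r] for j in range(len(k)) if 0 <= x + j - r < w)
         for x in range(w)]
        for row in image
    ]


def transpose(image):
    return [list(col) for col in zip(*image)]


def separable_filter(image, kx, ky):
    """Row pass with kx, then column pass with ky."""
    tmp = filter_rows(image, kx)
    return transpose(filter_rows(transpose(tmp), ky))


def full_filter(image, kernel2d):
    """Direct 2D filter with the same zero padding, for comparison."""
    h, w = len(image), len(image[0])
    ry, rx = len(kernel2d) // 2, len(kernel2d[0]) // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for i, krow in enumerate(kernel2d):
                for j, kv in enumerate(krow):
                    yy, xx = y + i - ry, x + j - rx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kv * image[yy][xx]
            out[y][x] = acc
    return out


kx = ky = [1, 2, 1]                              # small smoothing kernel (assumed)
kernel2d = [[a * b for b in kx] for a in ky]     # outer product of ky and kx
image = [
    [1, 2, 3, 4, 5],
    [0, 1, 0, 1, 0],
    [2, 2, 2, 2, 2],
    [5, 4, 3, 2, 1],
]
same = separable_filter(image, kx, ky) == full_filter(image, kernel2d)
print("two 1D passes match the full 2D filter:", same)
```

With a 3-wide kernel the saving is modest (6 multiplications per pixel instead of 9), and it grows as the kernel gets wider, which is one reason the achievable speedup can depend on how the filter is configured.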