I think we can at last call the multithreaded spot detector finished. In the last month of her Honors Thesis work, Salika Dunatunga and I labored hard to accelerate the convolution step, and it is working great. This speedup is in addition to the earlier-reported speedup from using multiple threads to parallelize the detection task. The final result is, I have to say, really nice -- we get in the neighborhood of a 20x acceleration in Bisque of the detector phase. What once took 25-30 minutes now takes about one minute.

#### Smarter convolution

Convolution is really central to the spot detector, and accelerating that portion turned out to be relatively complicated. But let me first give some context. Convolution is a fundamental signal-processing operation; in our case, we use it to deliberately blur the image. (Why? Because an appropriate amount of blur makes tiny sprinkles of noise disappear. It hides irrelevant jitter we wish to ignore.) Computationally, convolution is like a number of famous processes (for example, sorting) in that there are dumb and smart ways to do it. The original version of the software was not just single-threaded, it was doing convolution the dumb way -- the most straightforward algorithmic interpretation of the mathematical definition -- and consequently, it was slow. The abovementioned time was mostly spent on convolution.

The first smart way to convolve is to exploit the natural *separability* of the Gaussian blurring operation. For mild, light quantities of blur, this method works great. It doesn't work with all forms of convolution, but it does apply for Gaussian blur. Unfortunately, as the size of the blurring kernel increases (i.e., when you want an even more unfocused result) the computational demands grow quadratically. That's what the blue line in the graph below shows: sigma represents the "heaviness" of the blur. If you want a really heavy blur, it's not the best algorithm (in this graph, up is bad and down is good).

There's another way to perform convolution, and it revolutionized the field of signal processing when it was discovered (or, rediscovered) in 1965 by Cooley and Tukey. It turns out there is a fast way to compute something called the discrete Fourier transform. That was a revolutionary discovery because if you so "transform" an image and a convolutional kernel into their Fourier representations, some very simple (read: *fast*) math yields a signal equal to . . . the transform of the convolved image! Ok, I realize that doesn't sound very exciting, but trust me, it is a big deal. Thus, by a somewhat byzantine-seeming series of operations, one can perform convolution. The tortuous path one must follow means there is extra overhead cost, but for heavy blur it is worth it: fast Fourier transform (FFT) convolution gets more costly with more blur, but the growth rate is smaller than a separable implementation. The orange line in the above graph shows the speed of Fourier-transform-based convolution with different quantities of blur.

As you can see on the above graph, the crossover point between the two methods is around a 24-pixel value for sigma in a Gaussian blur kernel, for our sort of system inputs. *Around twenty-four pixels* -- that is a very useful quantity to know! Because when the user requests a certain amount of blur, now we can select the faster of the two available methods: is it over 24 or not? The graph itself represents a great deal of work by Salika. That 24-pixel threshold depends on a number of factors and reasonable assumptions, and it might need to be revisited one day. Even if it is not exact, anyway, in the neighborhood of 24 pixels it is fair to expect that the two methods are roughly equal-performing, and that's all the empirical evidence we need to choose a good convolution method.

#### Thread-safe FFT convolution

While Salika was investigating the above phenomena, we painfully discovered our FFT convolution code was incompatible with our parallelized code. In other words, the FFT convolver we planned to use was not thread-safe. I've written before about empirically discovering that code is thread-unsafe, and it's always a disappointment. This threatened our grand plan of choosing the best convolution method based on blur-size. I set about seeking an thread-safe alternative. What I found was that the well-known library known as FFTW *does* offer limited re-entrant behavior, but only on its transform methods. None of its support methods! So I wrapped up the support methods in a nice class, figured out a good re-entrant interface for convolution using FFTW, tested it thoroughly, and integrated it into the spot detector. This was a lot of work, but the details aren't so interesting to recount. Anyway, the current revision of the spot detector now supports multi-threading both convolution implementations, which I think is impressive. One nettlesome unanswered question concerns the crossover point: as you might have perceived, the current libFFTW-based convolution is somewhat different from the implementation used to generate the above graph -- but I trust the performance difference is modest.

The multithreaded spot detector also needed a bit of spring cleaning. The source code had just undergone some significant remodeling, and there was the software equivalent of debris scattered here and there for me to tidy up. Another analogy would be that of a book editor. The manuscript was finished, but some paragraphs were too long, some didn't make perfect sense anymore in the context of other changes, and so on. For the past few days I've been identifying trouble-spots: places where the program could fall into an infinite loop (threads are tricky this way), clunky blocks of logic to be streamlined, gaps where potential error codes could go unnoticed -- cracks in the hull, you might say. I think the program is much more water-tight and seaworthy than it was a month ago. I've been testing it continuously while making changes, of course, and it seems ready to be tried out in production.