I have been working on a project and running into a very difficult problem.
The problem can be stated simply: how do you unit-test numerical algorithms?
However, if you just took this simple statement, you would get the standard set of answers: isolation of inputs, random processes, mocking, tolerancing, etc. The answer (if there is one) does not lie with these.
Also, as a note, I am not concerned with testing the correctness of the algorithm. That is done with very different methods, and here it is assumed that the algorithm is correct. What I am using unit testing for is to make sure that changes in other parts of the system/support algorithms do not change the output of the algorithm under test.
So, for a description of the problem:
We have an algorithm which takes gigabits of data, processes them, and generates an image with a corresponding map of which pixels are good and which are not. It is this map which I need to verify does not change; pixels which are not valid are not to be used in any way other than the fact that they are invalid.
If this were all that was given, I would have many possible solutions, and in fact the method I have been using for years is to look at the number of valid pixels and the average, standard deviation, peak, and valley of the image. This has proven to be a very sensitive detector for changes.
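The summary-statistics check above can be sketched as follows (a minimal illustration; the names `image_fingerprint` and `fingerprints_match` and the flat-list pixel layout are my own, not the actual system's, and the tolerance values are placeholders):

```python
import math

def image_fingerprint(pixels, valid):
    """Summary statistics used as a change detector: count of valid
    pixels plus mean, standard deviation, peak, and valley of the
    valid part of the image. `pixels` and `valid` are parallel lists."""
    vals = [p for p, ok in zip(pixels, valid) if ok]
    n = len(vals)
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / n
    return {
        "valid_count": n,
        "mean": mean,
        "stddev": math.sqrt(var),
        "peak": max(vals),
        "valley": min(vals),
    }

def fingerprints_match(a, b, rel_tol=1e-6):
    """Compare two fingerprints: the valid-pixel count must match
    exactly, the floating-point statistics only within a tolerance."""
    if a["valid_count"] != b["valid_count"]:
        return False
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol, abs_tol=1e-12)
               for k in ("mean", "stddev", "peak", "valley"))
```

A regression test would store a reference fingerprint and assert `fingerprints_match(reference, current)` after each change. Note that this sketch already shows why the complications below hurt: the exact `valid_count` comparison breaks as soon as borderline pixels flip between machines.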
But this year has added complications which are causing issues with this approach:
- The number of threads being used has increased, causing changes in the floating-point behavior. In the past, the smaller number of threads was accounted for in the tolerances used.
- We have been using single-precision floats (as compared to doubles) to increase throughput. This has changed the rounding behavior even more, and combined with the number of threads it makes things worse. It has also been made worse as we move to SIMD processing such as GPU and/or AVX instruction sets.
- We have been enhancing the algorithms to recover more of the pixels which in the past would have been marked as invalid. Combined with 1 and 2 above, this results in some pixels being included on one machine and not on another. Since these are on the edge of usability, the differences are handled by the algorithms which use this image and are not of concern for the system as a whole, but they are causing problems with unit testing the algorithms.
- We have also recently run into the problem that two numbers (internal to the calculation) are equal, and because of the rounding behavior one or the other is picked, resulting in different outputs. Again, this is not a problem for the system as a whole, only for unit testing that flags differences.
- As a minor side example, we also have other non-linear algorithms which, due to rounding behavior, can give different outputs. Since these algorithms are self-correcting, the final solution is good but has a higher variability than can easily be accounted for unless the tolerances are opened wider than desired.
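The thread-count and precision effects in the first two bullets come down to the fact that floating-point addition is not associative, so a different partitioning of the data across threads (or SIMD lanes) changes the rounding. A tiny self-contained illustration using plain doubles (the effect is stronger still in single precision):

```python
# Floating-point addition is not associative: summing the same three
# numbers in two orders (as two different thread partitionings would)
# gives two different answers.
a, b, c = 1e16, 1.0, 1.0

left_to_right = (a + b) + c  # the 1.0 is absorbed both times: result is 1e16
pairwise      = a + (b + c)  # b + c == 2.0 is big enough to survive

print(left_to_right == pairwise)  # prints False
```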
So, my question is how to account for these issues in unit testing.
I strongly need unit testing because we have a team of programmers working on the system, and we need the ability to detect when someone makes a change which unknowingly impacts other algorithms. Yes, this has happened in the past, and the method in use at the time caught the change before it reached the final product. As the primary algorithm developer I could use your help. I have had many ideas, but they don't work for all cases, and it's not clear when to use which method.
Edit: I know that this is frequently called regression testing, but I also know that it is often generically called unit testing by many people, even if that's not completely correct.
I am not necessarily looking for a magic bullet (that would be nice), but I do need a way of handling these issues which can be applied by others in the group who may not be as knowledgeable on the details. If that is "pick one of these methods", that is fine as long as the rules are clear. We do adjust the parameters when we change the algorithms, but once they are fixed the floating-point behavior still causes differences. When we see differences we always investigate why, and in general we try to adjust the algorithm to reduce the difference, but there are times where this is not possible because of the additional computation time. If we stayed away from the limits this would not be an issue (as in the past), but to advance the technology we need to push the limits, giving rise to the issues described.
You can see this in a simple example: calculating the average of many numbers. If you stay with numbers which are close to each other, there are simple methods to make sure the results are accurate. But once you allow numbers with large (orders of magnitude) differences, together with threading, float compared to double, etc., you can get very different results. Yes, I know there are algorithms which can be used to fix these issues, but they can be very computationally expensive. While averaging is not my problem, you can imagine that the differences don't impact downstream processing but you still want to write a test to monitor the calculation.
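The averaging example can be made concrete. Naive accumulation loses small terms next to large ones, while compensated (Kahan) summation recovers them at the cost of a few extra operations per element, which is exactly the compute-versus-accuracy trade-off mentioned above. A sketch:

```python
def naive_sum(xs):
    s = 0.0
    for x in xs:
        s += x
    return s

def kahan_sum(xs):
    """Compensated (Kahan) summation: carry the rounding error of each
    addition in `c` and feed it back into the next term."""
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y  # the part of y that was lost when forming t
        s = t
    return s

data = [1e16, 1.0, 1.0, -1e16]  # true sum is 2.0
print(naive_sum(data))  # 0.0 -- both 1.0 terms are absorbed by 1e16
print(kahan_sum(data))  # 2.0
```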
First of all, it is not very clear whether you compare floating-point numbers with exact values in your unit tests. If you do, this is wrong, and you should rely on approximate comparison of floating-point numbers.
That being said, if your computations are sensitive to factors such as the number of threads involved, the precision of the numbers used, and the other things you describe, then you should question the stability of your numerical procedures. If you are lucky, you can rewrite some calculations to improve the stability without significantly changing the number of operations required. But it might also happen that writing better procedures involves more calculations, so you will have to find an acceptable trade-off between time and precision.
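As a concrete instance of rewriting for stability at roughly the same operation count: the textbook one-pass variance (mean of squares minus square of the mean) cancels catastrophically when the data sit on a large offset, while Welford's one-pass update does not. This is a generic illustration, not the asker's actual calculation:

```python
def naive_variance(xs):
    """One-pass E[x^2] - E[x]^2: prone to catastrophic cancellation."""
    n = len(xs)
    s = sum(xs)
    sq = sum(x * x for x in xs)
    return sq / n - (s / n) ** 2

def welford_variance(xs):
    """Welford's one-pass update: comparable cost, far more stable."""
    mean = 0.0
    m2 = 0.0
    for i, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / i
        m2 += delta * (x - mean)
    return m2 / len(xs)

data = [1e8 + 1.0, 1e8 + 2.0, 1e8 + 3.0]  # true variance is 2/3
print(naive_variance(data))    # badly wrong due to cancellation
print(welford_variance(data))  # close to 0.6666...
```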
For instance, you can start here: http://www.gnu.org/software/gsl/manual/html_node/Approximate-Comparison-of-Floating-Point-Numbers.html
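A minimal version of such an approximate comparison looks like this. It is a simplified sketch in the spirit of the linked GSL routine (which compares relative to the operands' exponents), not the library function itself; the absolute floor for values near zero is an arbitrary choice you would tune:

```python
def approx_equal(x, y, rel_eps):
    """Return True if x and y agree to relative accuracy rel_eps,
    scaling the tolerance by the larger magnitude of the operands,
    with a floor of 1.0 so comparisons near zero still work."""
    scale = max(abs(x), abs(y), 1.0)
    return abs(x - y) <= rel_eps * scale
```

Python's standard library offers the same idea ready-made as `math.isclose(x, y, rel_tol=..., abs_tol=...)`.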