C# – Unit testing for a scientific computing library


I've had a bit of experience with unit testing before, in what I call (not pejoratively) the classic software engineering project: an MVC app with a user GUI, a database, business logic in the middle layer, and so on. Now I'm writing a scientific computing library in C# (yeah, I know: C# is too slow, use C, don't reinvent the wheel, and all of that, but we have a lot of people in my faculty doing scientific computation in C#, and we sort of need it). It's a small project by software-industry standards, because I'm writing it mostly by myself, with help from a few colleagues from time to time. Also, I don't get paid for it, and most importantly, it's an academic project. I do expect it to reach professional quality some day, because I'm planning to go open source, and hopefully, given enough time, it will grow a community of developers.

Anyway, the project is getting big (around 18,000 lines of code, which I think is a lot for a one-man project), and it's getting out of hand. I'm using git for source control, and I think I have that part reasonably under control, but my testing is old school: I write full console applications that exercise a big part of the system, mainly because I have no idea how to do unit testing in this scenario, although I feel that's what I should be doing. The problem is that the library consists mostly of algorithms: graph algorithms, classifiers, numerical solvers, random distributions, and so on. I just don't know how to specify tiny test cases for each of these algorithms, and since many of them are stochastic, I don't know how to validate correctness. For classification, for instance, there are metrics like precision and recall, but these metrics are better for comparing two algorithms than for judging a single one. So how can I define correctness here?

Finally, there is also the problem of performance. I know it's a whole different set of tests, but performance is one of the important features of a scientific tool, rather than user satisfaction or other software engineering metrics.

One of my biggest problems is with data structures. The only test I can come up with for a kd-tree is a stress test: insert a lot of random vectors, then perform a lot of random queries and compare the results against a naive linear search. The same goes for performance. And with numerical optimizers, I have benchmark functions I can test against, but then again, that is a stress test too. I don't think these tests can be classified as unit tests, and most importantly, they can't run continuously, since most of them are rather heavy. But I also think these tests need to be done; I can't just insert two elements, pop the root, and declare that it works for the 0-1-n case.
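Sketched out, the kind of comparison test I mean looks roughly like this (using NUnit; the KdTree type and its Insert/NearestNeighbour methods are placeholders for the actual data structure):

```csharp
using System;
using System.Linq;
using NUnit.Framework;

[TestFixture]
public class KdTreeComparisonTests
{
    [Test]
    public void NearestNeighbour_AgreesWithLinearScan()
    {
        var rng = new Random(42);               // fixed seed: the test is repeatable
        var points = Enumerable.Range(0, 1000)
            .Select(_ => new[] { rng.NextDouble(), rng.NextDouble(), rng.NextDouble() })
            .ToArray();

        var tree = new KdTree(dimensions: 3);   // placeholder API
        foreach (var p in points)
            tree.Insert(p);

        for (int i = 0; i < 100; i++)
        {
            var query = new[] { rng.NextDouble(), rng.NextDouble(), rng.NextDouble() };

            // The naive linear scan serves as the trusted "oracle" implementation.
            var expected = points.OrderBy(p => Distance(p, query)).First();
            var actual = tree.NearestNeighbour(query);   // placeholder API

            Assert.That(Distance(actual, query),
                        Is.EqualTo(Distance(expected, query)).Within(1e-12));
        }
    }

    private static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
}
```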

So, what, if any, is the (unit) testing approach for this kind of software? And how do I organize the unit tests and the heavy ones around the code-build-commit-integrate cycle?

Best Answer

I would say that scientific computing is actually quite well-suited for unit testing. You have definite inputs and outputs, clearly-defined pre- and postconditions that probably won't change around every other week according to some designer's whim, and no hard-to-test UI requirements.

You name some elements that might cause trouble; here's what to do about them:

  • randomized algorithms: there are two possibilities. If you actually want to test the randomization itself, just schedule a big number of repetitions and assert that the expected proportion of cases meets the desired criterion, with big enough error margins that spurious test failures are quite rare. (A test suite that unreliably signals phantom bugs is much worse than one that doesn't catch every conceivable defect.) Alternatively, use a configurable random source and replace the system clock (or whatever seed source you use) with a deterministic one via dependency injection, so that your tests become fully predictable (the first sketch after this list shows both ideas).
  • algorithms defined only in terms of precision/recall: nothing stops you from putting in a whole set of input cases and measuring precision and recall across all of them; it's just a question of semi-automatically generating such test cases efficiently, so that providing the test data doesn't become the bottleneck to your productivity. Alternatively, specifying some judiciously chosen input/output pairs and asserting that the algorithm computes exactly the desired output can also work, if the routine is predictable enough (see the second sketch after this list).
  • non-functional requirements: if the specification really gives explicit space/time requirements, then you basically have to run entire suites of input/output pairs and verify that the resource usage conforms approximately to the required usage pattern. The trick here is to calibrate your own test class first, so that you don't measure ten problems of different sizes that all end up being too fast to measure, or that take so long that running the test suite becomes impractical. You can even write a small use case generator that creates test cases of different sizes, depending on how fast the CPU is that the suite runs on (the third sketch after this list shows the calibration idea).
  • fast- and slow-running tests: whether they're unit or integration tests, you often end up with a lot of very fast tests and a few very slow ones. Since running your tests regularly is very valuable, I usually go the pragmatic route and separate everything I have into a fast and a slow suite, so that the fast one can run as often as possible (certainly before every commit), and never mind whether two tests 'semantically' belong together or not (the third sketch also shows tagging a test as slow so it can be filtered out of the fast run).
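
To make the randomization point concrete, here is a minimal sketch, assuming a hypothetical StochasticHillClimber class that takes its Random source through the constructor; the first test pins the seed for full determinism, the second asserts a proportion with a generous margin:

```csharp
using System;
using NUnit.Framework;

public class StochasticHillClimber          // hypothetical algorithm under test
{
    private readonly Random _rng;
    public StochasticHillClimber(Random rng) => _rng = rng;

    public double Maximize(Func<double, double> f, double start, int steps)
    {
        double best = start;
        for (int i = 0; i < steps; i++)
        {
            // Propose a small random move and keep it only if it improves f.
            double candidate = best + (_rng.NextDouble() - 0.5);
            if (f(candidate) > f(best)) best = candidate;
        }
        return best;
    }
}

[TestFixture]
public class StochasticHillClimberTests
{
    [Test]
    public void Maximize_WithFixedSeed_IsDeterministic()
    {
        // Same seed -> same random sequence -> fully reproducible result.
        var first  = new StochasticHillClimber(new Random(123)).Maximize(x => -x * x, 5.0, 1000);
        var second = new StochasticHillClimber(new Random(123)).Maximize(x => -x * x, 5.0, 1000);
        Assert.That(second, Is.EqualTo(first));
    }

    [Test]
    public void Maximize_FindsApproximateOptimum_MostOfTheTime()
    {
        // Statistical variant: assert a proportion with a generous margin,
        // so that spurious failures stay rare.
        int successes = 0;
        for (int seed = 0; seed < 100; seed++)
        {
            double result = new StochasticHillClimber(new Random(seed)).Maximize(x => -x * x, 5.0, 2000);
            if (Math.Abs(result) < 0.1) successes++;
        }
        Assert.That(successes, Is.GreaterThanOrEqualTo(90));
    }
}
```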
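For the precision/recall case, a sketch along these lines could work; SpamClassifier, TestData.LoadTrainingSet and TestData.LoadHeldOutSet are hypothetical placeholders for the real classifier and fixture data:

```csharp
using System.Linq;
using NUnit.Framework;

[TestFixture]
public class ClassifierQualityTests
{
    [Test]
    public void PrecisionAndRecall_OnReferenceSet_StayAboveFloor()
    {
        var (trainX, trainY) = TestData.LoadTrainingSet();   // hypothetical fixture data
        var (testX, testY)   = TestData.LoadHeldOutSet();

        var classifier = new SpamClassifier();                // hypothetical API
        classifier.Train(trainX, trainY);

        bool[] predictions = testX.Select(classifier.Predict).ToArray();

        int truePositives = predictions.Zip(testY, (p, y) => p && y).Count(b => b);
        int predictedTrue = predictions.Count(p => p);
        int actualTrue    = testY.Count(y => y);

        double precision = (double)truePositives / predictedTrue;
        double recall    = (double)truePositives / actualTrue;

        // Floors chosen well below what the current implementation achieves,
        // so the test only fails on a real regression, not on noise.
        Assert.That(precision, Is.GreaterThanOrEqualTo(0.80));
        Assert.That(recall,    Is.GreaterThanOrEqualTo(0.70));
    }
}
```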
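And for the performance and slow-suite points, a sketch of a calibrated timing test tagged as slow; Array.Sort merely stands in for whichever routine is actually being measured:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using NUnit.Framework;

[TestFixture]
public class SolverPerformanceTests
{
    // Tagged so it can be excluded from the fast pre-commit run,
    // e.g. something like: dotnet test --filter TestCategory=Slow
    [Test, Category("Slow")]
    public void RunningTime_ScalesRoughlyAsExpected()
    {
        // Calibrate: grow the problem until one run takes a measurable amount of time.
        int n = 1 << 10;
        while (Measure(n) < TimeSpan.FromMilliseconds(50))
            n *= 2;

        double t1 = Measure(n).TotalMilliseconds;
        double t2 = Measure(2 * n).TotalMilliseconds;

        // For an O(n log n) routine, doubling n should cost a bit more than 2x;
        // a ceiling of 3x is generous yet would still flag an accidental O(n^2),
        // which would come in close to 4x.
        Assert.That(t2 / t1, Is.LessThan(3.0));
    }

    private static TimeSpan Measure(int size)
    {
        var rng = new Random(7);
        var data = Enumerable.Range(0, size).Select(_ => rng.NextDouble()).ToArray();

        var sw = Stopwatch.StartNew();
        Array.Sort(data);                 // placeholder for the routine under test
        sw.Stop();
        return sw.Elapsed;
    }
}
```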