Continuous Integration for Scientific Software – Best Practices

Tags: continuous-integration, object-oriented, testing

I'm not a software engineer.
I'm a PhD student in the field of geoscience.

Almost two years ago I started writing a piece of scientific software. I have never used continuous integration (CI), mainly because at first I didn't know it existed and I was the only person working on this software.

Now that the core of the software is working, other people are starting to get interested in it and want to contribute. The plan is that people at other universities will implement additions to the core software (I'm worried they could introduce bugs). Additionally, the software has become quite complex and harder and harder to test, and I also plan to keep working on it.

For these two reasons, I'm thinking more and more about using CI.
Since I never had a software engineering education and nobody around me has ever heard of CI (we are scientists, not programmers), I find it hard to get started on my project.

I have a couple of questions on which I would like some advice:

First of all, a short explanation of how the software works:

  • The software is controlled by one .xml file containing all required settings. You start it by passing the path to the .xml file as an input argument; it then runs and creates a couple of files with the results. A single run takes ~30 seconds.

  • It is scientific software. Almost all of the functions take multiple input parameters, most of which are instances of quite complex classes. I have multiple .txt files with big catalogs that are used to create instances of these classes.

Now let's come to my questions:

  1. Unit tests, integration tests, end-to-end tests?
    My software is now around 30,000 lines of code with hundreds of functions and ~80 classes.
    It feels kind of strange to start writing unit tests for hundreds of functions that are already implemented.
    So I thought about simply creating some test cases: prepare 10-20 different .xml files and let the software run. I guess this is what is called end-to-end testing? I often read that you should not do this, but maybe it is OK as a start if you already have working software? Or is it simply a dumb idea to try to add CI to already working software?

  2. How do you write unit tests if the function parameters are difficult to create?
    Assume I have a function double fun(vector<Class_A> a, vector<Class_B> b); usually I would first need to read in multiple text files to create the objects of type Class_A and Class_B. I thought about writing some dummy factory functions like Class_A create_dummy_object() that don't read the text files (a rough sketch follows after this list). I also thought about implementing some kind of serialization. (I do not plan to test the creation of the class objects, since it only depends on multiple text files.)

  3. How do you write tests if results are highly variable?
    My software makes heavy use of Monte Carlo simulations and works iteratively. Usually there are ~1000 iterations, and at every iteration ~500-20,000 instances of objects are created based on Monte Carlo draws. If just one result of one iteration is slightly different, all the following iterations are completely different. How do you deal with this situation? I guess this is a big point against end-to-end tests, since the end result is highly variable?
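To illustrate question 2, a dummy factory could look roughly like this (Class_A's and Class_B's members shown here are just placeholders, not my real classes):

    #include <vector>

    // Placeholder definitions -- the real classes are far more complex.
    struct Class_A { double magnitude; double depth; };
    struct Class_B { int id; double weight; };

    // Dummy factories that build small, hand-crafted objects
    // without reading any of the catalog .txt files.
    Class_A create_dummy_a() { return Class_A{5.0, 10.0}; }
    Class_B create_dummy_b() { return Class_B{1, 0.5}; }

    // A unit test could then call fun() with fully controlled inputs:
    //   double result = fun({create_dummy_a()}, {create_dummy_b()});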

Any other advice with CI is highly appreciated.

Best Answer

Testing scientific software is difficult, both because of the complex subject matter and because of typical scientific development processes (a.k.a. hack it until it works, which doesn't usually result in a testable design). This is a bit ironic considering that science should be reproducible. What changes compared to “normal” software is not whether tests are useful (yes!), but which kinds of tests are appropriate.

Handling randomness: all runs of your software MUST be reproducible. If you use Monte Carlo techniques, you must make it possible to provide a specific seed for the random number generator.

  • It is easy to forget this e.g. when using C's rand() function which depends on global state.
  • Ideally, a random number generator is passed as an explicit object through your functions. C++11's <random> standard library header makes this a lot easier.
  • Instead of sharing random state across modules of the software, I've found it useful to create a second RNG which is seeded by a random number from the first RNG. Then, if the number of requests to the RNG by the other module changes, the sequence generated by the first RNG stays the same.
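A minimal C++ sketch of both of these points (the function and variable names are purely illustrative, not taken from your code):

    #include <cstddef>
    #include <random>
    #include <vector>

    // The RNG is an explicit parameter instead of hidden global state,
    // so a test can inject a fixed seed and reproduce a run exactly.
    std::vector<double> simulate_chunk(std::mt19937& rng, std::size_t n) {
        std::normal_distribution<double> dist(0.0, 1.0);
        std::vector<double> samples(n);
        for (auto& s : samples) s = dist(rng);
        return samples;
    }

    int main() {
        std::mt19937 master_rng(12345);        // fixed seed -> reproducible run

        // Give a sub-module its own generator, seeded from the master RNG.
        // If the sub-module later draws more (or fewer) numbers, the
        // master sequence is unaffected.
        std::mt19937 module_rng(master_rng());

        auto a = simulate_chunk(module_rng, 500);
        auto b = simulate_chunk(master_rng, 1000);
        return (a.size() == 500 && b.size() == 1000) ? 0 : 1;
    }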

Integration tests are perfectly fine. They are good at verifying that different parts of your software play together correctly, and for running concrete scenarios.

  • As a minimum quality level “it doesn't crash” can already be a good test result.
  • For stronger results, you will also have to check the results against some baseline. However, these checks will have to be somewhat tolerant, e.g. to account for rounding errors. It can also be helpful to compare summary statistics instead of full data rows (see the sketch after this list).
  • If checking against a baseline would be too fragile, check that the outputs are valid and satisfy certain properties. These can be general (“selected locations must be at least 2 km apart”) or scenario-specific, e.g. “a selected location must be within this area”.
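As a sketch of such a tolerant check (the numbers and the tolerance are made up; in a real test the values would come from your program's output files and a stored baseline):

    #include <cassert>
    #include <cmath>
    #include <numeric>
    #include <vector>

    // Compare with a relative tolerance to absorb rounding differences.
    bool approx_equal(double actual, double expected, double rel_tol) {
        return std::fabs(actual - expected) <= rel_tol * std::fabs(expected);
    }

    double mean(const std::vector<double>& v) {
        return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
    }

    int main() {
        std::vector<double> results = {1.01, 2.00, 2.99};  // placeholder output
        double baseline_mean = 2.0;                        // stored baseline

        // Check a summary statistic instead of every individual value.
        assert(approx_equal(mean(results), baseline_mean, 1e-2));
        return 0;
    }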

When running integration tests, it is a good idea to write a test runner as a separate program or script. This test runner performs necessary setup, runs the executable to be tested, checks any results, and cleans up afterwards.
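A sketch of such a runner (assuming C++17 for std::filesystem; the executable name, scenario file, and output paths are hypothetical):

    #include <cstdlib>
    #include <filesystem>
    #include <iostream>

    namespace fs = std::filesystem;

    int main() {
        // 1. Setup: start from a clean output directory.
        fs::remove_all("test_output");
        fs::create_directory("test_output");

        // 2. Run the executable under test with a prepared .xml scenario.
        int exit_code = std::system("./my_simulation tests/scenario_01.xml");

        // 3. Check: at minimum "it didn't crash" and "it produced output".
        bool ok = (exit_code == 0) && fs::exists("test_output/results.txt");

        // 4. Report; a non-zero exit code tells the CI server the test failed.
        std::cout << (ok ? "PASS" : "FAIL") << " scenario_01\n";
        return ok ? 0 : 1;
    }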

Unit-test-style checks can be quite difficult to retrofit into scientific software because the software has not been designed for them. In particular, unit tests get difficult when the system under test has many external dependencies/interactions. If the software is not purely object-oriented, it is not generally possible to mock/stub those dependencies. I've found it best to largely avoid unit tests for such software, except for pure math functions and other utility functions.
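For those pure functions, even plain assert-based tests are a start (no particular framework assumed; the interpolation function below is just an example), though a framework such as Catch2 or Google Test gives nicer reporting:

    #include <cassert>
    #include <cmath>

    // Example of a small, pure function that is easy to unit test.
    double linear_interpolate(double x0, double y0, double x1, double y1, double x) {
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0);
    }

    int main() {
        // Exact value at an endpoint, tolerance-based check in between.
        assert(linear_interpolate(0.0, 0.0, 1.0, 2.0, 0.0) == 0.0);
        assert(std::fabs(linear_interpolate(0.0, 0.0, 1.0, 2.0, 0.5) - 1.0) < 1e-12);
        return 0;
    }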

Even a few tests are better than no tests. Combined with the check “it has to compile” that's already a good start into continuous integration. You can always come back and add more tests later. You can then prioritize areas of the code that are more likely to break, e.g. because they get more development activity. To see which parts of your code are not covered by unit tests, you can use code coverage tools.

Manual testing: Especially for complex problem domains, you will not be able to test everything automatically. E.g. I'm currently working on a stochastic search problem. If I test that my software always produces the same result, I can't improve it without breaking the tests. Instead, I've made it easier to do manual tests: I run the software with a fixed seed and get a visualization of the result (depending on your preferences, R, Python/Pyplot, and Matlab all make it easy to get high-quality visualizations of your data sets). I can use this visualization to verify that things did not go terribly wrong. Similarly, tracing the progress of your software via logging output can be a viable manual testing technique, at least if you can select the type of events to be logged.
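A tiny sketch of such selectable logging (the event categories and messages are arbitrary examples):

    #include <iostream>
    #include <string>

    // Only messages from enabled event categories are printed.
    enum class LogEvent { Iteration, Sampling };

    struct Logger {
        bool log_iterations = true;
        bool log_sampling   = false;   // toggle categories per run

        void log(LogEvent type, const std::string& msg) const {
            if ((type == LogEvent::Iteration && log_iterations) ||
                (type == LogEvent::Sampling  && log_sampling)) {
                std::cout << msg << '\n';
            }
        }
    };

    int main() {
        Logger logger;
        logger.log(LogEvent::Iteration, "iteration 1: 512 objects created");
        logger.log(LogEvent::Sampling,  "drew sample 0.734");  // suppressed
        return 0;
    }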