TDD for batch processing: How to do it

tdd, testing

I like "red/green/refactor" for RoR, etc. just fine.

My day job involves batch processing very large files from third parties, using Python and other custom tools.

Churn on the attributes of these files is high, so there are a lot of fixes/enhancements applied pretty frequently.

Regression testing against a known body of test data with expected results does not exist. The closest thing is running against the last batch with new test cases hand-coded in, making sure it does not blow up, then applying spot checks and statistical tests to see whether the data still looks OK.

Q: How do I bring TDD principles into this kind of environment?

Best Answer

Just an FYI: Unit testing is not equivalent to TDD. TDD is a process of which unit testing is an element.

With that said, if you are looking to implement unit testing, there are a number of things you could do:

All new code/enhancements are tested

This way you don't have to go through and unit test everything that already exists, so the initial hump of implementing unit testing is much smaller.

Test individual pieces of data

Testing something that can contain large amounts of data can lead to many edge cases and gaps in the test coverage. Instead, consider the 0, 1, many option. Test a 'batch' with 0 elements, 1 element and many elements. In the case of 1 element, test the various permutations that the data for that element can be in.
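The 0, 1, many idea can be sketched with the standard library's unittest. This is only an illustration: process_batch and the "id,amount" line format are hypothetical stand-ins for whatever your batch step actually does, and the single-record variations are assumed, not taken from real data.

```python
import unittest


def process_batch(lines):
    """Hypothetical batch step: parse 'id,amount' lines into records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record_id, amount = line.split(",")
        records.append({"id": record_id.strip(), "amount": float(amount)})
    return records


class ZeroOneManyTest(unittest.TestCase):
    def test_zero_elements(self):
        self.assertEqual(process_batch([]), [])

    def test_one_element(self):
        self.assertEqual(
            process_batch(["a1,2.50"]), [{"id": "a1", "amount": 2.5}]
        )

    def test_one_element_permutations(self):
        # Assumed variations a single record might arrive in.
        for raw in ["a1,2.5", " a1 , 2.5 ", "a1,2.50\n"]:
            with self.subTest(raw=raw):
                self.assertEqual(
                    process_batch([raw]), [{"id": "a1", "amount": 2.5}]
                )

    def test_many_elements(self):
        lines = [f"id{i},{i}" for i in range(1000)]
        result = process_batch(lines)
        self.assertEqual(len(result), 1000)
        self.assertEqual(result[-1], {"id": "id999", "amount": 999.0})
```

Saved in a test module, this runs with python -m unittest. The permutation test uses subTest so that one failing variation is reported on its own instead of hiding the others.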

From there, test the edge cases (upper bounds on the size of individual elements and on the number of elements in the batch). If you run the tests regularly and some of them are long-running (large batches?), most test runners allow categorization, so you can run those test cases separately (nightly?).
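As one way to separate the long-running cases, here is a minimal sketch using only the standard library: slow tests are skipped unless an environment variable is set. The variable name RUN_SLOW_TESTS and the count_records function are assumptions made up for this example. If you use pytest, custom markers selected with -m are a common alternative for the same purpose.

```python
import os
import unittest

# Hypothetical switch: the nightly job would set RUN_SLOW_TESTS=1.
RUN_SLOW = os.environ.get("RUN_SLOW_TESTS") == "1"


def count_records(lines):
    """Hypothetical stand-in for the batch step under test."""
    return sum(1 for line in lines if line.strip())


class FastTests(unittest.TestCase):
    def test_small_batch(self):
        self.assertEqual(count_records(["a", "", "b"]), 2)


class SlowTests(unittest.TestCase):
    @unittest.skipUnless(RUN_SLOW, "large-batch test; runs in the nightly job")
    def test_large_batch(self):
        # Generator keeps memory flat while still exercising a big batch.
        lines = (f"row{i}" for i in range(1_000_000))
        self.assertEqual(count_records(lines), 1_000_000)
```

With this layout, python -m unittest runs only the fast tests during normal development, and RUN_SLOW_TESTS=1 python -m unittest includes the large-batch case.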

That should give you a strong base.

Using actual data

Feeding in 'actual' previously used data like you're doing now isn't a bad idea. Just complement it with well-formed test data so that you immediately know the specific points of failure. When the process fails to handle actual data, you can inspect the results of the batch run, write a unit test that replicates the error, and then you're back in red/green/refactor with useful regression cases.
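Turning a failure on actual data into a regression case might look like the sketch below. Everything in it is hypothetical: parse_amount, the whitespace-only field that supposedly broke a batch, and the rule that a missing amount becomes None are invented for illustration. The point is the shape: reduce the failure to the smallest input that reproduces it, watch the test fail, then fix the code.

```python
import unittest


def parse_amount(raw):
    """Hypothetical field parser.

    The empty-field branch is the 'green' fix, added after the
    regression test below first failed ('red') with a ValueError.
    """
    raw = raw.strip()
    if raw == "":
        return None  # assumed business rule: missing amount becomes None
    return float(raw)


class AmountRegressionTest(unittest.TestCase):
    def test_empty_amount_from_failed_batch(self):
        # Smallest input reproducing the (hypothetical) production failure.
        self.assertIsNone(parse_amount("   "))

    def test_well_formed_amount_still_parses(self):
        # Guard against the fix breaking the normal path.
        self.assertEqual(parse_amount(" 12.50 "), 12.5)
```

Each such test stays in the suite, so over time the suite becomes the known body of regression data that the question says is missing.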
