Unit Testing – Where to Store Test Data

data, testing, unit testing

I have small unit tests that use short snippets from real data sets. I would also like to test my program against full data sets, for a multitude of reasons. The only problem is that a single real data set is about 5 GB. I haven't found any hard numbers for what Git repositories can store, but that seems like too much.

(According to this Programmers post I should keep all of my data needed to test the project in the repository.)

The solution my team has adopted is that the project has a file containing a path to a network-attached file system that holds our test data. The file is Git-ignored.

I feel this is an imperfect solution for two reasons. First, when the NAS is slow or down, we can't run a full test. Second, when someone first clones the repository the unit tests fail, so they have to figure out how to mount things under a certain name and what syntax to use for the testing path file.

So my question is twofold. How much data is too much data to store in revision control?

What is a better way to handle large amounts of test data?

Best Answer

How to handle large files in a build chain

I like to use a build tool that does dependency management, such as Maven or Gradle. The files are stored in a web repository, and the tool takes care of downloading and caching them automatically when it encounters the dependency. This also eliminates extra setup (NAS configuration) for people who want to run the tests. And because the data is versioned, refreshing it is fairly painless.
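If you are not on a build tool that does this for you, the same download-and-cache idea can be sketched by hand. The following is only a minimal illustration in Python, not what Maven or Gradle do internally. The names `fetch_dataset` and `sha256_of`, the cache layout, and the use of a SHA-256 checksum as the "version" of the data are all assumptions made for the example.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a multi-gigabyte data set never sits in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_dataset(url: str, expected_sha256: str, cache_dir: Path) -> Path:
    """Return a local copy of the data set, downloading it only on a cache miss.

    Hypothetical helper: the cached file is named after its checksum, so
    pinning a new checksum in the test code amounts to pinning a new version.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / expected_sha256

    # Cache hit: the file is already present and intact, so skip the download.
    if target.exists() and sha256_of(target) == expected_sha256:
        return target

    # Download to a temporary name so an interrupted transfer is never
    # mistaken for a complete file.
    partial = target.with_name(target.name + ".partial")
    with urllib.request.urlopen(url) as response, partial.open("wb") as out:
        shutil.copyfileobj(response, out)

    actual = sha256_of(partial)
    if actual != expected_sha256:
        partial.unlink()
        raise ValueError(f"checksum mismatch for {url}: got {actual}")

    partial.replace(target)
    return target
```

A test would then call something like `fetch_dataset(url, checksum, Path.home() / ".cache" / "myproject")` (the cache location here is just a placeholder) and open the returned path. Only the URL and checksum live in the repository, so a fresh clone needs no NAS setup.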

What's too big to put in revision control

There is a large gray area. And if you decide something doesn't belong in an RCS, what are your alternatives? The decision is easier if you limit your choices to the RCS and a binary repository (Maven style).

Ideally, you'd only put in the RCS things that are human-editable, diffable, or whose history you want to track. Anything that is the product of a build or some other sort of automation definitely doesn't belong there. Size is a constraint, but not the main one: a giant source file (bad practice) definitely belongs in source control. A tiny compiled binary doesn't.

Be ready to compromise for developer convenience.
