Unit-testing – How to unit test \ use TDD methods for ETL’s and reporting projects

sqltddunit testing

ETL projects are projects created using an ETL (Extract – Transform – Load) tool such as SSIS, PowerCenter,etc

These typically involve reading data from an external source, loading it to a staging database, performing certain transformations and loading it to a final database

A simple example would be to use SSIS to read excel files provided by schoolteachers using SSIS and load them into a database. Then write stored procedures or more SSIS packages to calculate the grades of each student and load that data into a data mart\warehouse

You then build stored procedures on top of the mart to generate output which is used by reporting tools (SSRS\Excel\etc) to generate visualizations.

I'm trying to understand how to perform TDD and proper unit testing in this scenario. Tests for ETL's are mostly about ensuring the data loaded in the staging tables matches is the right subset of the data from the source. So implementing a test for it leads to implementing a mini version of the ETL. The output of the report SP's depends on the data in the tables themselves, so one cannot have a stable set of output data without a maintenance nightmare even if you do create a database containing scrubbed test data

Example:

Sprint 1 : Student table contains Name, Age, Grade

You create test data for this table, and unit tests based on that

Sprint 2: A gender field is added to the table.

Now, if you refresh the data in the student field to populate the gender attribute, the test cases are invalidated since the data changed. And if you dont, you cant create test cases that require the gender column

Best Answer

What I have done in the past is use Acceptance Test Driven Development. ETL code is often distributed across different stages/languages and technologies AND tightly coupled. Most ETL process are dependent on the sequence of transformations in the pipeline.

The risk in using unit test only in ETL is that it won't cover the integrations. The sequencing of transformations is an equal part to the actual transformations in many ETLs. If I am spending resources on creating an automated test suite I would make sure it covered the sequencing as well.

I would focus on TDD for each unique transformation sequence or at least include these tests in a larger test suite. If there are too many combinations you may have to pick and choose which sequences to test. The idea is to validate the ETL pipeline for the data set(s) it will be used on. As well as making sure you have test coverage on all your code.

Related Topic