Data Warehousing – Common Mistakes and How to Avoid Them

data-warehouse

I work for a company that has several products, and several databases and tons of data.

To sorta merge all this data what we've been doing moving some of our data over to Amazon Redshift (a data warehouse). We store as much as we can here (everything very de-normalized). We consider redshift our 'source' data, as we've changed a lot of procedures to pretty much just bulk insert straight to the data warehouse before going to any other database first.

The part I'm having a hard time understanding is I thought a data warehouse was suppose to help with reporting.

With Amazon Redshift, they have disabled stored procedures, recursive ctes, user defined functions and a bunch of other things..which makes it sorta tough to report on data.

What we do in a lot of situations, is export the data from redshift into a different database, just so we can run some reports (not all the data, we just export the data that is relevant to the report).

Some other reports that are run we have set up programs that dynamically produce the sql and send in the ad-hoc query to get the data we need, these programs become a pretty good mess after too much modification (concatenating strings to make the sql, hard to test without just running the program ect)

This all seems sorta wrong, are we using the concept of data warehousing wrong?

Best Answer

Instead of thinking of data warehouse as a warehouse (store as much as we can, very denormalised), to get the benefit you need to think about what benefit you want to get out of it. Basically, in data warehouse, instead of having just data, you have Facts and Measures (Dimensions). After you build these Facts and Dimensions, you will get what's called a snow flake schema, i.e. not very normalised.

If you have reports that rely on stored procedures, user defined functions, etc. you probably need to update them to take advantage of the data warehouse schema. Done correctly, reports should only need some joins and maybe special calculations which are stored in some libraries, not in database.

Another way to use data warehouse is using OLAP libraries, it's like SQL for data warehouse. OLAP queries work on data warehouse Cube, which is collection of Facts and Dimension. OLAP libraries will translate them to SQL queries.

Also, it's probably easier to understand how to build data warehouse using ETL tools before start storing data straight to data warehouse so you can understand what transformations would be needed. In the end, it should be easy to get reports and analytics out of your data warehouse using off the shelf applications.

Related Topic