Is Normalization Enough for Data Integrity and Quality Reporting?

database-design, normalization

I've gotten out into the real world recently, and for the first time I've had to really think about database design, as I've been developing across the entire Java stack. In the process I've realized that I don't fully understand the design side of data stores, and I haven't seen much about the relationship between data stores and reporting.

When I started my programming diploma we were taught the basics of database design, such as the importance of normalization and naming conventions, but I've never done much of this in the real world. So I wonder:

Is a correctly normalized database the only consideration when attempting to deliver a data store with solid data integrity, and that is easily reported on?

and

If so, does more normalization tend to lead to more reportable data?

Best Answer

There are two aspects to the question, and each has slightly different concerns.

Is a correctly normalized database the only consideration when attempting to deliver a data store with solid data integrity, and that is easily reported on?

No. Data integrity also demands constraints.

  • Primary key constraints uniquely identify each record. This guards against duplicate key values, but does not necessarily prevent logically duplicate rows (the same customer inserted twice under different surrogate keys, for example).

  • Foreign key constraints help ensure that related data is kept in sync: a table for "customer phone numbers" should have a corresponding "customer," for example. Orphaned records and missing data harm data integrity.

  • Field/column constraints can help ensure data is valid. For example, a phone number may be stored in a VARCHAR field but should contain only digits, not letters or formatting characters. A constraint can guarantee that if the data exists, it meets the criteria that make it valid for the given schema, as sketched below.
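Here is a minimal DDL sketch of all three kinds of constraint, using a hypothetical customer/phone-number schema (names are illustrative, and the digits-only check uses SIMILAR TO, which is PostgreSQL-flavoured; other databases express it differently):

```sql
-- Hypothetical schema for illustration; names are not from the question.
CREATE TABLE customer (
    customer_id INTEGER      NOT NULL,
    name        VARCHAR(200) NOT NULL,
    CONSTRAINT pk_customer PRIMARY KEY (customer_id)   -- uniquely identifies each row
);

CREATE TABLE customer_phone_number (
    phone_id     INTEGER     NOT NULL,
    customer_id  INTEGER     NOT NULL,
    phone_number VARCHAR(20) NOT NULL,
    CONSTRAINT pk_phone PRIMARY KEY (phone_id),
    -- every phone number must belong to an existing customer (no orphans)
    CONSTRAINT fk_phone_customer FOREIGN KEY (customer_id)
        REFERENCES customer (customer_id),
    -- digits only: no letters or formatting characters
    CONSTRAINT ck_phone_digits_only CHECK (phone_number SIMILAR TO '[0-9]+')
);
```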

If so, does more normalization tend to lead to more reportable data?

Normalization tends to make data harder to report on. The reason is that a typical RDBMS schema is designed around the application's object model (often via an ORM), and what looks like a single object may require several tables:

  • A class that uses inheritance (i.e. has subclasses) often maps to one table per class in practice, because subclass-specific data members do not apply to the superclass and end up in their own table.

  • Related objects may have their own tables. A customer with multiple phone numbers may hold them as a List<String> in the application, but in the schema the phone numbers live in their own table, forming a 0..* relationship (see the sketch after this list).
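As a standalone sketch (separate from the DDL above, names again illustrative), here is how one application object can fan out: a Customer that extends a Party superclass and holds a List<String> of phone numbers, mapped table-per-class:

```sql
-- Superclass: fields shared by every kind of party.
CREATE TABLE party (
    party_id INTEGER      NOT NULL PRIMARY KEY,
    name     VARCHAR(200) NOT NULL
);

-- Subclass: customer-only fields live in their own table,
-- sharing the superclass key (joined / table-per-class inheritance).
CREATE TABLE customer_account (
    party_id     INTEGER        NOT NULL PRIMARY KEY REFERENCES party (party_id),
    credit_limit NUMERIC(12, 2) NOT NULL DEFAULT 0
);

-- The List<String> of phone numbers becomes a child table:
-- zero or more rows per customer, i.e. a 0..* relationship.
CREATE TABLE account_phone_number (
    party_id     INTEGER     NOT NULL REFERENCES customer_account (party_id),
    phone_number VARCHAR(20) NOT NULL,
    PRIMARY KEY (party_id, phone_number)
);
```

One Customer object in memory is now spread across three tables, and any report over customers has to join them back together.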

Reports are usually record-based, where a "record" is defined by the report itself. They typically denormalize data, presenting one central table with related data folded in. This is normally at odds with normalization and ORM-style mapping.

What this means is that an object you use quite easily in the application may explode into many tables in the database schema, adding relationships with varying cardinality. This requires joins when writing a report query, some of which may be complex or require subqueries. I have seen report SQL queries with ten or more joins, correlated subqueries, aggregates, and other intermediate to advanced SQL features, which add complexity and can harm query performance.
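For example, a hypothetical "customer summary" report over the first sketch above (plus assumed customer_order and order_item tables, which I've made up for illustration) already needs outer joins, aggregates, and a correlated subquery:

```sql
-- One row per customer: phone count, order count, lifetime spend.
-- customer_order and order_item are assumed tables, not defined above.
SELECT  c.customer_id,
        c.name,
        (SELECT COUNT(*)
           FROM customer_phone_number p
          WHERE p.customer_id = c.customer_id)        AS phone_numbers,  -- correlated subquery
        COUNT(DISTINCT o.order_id)                    AS orders,
        COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS lifetime_value
FROM      customer       c
LEFT JOIN customer_order o  ON o.customer_id = c.customer_id
LEFT JOIN order_item     oi ON oi.order_id   = o.order_id
GROUP BY  c.customer_id, c.name;
```

Note that the phone count has to be a correlated subquery: joining the phone table directly would multiply the order rows and inflate the totals, which is exactly the kind of trap these report queries create.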


The typical way to deal with this, as I have seen professionally, is to maintain a separate set of denormalized reporting tables built specifically for your reports, and to use triggers or stored procedures to populate them. This is more work at persistence time, but it saves a lot of time and hair-pulling when writing SQL for your reports.
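A sketch of that approach, assuming a hypothetical wide customer_report table refreshed by a trigger on customer_order (trigger syntax varies a great deal between databases; this is PostgreSQL-flavoured):

```sql
-- Denormalized reporting table: one wide row per customer, ready to SELECT from directly.
CREATE TABLE customer_report (
    customer_id    INTEGER PRIMARY KEY,
    name           VARCHAR(200),
    phone_numbers  INTEGER,
    orders         INTEGER,
    lifetime_value NUMERIC(14, 2)
);

-- Recompute the affected customer's reporting row whenever an order changes.
CREATE FUNCTION refresh_customer_report() RETURNS trigger AS $$
BEGIN
    INSERT INTO customer_report (customer_id, name, phone_numbers, orders, lifetime_value)
    SELECT  c.customer_id,
            c.name,
            (SELECT COUNT(*) FROM customer_phone_number p WHERE p.customer_id = c.customer_id),
            (SELECT COUNT(*) FROM customer_order o WHERE o.customer_id = c.customer_id),
            COALESCE((SELECT SUM(oi.quantity * oi.unit_price)
                        FROM customer_order o
                        JOIN order_item oi ON oi.order_id = o.order_id
                       WHERE o.customer_id = c.customer_id), 0)
    FROM    customer c
    WHERE   c.customer_id = NEW.customer_id
    ON CONFLICT (customer_id) DO UPDATE
        SET name           = EXCLUDED.name,
            phone_numbers  = EXCLUDED.phone_numbers,
            orders         = EXCLUDED.orders,
            lifetime_value = EXCLUDED.lifetime_value;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_refresh_customer_report
AFTER INSERT OR UPDATE ON customer_order
FOR EACH ROW EXECUTE FUNCTION refresh_customer_report();
```

Report queries then become a plain SELECT against customer_report, at the cost of extra work on every write.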

You can also use application code: when you save an object and have it and its related objects in memory, construct a query that inserts or updates the corresponding record in your reporting tables. This may be easier to write and faster at run time than relying on triggers.
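For example, the save path might issue a single upsert with the counts and totals it already computed from the in-memory objects (parameters shown as ? placeholders; ON CONFLICT is PostgreSQL-flavoured upsert syntax, and customer_report is the hypothetical table from the sketch above):

```sql
-- Issued from the application right after the customer and its related objects
-- are persisted; the values come from objects already held in memory,
-- so the database does no extra joins or aggregation.
INSERT INTO customer_report (customer_id, name, phone_numbers, orders, lifetime_value)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (customer_id) DO UPDATE
    SET name           = EXCLUDED.name,
        phone_numbers  = EXCLUDED.phone_numbers,
        orders         = EXCLUDED.orders,
        lifetime_value = EXCLUDED.lifetime_value;
```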
