Denormalising to improve performance? It sounds convincing, but it doesn't hold water.
Chris Date, who along with Dr Ted Codd was one of the original proponents of the relational data model, ran out of patience with misinformed arguments against normalisation and systematically demolished them using scientific method: he got large databases and tested these assertions.
I think he wrote it up in Relational Database Writings 1988-1991, but that material was later rolled into edition six of Introduction to Database Systems, which is the definitive text on database theory and design. It is in its eighth edition as I write and likely to remain in print for decades to come. Chris Date was an expert in this field when most of us were still running around barefoot.
He found that:
- Some of them hold for special cases
- All of them fail to pay off for general use
- All of them are significantly worse for other special cases
It all comes back to mitigating the size of the working set. Joins involving properly selected keys with correctly set up indexes are cheap, not expensive, because they allow significant pruning of the result before the rows are materialised.
Materialising the result involves bulk disk reads which are the most expensive aspect of the exercise by an order of magnitude. Performing a join, by contrast, logically requires retrieval of only the keys. In practice, not even the key values are fetched: the key hash values are used for join comparisons, mitigating the cost of multi-column joins and radically reducing the cost of joins involving string comparisons. Not only will vastly more fit in cache, there's a lot less disk reading to do.
Moreover, a good optimiser will choose the most restrictive condition and apply it before it performs a join, very effectively leveraging the high selectivity of joins on indexes with high cardinality.
Admittedly this type of optimisation can also be applied to denormalised databases, but the sort of people who want to denormalise a schema typically don't think about cardinality when (if) they set up indexes.
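To make that concrete, here is a minimal sketch (all table, column and index names are hypothetical) of the kind of schema where the optimiser can prune on indexes before a single row is materialised:

```sql
-- Hypothetical schema: orders referencing customers by a surrogate key.
CREATE TABLE Customer (
    CustomerId INT PRIMARY KEY,               -- indexed automatically as the key
    Region     VARCHAR(20) NOT NULL
);

CREATE TABLE OrderHeader (
    OrderId    INT PRIMARY KEY,
    CustomerId INT NOT NULL REFERENCES Customer (CustomerId),
    OrderDate  DATE NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
);

-- Index the join and filter columns so an index strategy is available.
CREATE INDEX IX_OrderHeader_CustomerId ON OrderHeader (CustomerId);
CREATE INDEX IX_Customer_Region ON Customer (Region);

-- The optimiser can apply the most restrictive predicate (Region) first,
-- join on the indexed key, and only then materialise the surviving rows.
SELECT o.OrderId, o.OrderDate, o.Amount
FROM Customer c
JOIN OrderHeader o ON o.CustomerId = c.CustomerId
WHERE c.Region = 'APAC';
```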
It is important to understand that table scans (examination of every row in a table in the course of producing a join) are rare in practice. A query optimiser will choose a table scan only when one or more of the following holds.
- There are fewer than 200 rows in the relation (in this case a scan will be cheaper)
- There are no suitable indexes on the join columns (if it's meaningful to join on these columns then why aren't they indexed? fix it)
- A type coercion is required before the columns can be compared (WTF?! fix it or go home - see the ADO.NET issue in the notes at the end)
- One of the arguments of the comparison is an expression, so there is no index to use (see the sketch below)
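As a sketch of that last point (same hypothetical tables as above), wrapping an indexed column in an expression takes the index off the table, while an equivalent range predicate keeps it in play:

```sql
-- Assumes an index on OrderHeader (OrderDate).

-- Expression on the indexed column: the optimiser cannot seek, so it scans.
SELECT OrderId FROM OrderHeader WHERE YEAR(OrderDate) = 2023;

-- Equivalent range predicate: the index can be used for a seek.
SELECT OrderId
FROM OrderHeader
WHERE OrderDate >= '2023-01-01' AND OrderDate < '2024-01-01';
```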
Performing an operation is more expensive than not performing it. However, performing the wrong operation, being forced into pointless disk I/O and then discarding the dross prior to performing the join you really need, is much more expensive. Even when the "wrong" operation is precomputed and indexes have been sensibly applied, there remains a significant penalty. Denormalising to precompute a join - notwithstanding the update anomalies entailed - is a commitment to a particular join. If you need a different join, that commitment is going to cost you big.
To anyone who wants to remind me that it's a changing world: I think you'll find that bigger datasets on gruntier hardware just exaggerate the spread of Date's findings.
For all of you who work on billing systems or junk mail generators (shame on you) and are indignantly setting hand to keyboard to tell me that you know for a fact that denormalisation is faster, sorry but you're living in one of the special cases - specifically, the case where you process all of the data, in-order. It's not a general case, and you are justified in your strategy.
You are not justified in falsely generalising it. See the end of the notes section for more information on appropriate use of denormalisation in data warehousing scenarios.
I'd also like to respond to
Joins are just cartesian products with some lipgloss
What a load of bollocks. Restrictions are applied as early as possible, most restrictive first. You've read the theory, but you haven't understood it. Joins are treated as "cartesian products to which predicates apply" only by the query optimiser. This is a symbolic representation (a normalisation, in fact) to facilitate symbolic decomposition so the optimiser can produce all the equivalent transformations and rank them by cost and selectivity so that it can select the best query plan.
The only way you will ever get the optimiser to produce a cartesian product is to fail to supply a predicate: SELECT * FROM A,B
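To spell that out (hypothetical tables A and B), the first two statements below are the same query as far as the optimiser is concerned; only the last one, with no predicate at all, yields a cartesian product:

```sql
-- Same query, same plan: the predicate is what matters, not the syntax.
SELECT * FROM A JOIN B ON A.id = B.a_id;
SELECT * FROM A, B WHERE A.id = B.a_id;

-- No predicate at all: the only way to get a genuine cartesian product.
SELECT * FROM A, B;
```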
Notes
David Aldridge provides some important additional information.
There is indeed a variety of other strategies besides indexes and table scans, and a modern optimiser will cost them all before producing an execution plan.
A practical piece of advice: if it can be used as a foreign key then index it, so that an index strategy is available to the optimiser.
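In most engines (SQL Server and PostgreSQL among them) foreign key columns are not indexed automatically, so in practice this advice amounts to a one-liner per foreign key (hypothetical table and column):

```sql
-- Invoice.CustomerId is joined on constantly, so give the optimiser an index to use.
CREATE INDEX IX_Invoice_CustomerId ON Invoice (CustomerId);
```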
I used to be smarter than the MSSQL optimiser. That changed two versions ago. Now it generally teaches me. It is, in a very real sense, an expert system, codifying all the wisdom of many very clever people in a domain sufficiently closed that a rule-based system is effective.
"Bollocks" may have been tactless. I am asked to be less haughty and reminded that math doesn't lie. This is true, but not all of the implications of mathematical models should necessarily be taken literally. Square roots of negative numbers are very handy if you carefully avoid examining their absurdity (pun there) and make damn sure you cancel them all out before you try to interpret your equation.
The reason that I responded so savagely was that the statement as worded says that
Joins are cartesian products...
This may not be what was meant but it is what was written, and it's categorically untrue. A cartesian product is a relation. A join is a function. More specifically, a join is a relation-valued function. With an empty predicate it will produce a cartesian product, and checking that it does so is one correctness check for a database query engine, but nobody writes unconstrained joins in practice because they have no practical value outside a classroom.
I called this out because I don't want readers falling into the ancient trap of confusing the model with the thing modelled. A model is an approximation, deliberately simplified for convenient manipulation.
The cut-off for selection of a table-scan join strategy may vary between database engines. It is affected by a number of implementation decisions such as tree-node fill-factor, key-value size and subtleties of algorithm, but broadly speaking high-performance indexing has an execution time of k log n + c. The c term is a fixed overhead mostly made of setup time, and the shape of the curve means you don't get a payoff (compared to a linear search) until n is in the hundreds.
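As a rough back-of-envelope illustration (the constants are invented purely to show the shape of the curves):

$$
\text{scan}(n) \approx a\,n, \qquad \text{index}(n) \approx k \log_2 n + c
$$

With, say, a = 1, k = 4 and c = 20 in arbitrary units, the index only wins once 4 log2(n) + 20 < n, which first happens in the low forties; push the setup cost c higher and the break-even point drifts into the hundreds, which is where the rule-of-thumb cut-off above comes from.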
Sometimes denormalisation is a good idea
Denormalisation is a commitment to a particular join strategy. As mentioned earlier, this interferes with other join strategies. But if you have buckets of disk space, predictable patterns of access, and a tendency to process much or all of it, then precomputing a join can be very worthwhile.
You can also figure out the access paths your operation typically uses and precompute all the joins for those access paths. This is the premise behind data warehouses, or at least it is when they're built by people who know why they're doing what they're doing, and not just for the sake of buzzword compliance.
A properly designed data warehouse is produced periodically by a bulk transformation out of a normalised transaction processing system. This separation of the operations and reporting databases has the very desirable effect of eliminating the clash between OLTP and OLAP (online transaction processing ie data entry, and online analytical processing ie reporting).
An important point here is that apart from the periodic updates, the data warehouse is read only. This renders moot the question of update anomalies.
Don't make the mistake of denormalising your OLTP database (the database on which data entry happens). It might be faster for billing runs but if you do that you will get update anomalies. Ever tried to get Reader's Digest to stop sending you stuff?
Disk space is cheap these days, so knock yourself out. But denormalising is only part of the story for data warehouses. Much bigger performance gains are derived from precomputed rolled-up values: monthly totals, that sort of thing. It's always about reducing the working set.
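A minimal sketch of that rollup idea (SQL Server syntax; all names hypothetical, and an Amount column is assumed on the order table): the summary table is rebuilt as part of the periodic warehouse load, and reports read it instead of the raw orders.

```sql
-- Rebuilt during the periodic warehouse load, never maintained by the OLTP system.
CREATE TABLE MonthlySalesSummary (
    SalesMonth  DATE          NOT NULL,   -- first day of the month
    Region      VARCHAR(20)   NOT NULL,
    TotalAmount DECIMAL(18,2) NOT NULL,
    OrderCount  INT           NOT NULL,
    PRIMARY KEY (SalesMonth, Region)
);

INSERT INTO MonthlySalesSummary (SalesMonth, Region, TotalAmount, OrderCount)
SELECT DATEFROMPARTS(YEAR(o.OrderDate), MONTH(o.OrderDate), 1),
       c.Region,
       SUM(o.Amount),
       COUNT(*)
FROM OrderHeader o
JOIN Customer c ON c.CustomerId = o.CustomerId
GROUP BY DATEFROMPARTS(YEAR(o.OrderDate), MONTH(o.OrderDate), 1), c.Region;

-- Reports now touch a few hundred summary rows instead of millions of order rows.
SELECT SalesMonth, Region, TotalAmount
FROM MonthlySalesSummary
WHERE SalesMonth >= '2023-01-01'
ORDER BY SalesMonth, Region;
```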
ADO.NET problem with type mismatches
Suppose you have a SQL Server table containing an indexed column of type varchar, and you use AddWithValue to pass a parameter constraining a query on this column. C# strings are Unicode, so the inferred parameter type will be NVARCHAR, which doesn't match VARCHAR.
VARCHAR to NVARCHAR is a widening conversion so it happens implicitly - but say goodbye to indexing, and good luck working out why.
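The trap is easy to reproduce in plain T-SQL (hypothetical column; assume Customer.Surname is an indexed VARCHAR). On the ADO.NET side the fix is to declare the parameter type explicitly - for example with SqlDbType.VarChar - rather than letting AddWithValue infer NVARCHAR:

```sql
-- NVARCHAR literal (which is what an inferred parameter amounts to): the VARCHAR
-- column is implicitly widened to NVARCHAR before the comparison, and depending
-- on collation the index seek degenerates into a scan.
SELECT CustomerId FROM Customer WHERE Surname = N'Wone';

-- VARCHAR literal (or an explicitly typed VARCHAR parameter): an ordinary index seek.
SELECT CustomerId FROM Customer WHERE Surname = 'Wone';
```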
"Count the disk hits" (Rick James)
If everything is cached in RAM, JOINs are rather cheap. That is, normalization does not have much performance penalty.

If a "normalized" schema causes JOINs to hit the disk a lot, but the equivalent "denormalized" schema would not have to hit the disk, then denormalization wins a performance competition.
Comment from original author: Modern database engines are very good at organising access sequencing to minimise cache misses during join operations. The above, while true, might be misconstrued as implying that joins are necessarily problematically expensive on large data. That would lead to poor decision-making on the part of inexperienced developers.
Distributed databases aren't quite as naive as Orion implies; there has been quite a bit of work done on optimizing fully relational queries over distributed datasets. You may want to look at what companies like Teradata, Netezza, Greenplum, Vertica, AsterData, etc. are doing. (Oracle finally got in the game as well with their recent announcement; Microsoft bought their solution in the form of the company that used to be called DataAllegro.)
That being said, when the data scales up into terabytes, these issues become very non-trivial. If you don't need the strict transactionality and consistency guarantees you can get from RDBMSs, it is often far easier to denormalize and not do joins. Especially if you don't need to cross-reference much. Especially if you are not doing ad-hoc analysis, but require programmatic access with arbitrary transformations.
Denormalization is overrated. Just because that's what happens when you are dealing with 100 terabytes doesn't mean this fact should be leaned on by every developer who never bothered to learn about databases and has trouble querying a million or two rows due to poor schema planning and query optimization.

But if you are in the 100-terabyte range, by all means...
Oh, the other reason these technologies are getting the buzz -- folks are discovering that some things never belonged in the database in the first place, and are realizing that they aren't dealing with relations in their particular fields, but with basic key-value pairs. For things that shouldn't have been in a DB, it's entirely possible that the Map-Reduce framework, or some persistent, eventually-consistent storage system, is just the thing.
On a less global scale, I highly recommend BerkeleyDB for those sorts of problems.
Best Answer
The way I look at it, a relational database is a general-purpose tool to hedge your bets. Modern computers are fast enough, and RDBMSs are well-optimized enough, that you can grow to quite a respectable size on a single box. By choosing an RDBMS you are giving yourself very flexible access to your data, and the ability to have powerful correctness constraints that make it much easier to code against the data. However, the RDBMS is not going to represent a good optimization for any particular problem; it just gives you the flexibility to change problems easily.
If you start growing rapidly and realize you are going to have to scale beyond the size of a single DB server, you suddenly have much harder choices to make. You will need to start identifying bottlenecks and removing them. The RDBMS is going to be one nasty snarled knot of codependency that you'll have to tease apart. The more interconnected your data, the more work you'll have to do, but maybe you won't have to completely disentangle the whole thing. If you're read-heavy, maybe you can get by with simple replication. If you're saturating your market and growth is leveling off, maybe you can partially denormalize and shard to a fixed number of DB servers. Maybe you just have a handful of problem tables that can be moved to a more scalable data store. Maybe your usage profile is very cache-friendly and you can just migrate the load to a giant memcached cluster.
Where scalable key-value stores like BigTable come in is when none of the above can work, and you have so much data of a single type that even when it's denormalized a single table is too much for one server. At this point you need to be able to partition it arbitrarily and still have a clean API to access it. Naturally when the data is spread out across so many machines you can't have algorithms that require these machines to talk to each other much, which many of the standard relational algorithms would require. As you suggest, these distributed querying algorithms have the potential to require more total processing power than the equivalent JOIN in a properly indexed relational database, but because they are parallelized the real time performance is orders of magnitude better than any single machine could do (assuming a machine that could hold the entire index even exists).
Now once you can scale your massive data set horizontally (by just plugging in more servers), the hard part of scalability is done. Well, I shouldn't say done, because ongoing operations and development at this scale are a lot harder than for a single-server app, but the point is that application servers are typically trivial to scale via a shared-nothing architecture as long as they can get the data they need in a timely fashion.
To answer your question about how commonly used ORMs handle the inability to use JOINs, the short answer is they don't. ORM stands for Object Relational Mapping, and most of the job of an ORM is just translating the powerful relational paradigm of predicate logic into simple object-oriented data structures. Most of the value of what they give you is simply not going to be possible from a key-value store. In practice you will probably need to build up and maintain your own data-access layer that's suited to your particular needs, because data profiles at these scales are going to vary dramatically and I believe there are too many tradeoffs for a general-purpose tool to emerge and become dominant the way RDBMSs have. In short, you'll always have to do more legwork at this scale.
That said, it will definitely be interesting to see what kind of relational or other aggregate functionality can be built on top of the key-value store primitives. I don't really have enough experience here to comment specifically, but there is a lot of knowledge in enterprise computing about this going back many years (e.g. Oracle), a lot of untapped theoretical knowledge in academia, a lot of practical knowledge at Google, Amazon, Facebook, et al, but the knowledge that has filtered out into the wider development community is still fairly limited.
However now that a lot of applications are moving to the web, and more and more of the world's population is online, inevitably more and more applications will have to scale, and best practices will begin to crystallize. The knowledge gap will be whittled down from both sides by cloud services like AppEngine and EC2, as well as open source databases like Cassandra. In some sense this goes hand in hand with parallel and asynchronous computation which is also in its infancy. Definitely a fascinating time to be a programmer.