RDBMS Design – Why RDBMSes Don’t Return Joined Tables in Nested Format

designrdbmssql

For example, say I want to fetch a User and all of his phone numbers and email addresses. The phone numbers and emails are stored in separate tables, One user to many phones/emails. I can do this quite easily:

SELECT * FROM users user 
    LEFT JOIN emails email ON email.user_id=user.id
    LEFT JOIN phones phone ON phone.user_id=user.id

The problem* with this is that it's returning the user's name, DOB, favorite color, and all the other information stored in the user table over-and-over again for each record (users×emails×phones records), presumably eating up bandwidth and slowing down the results.

Wouldn't it be nicer if it returned a single row for each user, and within that record there was a list of emails and a list of phones? It would make the data much easier to work with too.

I know you can get results like this using LINQ or perhaps other frameworks, but it seems to be a weakness in the underlying design of relational databases.

We could get around this by using NoSQL, but shouldn't there be some middle ground?

Am I missing something? Why doesn't this exist?

* Yes, it's designed this way. I get it. I'm wondering why there isn't an alternative that is easier to work with. SQL could keep doing what it's doing but then they could add a keyword or two to do a little bit of post-processing that returns the data in a nested format instead of a cartesian product.

I know this can be done in a scripting language of your choice, but it requires that the SQL server either sends redundant data (example below) or that you to issue multiple queries like SELECT email FROM emails WHERE user_id IN (/* result of first query */).


Instead of having MySQL return something akin to this:

[
    {
        "name": "John Smith",
        "dob": "1945-05-13",
        "fav_color": "red",
        "email": "johnsmith45@gmail.com",
    },
    {
        "name": "John Smith",
        "dob": "1945-05-13",
        "fav_color": "red",
        "email": "john@smithsunite.com",
    },
    {
        "name": "Jane Doe",
        "dob": "1953-02-19",
        "fav_color": "green",
        "email": "originaljane@deerclan.com",
    }
]

And then having to group on some unique identifier (which means I need to fetch that too!) client-side to reformat the result set how you want it, just return this:

[
    {
        "name": "John Smith",
        "dob": "1945-05-13",
        "fav_color": "red",
        "emails": ["johnsmith45@gmail.com", "john@smithsunite.com"]
    },
    {
        "name": "Jane Doe",
        "dob": "1953-02-19",
        "fav_color": "green",
        "emails": ["originaljane@deerclan.com"],
    }
]

Alternatively, I can issue 3 queries: 1 for the users, 1 for the emails, and 1 for the phone numbers, but then the email and phone number result sets need to contain the user_id so that I can match them back up with the users I previously fetched. Again, redundant data and needless post-processing.

Best Answer

Deep down, in the guts of a relational database, its all rows and columns. That is the structure that a relational database is optimized to work with. Cursors work on individual rows at a time. Some operations create temporary tables (again, it needs to be rows and columns).

By working with only rows and returning only rows, the system is able to better deal with memory and network traffic.

As mentioned, this allows for certain optimizations to be done (indexes, joins, unions, etc...)

If one was to want a nested tree structure, this requires that one pulls all the data at once. Gone are the optimizations for the cursors on the database side. Likewise, the traffic over the network becomes one big burst that can take much longer than the slow trickle of row by row (this is something that is occasionally lost in today's web world).

Every language has arrays within it. These are easy things to work with and interface with. By using a very primitive structure, the driver between the database and program - no matter what language - can work in a common way. Once one starts adding trees, the structures in the language become more complex and more difficult to traverse.

It isn't that hard for a programing language to convert the rows returned into some other structure. Make it into a tree or a hash set or leave it as a list of rows that you can iterate over.

There is also history at work here. Transferring structured data was something ugly in the days of old. Look at the EDI format to get an idea of what you might be asking for. Trees also imply recursion - which some languages didn't support (the two most important languages of the old days didn't support recursion - recursion didn't enter Fortran until F90 and of the era COBOL didn't either).

And while the languages of today have support for recursion and more advanced data types, there isn't really a good reason to change things. They work, and they work well. The ones that are changing things are the nosql databases. You can store trees in documents in a document based one. LDAP (its actually oldish) is also a tree based system (though its probably not what you're after). Who knows, maybe the next thing in nosql databases will be one that returns back the query as a json object.

However, the 'old' relational databases... they're working with rows because thats what they're good at and everything can talk to them without trouble or translation.

  1. In protocol design, perfection has been reached not when there is nothing left to add, but when there is nothing left to take away.

From RFC 1925 - The Twelve Networking Truths