I would say that NULL
is the correct choice for "no email address". There are many "invalid" email addresses, and "" (empty string) is just one. For example "foo" is not a valid email address, "a@b@c" is not valid and so on. So just because "" is not a valid email address is no reason to use it as the "no email address" value.
I think you're right in saying that "" is not the correct way to say "I don't have a value for this column". "" is a value.
An example of where "" might be a valid value, separate to NULL
could be a person's middle name. Not every one has a middle name, so you need to differentiate between "no middle name" ("" - empty string) and "I don't know if this person has a middle name or not" (NULL
). There's probably many other examples where an empty string is still a valid value for a column.
C. J. Date goes into detail about this in Chapter 7 and Appendix B of SQL and Relational Theory. You're right, there's nothing in relational theory that prohibits an attribute's data type from being a relation itself, as long as it's the same relation type on every row. Your example would qualify.
But Date says structures like this are "usually--but not invariably--contraindicated" (i.e. a Bad Idea) because hierarchies of relations are asymmetric. For example, a transformation from nested structure to a familiar "flat" structure cannot always be reversed to recreate the nesting.
Queries, constraints, and updates are more complex, harder to write, and harder for the RDBMS to support if you allow relation-valued attributes (RVA's).
It also muddies database design principles, because the best hierarchy of relations isn't so clear. Should we design a relation of Suppliers with a nested RVA for parts supplied by a given Supplier? Or a relation of Parts with a nested RVA for suppliers who supply a given Part? Or store both, to make it easy to run different types of queries?
This is the same dilemma that results from the hierarchical database and the document-oriented database models. Eventually, the complexity and cost of accessing nested data structures drives designers to store data redundantly for easier lookup by different queries. The relational model discourages redundancy, so RVA's can work against the goals of relational modeling.
From what I understand (I have not used them), Rel and Dataphor are RDBMS projects that support relation-valued attributes.
Re comment from @dportas:
Structured types are part of SQL-99, and Oracle supports these. But they don't store multiple tuples in the nested table per row of the base table. The common example is an "address" attribute which appears to be a single column of the base table, but has further sub-columns for street, city, postal code, etc.
Nested tables are also supported by Oracle, and these do allow multiple tuples per row of the base table. But I am not aware that this is part of standard SQL. And keep in mind the conclusion of one blog: "I'll never use a nested table in a CREATE TABLE statement. You spend all of your time UN-NESTING them to make them useful again!"
Best Answer
Let's take another example that is less fraught with conceptions and expectations. I've got an enum here, and it is the set of priorities for a bug.
What value are you storing in the database?
So, I could be storing
'C'
,'H'
,'M'
, and'L'
in the database. Or'HIGH'
and so on. This has the problem of stringly-typed data. There's a known set of valid values, and if you aren't storing that set in the database, it can be difficult to work with.Why are you storing the data in the code?
You've got
List<String> priorities = {'CRITICAL', 'HIGH', 'MEDIUM', 'LOW'};
or something to that effect in the code. It means that you've got various mappings of this data to the proper format (you're inserting all caps into the database, but you're displaying it asCritical
). Your code is now also difficult to localize. You have bound the database representation of the idea to a string that is stored in the code.Anywhere you need to access this list, you either need to have code duplication or a class with a bunch of constants. Neither of which are good options. One should also not forget that there are other applications that may use this data (which may be written in other languages - the Java web application has a Crystal Reports reporting system used and a Perl batch job feeding data into it). The reporting engine would need to know the valid list of data (what happens if there's nothing marked in
'LOW'
priority and you need to know that that is a valid priority for the report?), and the batch job would have the information about what the valid values are.Hypothetically, you might say "we're a single-language shop - everything is written in Java" and have a single .jar that contains this information - but now it means that your applications are tightly coupled to each other and that .jar containing the data. You'll need to release the reporting part and the batch update part along with the web application each time there is a change - and hope that that release goes smoothly for all parts.
What happens when your boss wants another priority?
Your boss came by today. There's a new priority -
CEO
. Now you have to go and change all the code and do a recompile and redeploy.With an 'enum-in-the-table' approach, you update the enum list to have a new priority. All the code that gets the list pulls it from the database.
Data rarely stands alone
With priorities, the data keys into other tables that might contain information about workflows, or who can set this priority or whatnot.
Going back to the gender as mentioned in the question for a bit: Gender has a link to the pronouns in use:
he/his/him
andshe/hers/her
... and you want to avoid hard coding that into the code itself. And then your boss comes by and you need to add you've got the'OTHER'
gender (to keep it simple) and you need to relate this gender tothey/their/them
... and your boss sees what Facebook has and... well, yeah.By restricting yourself to a stringly-typed bit of data rather than an enum table, you've now needed to replicate that string in a bunch of other tables to maintain this relationship between the data and its other bits.
What about other datastores?
No matter where you store this, the same principle exists.
You could have a file,
priorities.prop
, that has the list of priorities. You read this list in from a property file.You could have a document store database (like CouchDB) that has an entry for
enums
(and then write a validation function in JavaScript):You could have an XML file with a bit of a schema:
The core idea is the same. The data store itself is where the list of valid values needs to be stored and enforced. By placing it here, it is easier to reason about the code and the data. You don't have to worry about defensively checking what you have each time (is it upper case? or lower? Why is there a
critical
type in this column? etc...) because you know what you are getting back from the datastore is exactly what the datastore is expecting you to send otherwise - and you can query the datastore for a list of valid values.The takeaway
The set of valid values is data, not code. You do need to strive for DRY code - but the issue of duplication is that you are duplicating the data in the code, rather than respecting its place as data and storing it in a database.
It makes writing multiple applications against the datastore easier and avoids having instances where you will need to deploy everything that is tightly coupled to the data itself - because you haven't coupled your code to the data.
It makes testing applications easier because you don't have to retest the entire application when the
CEO
priority is added - because you don't have any code that cares about the actual value of the priority.Being able to reason about the code and the data independently from each other makes it easier to find and fix bugs when doing maintenance.