Database Design – Why Store Enums in a Database?

databasedatabase-designdesignenum

I've seen a number of questions, like this, asking for advice on how to store enums in DB. But I wonder why would you do that. So let's say that I have an entity Person with a gender field, and a Gender enum. Then, my person table has a column gender.

Besides the obvious reason of enforcing correctness, I don't see why I would create an extra table gender to map what I already have in my application. And I don't really like having that duplication.

Best Answer

Let's take another example that is less fraught with conceptions and expectations. I've got an enum here, and it is the set of priorities for a bug.

What value are you storing in the database?

So, I could be storing 'C', 'H', 'M', and 'L' in the database. Or 'HIGH' and so on. This has the problem of stringly-typed data. There's a known set of valid values, and if you aren't storing that set in the database, it can be difficult to work with.

Why are you storing the data in the code?

You've got List<String> priorities = {'CRITICAL', 'HIGH', 'MEDIUM', 'LOW'}; or something to that effect in the code. It means that you've got various mappings of this data to the proper format (you're inserting all caps into the database, but you're displaying it as Critical). Your code is now also difficult to localize. You have bound the database representation of the idea to a string that is stored in the code.

Anywhere you need to access this list, you either need to have code duplication or a class with a bunch of constants. Neither of which are good options. One should also not forget that there are other applications that may use this data (which may be written in other languages - the Java web application has a Crystal Reports reporting system used and a Perl batch job feeding data into it). The reporting engine would need to know the valid list of data (what happens if there's nothing marked in 'LOW' priority and you need to know that that is a valid priority for the report?), and the batch job would have the information about what the valid values are.

Hypothetically, you might say "we're a single-language shop - everything is written in Java" and have a single .jar that contains this information - but now it means that your applications are tightly coupled to each other and that .jar containing the data. You'll need to release the reporting part and the batch update part along with the web application each time there is a change - and hope that that release goes smoothly for all parts.

What happens when your boss wants another priority?

Your boss came by today. There's a new priority - CEO. Now you have to go and change all the code and do a recompile and redeploy.

With an 'enum-in-the-table' approach, you update the enum list to have a new priority. All the code that gets the list pulls it from the database.

Data rarely stands alone

With priorities, the data keys into other tables that might contain information about workflows, or who can set this priority or whatnot.

Going back to the gender as mentioned in the question for a bit: Gender has a link to the pronouns in use: he/his/him and she/hers/her... and you want to avoid hard coding that into the code itself. And then your boss comes by and you need to add you've got the 'OTHER' gender (to keep it simple) and you need to relate this gender to they/their/them... and your boss sees what Facebook has and... well, yeah.

By restricting yourself to a stringly-typed bit of data rather than an enum table, you've now needed to replicate that string in a bunch of other tables to maintain this relationship between the data and its other bits.

What about other datastores?

No matter where you store this, the same principle exists.

  • You could have a file, priorities.prop, that has the list of priorities. You read this list in from a property file.

  • You could have a document store database (like CouchDB) that has an entry for enums (and then write a validation function in JavaScript):

      {
         "_id": "c18b0756c3c08d8fceb5bcddd60006f4",
         "_rev": "1-c89f76e36b740e9b899a4bffab44e1c2",
         "priorities": [ "critical", "high", "medium", "low" ],
         "severities": [ "blocker", "bad", "annoying", "cosmetic" ]
      }
    
  • You could have an XML file with a bit of a schema:

      <xs:element name="priority" type="priorityType"/>
    
      <xs:simpleType name="priorityType">
        <xs:restriction base="xs:string">
          <xs:enumeration value="critical"/>
          <xs:enumeration value="high"/>
          <xs:enumeration value="medium"/>
          <xs:enumeration value="low"/>
        </xs:restriction>
      </xs:simpleType>
    

The core idea is the same. The data store itself is where the list of valid values needs to be stored and enforced. By placing it here, it is easier to reason about the code and the data. You don't have to worry about defensively checking what you have each time (is it upper case? or lower? Why is there a critical type in this column? etc...) because you know what you are getting back from the datastore is exactly what the datastore is expecting you to send otherwise - and you can query the datastore for a list of valid values.

The takeaway

The set of valid values is data, not code. You do need to strive for DRY code - but the issue of duplication is that you are duplicating the data in the code, rather than respecting its place as data and storing it in a database.

It makes writing multiple applications against the datastore easier and avoids having instances where you will need to deploy everything that is tightly coupled to the data itself - because you haven't coupled your code to the data.

It makes testing applications easier because you don't have to retest the entire application when the CEO priority is added - because you don't have any code that cares about the actual value of the priority.

Being able to reason about the code and the data independently from each other makes it easier to find and fix bugs when doing maintenance.

Related Topic