Database Design – Choosing Zipcode as Primary Key for Address

database-designnormalizationrelational-databasesql

I was following a tutorial on the normal forms of SQL databases, and I got confused landing on this example : https://www.tutorialspoint.com/sql/third-normal-form.htm.

From

CREATE TABLE CUSTOMERS(
   CUST_ID       INT              NOT NULL,
   CUST_NAME     VARCHAR (20)      NOT NULL,
   DOB           DATE,
   STREET        VARCHAR(200),
   CITY          VARCHAR(100),
   STATE         VARCHAR(100),
   ZIP           VARCHAR(12),
   EMAIL_ID      VARCHAR(256),
   PRIMARY KEY (CUST_ID)
);

It creates a new table ADDRESS like this because there is a "transitive dependency between zipcode and address".

CREATE TABLE ADDRESS(
   ZIP           VARCHAR(12),
   STREET        VARCHAR(200),
   CITY          VARCHAR(100),
   STATE         VARCHAR(100),
   PRIMARY KEY (ZIP)
);

This is where I am really confused. Why use the zipcode as a primary key? Primary keys have to be unique, can't you have two addresses with the same zipcode?

Either

  • I don't understand how a zipcode work
  • I don't understand how a primary key work
  • This example is clearly wrong
  • I don't understand something else.

Best Answer

The example is making a fundamental mistake: it's using data as a primary key. It should create and use unique IDs.

The comments debate how correct it is to assume that a zipcode maps to a particular street. Whether that's correct or not, the simple fact is that for this to work without a unique ID it must be correct, not only now but forever more. That is exactly why it's wrong to do this. You can't possibly know the future. Use a unique ID.

If zipcode uniquely identifies the data you're normalizing now then you have a natural key. But adding just one more record can destroy that. Natural keys can be used when importing data to help build unique ID relations. They should not be used when an application is collecting data from a user that can ensure the relationship is real.

People get these two use cases of structuring data confused all the time. Unique ID's should always be preferred in operational systems. The problem is they don't always exist. When they don't we can construct uniqueness by selecting data fields to be natural keys as we normalize. But that constructed uniqueness is ALWAYS brittle. It might only be true now. It's ok to use the fact that it's true now to generate unique IDs now. But after that new data should be assigned unique ID's properly.

Unique ID's don't erode as more data is added. Natural keys often do. Developers who field systems that insist that their natural key assumptions hold regardless of reality often cause problems that operators find themselves having to work around. Please don't do that.

Related Topic