Postgresql – Disadvantages of table with too many columns

database-designpostgresql

I have some data that I need to put in a PostgreSQL database. These data are related to schools. So there are a LOT of attributes relating to the school, mostly small integers, floats or small texts. And all the data changes yearly. So I'm creating an entity called YearlyData and putting the attributes there. But the thing is, the number of attributes are around 50-60. Now these cannot be normalized because they are plain attributes of the school itself. So I ethically cannot separate them into tables. But I'm not sure if this would harm my performance.

I can try to categorize those data and put them in separate tables and point to them from the YearlyData table. But then trying to search schools with 20-30+ parameters would cause insane amount of joins, I'm guessing. I'm also not sure if that'd harm my performance.

Any expert advice?

Best Answer

There are a few things to consider here:

Does the list of attributes change significantly over time
Does the list of attributes require custom user-defined attributes
Are there different attributes for different schools (i.e. many attributes only apply to one or a few schools)?

If any of these are true, you might think about a properties store approach like EAV, hstore, json fields, xml fields, etc.

If not - if you have a fairly static list of properties where most of them make sense for most of the rows - then there's not really a problem with having them as 60 individual columns. It'll be easier to add indexes for commonly searched for sets of attributes, including partial and composite indexes, etc, and searches - particularly those for many different attributes - will be much faster.

There's also a compromise option available to you: A main table for the most important details you look up a lot, plus side-tables for logical groupings of attributes. Say:

yearly_summary (
    yearly_summary_id serial primary key,
    school_id integer,
    total_students integer,
    ...
)

plus

yearly_student_stats(
    yearly_summary_id integer primary key references yearly_summary(yearly_summy_id) on delete cascade,
    ...
)

etc. The integer primary key that's also a foreign key means you have an enforced 1:1 (optional) relationship to the other table. This approach can be useful if you have a few logical groupings of attributes that you can cluster into side-tables.

I'd also be surprised if a little more thought didn't reveal things that do make sense to normalize. Do you have year7_blah, year8_blah, year9_blah etc columns? If so: Great candidate for normalization.

Related Solutions

PostgreSQL “DESCRIBE TABLE”

Try this (in the psql command-line tool):

\d+ tablename

See the manual for more info.

Mysql – thesql too many columns

To elaborate on RichardOD's answer, you generally have three options when dealing with subtyping, and which you choose depends on what you need to do with the data in question.

The first option is the one you're currently using: keep all columns related to the different types in one table, with flags and nulls used to indicate which type a given record is. It is the simplest way to manage subtyping, and it generally works well when you only have a few types or if the different types aren't very different. In your case, it seems like the types can vary quite a bit.

The second option is to keep a central table that contain all of the common columns between the subtypes, and have one-to-one relationships with other tables that contains the type-specific details of those types.

The third option is to not think of the different types as subtypes at all and just keep all the types' records in separate tables. So you'd have no common table between the types that keeps the common data, and each table would have some columns that are repeated across tables.

Now, each option has its place. You'd use the first option when there aren't many differences between the different types. You'd use the second option if you need to manipulate the common fields independently of the type-specific fields; for example, if you wanted to list all sports games in a big grid with general information, and then let users click to see the type-specific details of that game. You'd use the third option when the types aren't really very related at all and you're just storing them together out of convenience; dissimilar schemas, even if it shares a few fields, shouldn't be merged.

So think about what you need to do with the data and how it fits into the three options and decide for yourself which is best. If you can't decide, update your question with the details about how you plan to use the data and I or someone else should be able to help you more.

Best Answer

Related Solutions

PostgreSQL “DESCRIBE TABLE”

Mysql – thesql too many columns

Related Topic