Postgresql – Disadvantages of table with too many columns

database-designpostgresql

I have some data that I need to put in a PostgreSQL database. These data are related to schools. So there are a LOT of attributes relating to the school, mostly small integers, floats or small texts. And all the data changes yearly. So I'm creating an entity called YearlyData and putting the attributes there. But the thing is, the number of attributes are around 50-60. Now these cannot be normalized because they are plain attributes of the school itself. So I ethically cannot separate them into tables. But I'm not sure if this would harm my performance.

I can try to categorize those data and put them in separate tables and point to them from the YearlyData table. But then trying to search schools with 20-30+ parameters would cause insane amount of joins, I'm guessing. I'm also not sure if that'd harm my performance.

Any expert advice?

Best Answer

There are a few things to consider here:

  • Does the list of attributes change significantly over time
  • Does the list of attributes require custom user-defined attributes
  • Are there different attributes for different schools (i.e. many attributes only apply to one or a few schools)?

If any of these are true, you might think about a properties store approach like EAV, hstore, json fields, xml fields, etc.

If not - if you have a fairly static list of properties where most of them make sense for most of the rows - then there's not really a problem with having them as 60 individual columns. It'll be easier to add indexes for commonly searched for sets of attributes, including partial and composite indexes, etc, and searches - particularly those for many different attributes - will be much faster.

See also: Database design - should I use 30 columns or 1 column with all data in form of JSON/XML?

There's also a compromise option available to you: A main table for the most important details you look up a lot, plus side-tables for logical groupings of attributes. Say:

yearly_summary (
    yearly_summary_id serial primary key,
    school_id integer,
    total_students integer,
    ...
) 

plus

yearly_student_stats(
    yearly_summary_id integer primary key references yearly_summary(yearly_summy_id) on delete cascade,
    ...
)

etc. The integer primary key that's also a foreign key means you have an enforced 1:1 (optional) relationship to the other table. This approach can be useful if you have a few logical groupings of attributes that you can cluster into side-tables.

I'd also be surprised if a little more thought didn't reveal things that do make sense to normalize. Do you have year7_blah, year8_blah, year9_blah etc columns? If so: Great candidate for normalization.