PostgreSQL – How well does PostgreSQL perform with a large number of databases

database-administration, database-performance, postgresql

We have a web application whose architecture requires that every registered user (a company, actually) be isolated from the others, i.e., I'll run the same webapp with the same data models, but with a different data set for every customer.

So we have been thinking about creating a separate database in Postgres for every customer. Can this solution scale to, say, 10-20K databases? How well?

Does anyone have a better solution for this?

Thanks in advance.

Best Answer

At the low end, it basically boils down to "can you absolutely say that you have no shared data?" Unlike MySQL, the database is an absolute boundary in PostgreSQL. You cannot SELECT zip_code FROM common.city_zip WHERE city=... if you go with separate databases (at least not without dblink).
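For completeness, a minimal sketch of the dblink workaround, assuming a shared database named commondb and an illustrative city value standing in for the original "...":

    -- dblink lets one database query another over a separate connection.
    CREATE EXTENSION IF NOT EXISTS dblink;

    SELECT zip_code
    FROM dblink('dbname=commondb',
                'SELECT zip_code FROM city_zip WHERE city = ''Springfield''')
         AS t(zip_code text);  -- column definition list is required for dblink

It works, but every such call opens its own connection and the planner can't see inside the remote query, so it's a workaround rather than a substitute for genuinely shared tables.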

If you have any shared data at all, PostgreSQL's "schema" is similar to what MySQL calls a "database". You can CREATE SCHEMA clienta; CREATE TABLE clienta.customer (...);. You would create a schema for each client, that client's user would have their schema first in their search path, and permissions would be granted so that Client A's user would have access to the clienta and the public schemas (and their tables).
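A minimal sketch of that layout, assuming a client role named clienta and a shared city_zip table in the public schema (all names here are illustrative):

    -- One role and one schema per client:
    CREATE ROLE clienta LOGIN PASSWORD 'change_me';
    CREATE SCHEMA clienta AUTHORIZATION clienta;

    -- The client's copy of the data model lives inside their schema:
    CREATE TABLE clienta.customer (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );
    GRANT ALL ON clienta.customer TO clienta;

    -- Resolve unqualified table names to the client's schema first:
    ALTER ROLE clienta SET search_path = clienta, public;

    -- Shared data stays in public, readable by every client:
    GRANT USAGE ON SCHEMA public TO clienta;
    GRANT SELECT ON public.city_zip TO clienta;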

Your issue is going to be that at the high end of your client count, each table is stored as a file, so whether you go with one database per client, one schema per client, or use something like ${client}_customer for your table names, you will likely run into file descriptor limits with 10k clients even if you only have one table per client (plus one file descriptor per connection). Of course, you can adjust the kernel's maximum number of file descriptors on the fly using sysctl, but the per-process limit (ulimit) will require restarting PostgreSQL if you set it too low the first time around.
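If you want a rough sense of how many files you'd be asking for, a couple of queries (relkinds 'r', 'i', and 't' are ordinary tables, indexes, and TOAST tables; each is at least one file on disk):

    -- Count of on-disk relations in the current database:
    SELECT count(*) FROM pg_class WHERE relkind IN ('r', 'i', 't');

    -- PostgreSQL's own per-backend cap on open files:
    SHOW max_files_per_process;

Multiply the first number by your client count (for the per-database or per-schema designs) to see how it compares with the kernel and per-process limits.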

The alternative is to have "one big table" with a client column that identifies which client each row belongs to (ideally keyed by username if you have one database user per client; that makes the scheme below a LOT easier). By not granting the clients any access at all to this table, you can create client-specific views (or use session_user to identify the current client). Updates can't be done directly through a view, though: you would need to define functions to insert/update/delete on the table (one set of functions per client, or else using session_user), with the functions declared SECURITY DEFINER so they execute as a special user that does have permission to insert/update/delete on the table. (Note: session_user is used because user and current_user are based on the current context, and within a SECURITY DEFINER function those would always be the user who defined the function.)
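A sketch of that setup, with assumed names (customer, client, my_customers, add_customer, clienta) rather than anything from the original:

    -- Base table holds every client's rows; clients get no direct access.
    CREATE TABLE customer (
        client text   NOT NULL,   -- matches the client's login role name
        id     serial PRIMARY KEY,
        name   text   NOT NULL
    );
    REVOKE ALL ON customer FROM PUBLIC;

    -- Read path: a view filtered on session_user shows each client only
    -- their own rows (session_user stays the logged-in role, even inside
    -- SECURITY DEFINER code).
    CREATE VIEW my_customers AS
        SELECT id, name FROM customer WHERE client = session_user;
    GRANT SELECT ON my_customers TO clienta;

    -- Write path: a SECURITY DEFINER function runs as its owner, who does
    -- have INSERT rights on the base table.
    CREATE FUNCTION add_customer(p_name text) RETURNS integer AS $$
        INSERT INTO customer (client, name)
        VALUES (session_user, p_name)
        RETURNING id;
    $$ LANGUAGE sql SECURITY DEFINER;
    GRANT EXECUTE ON FUNCTION add_customer(text) TO clienta;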

Performance-wise, beyond the file descriptor issue, I honestly don't know what would happen with 10,000 databases in PostgreSQL versus one large table holding 10,000 clients' worth of data. Proper index design should keep the large table from being slow to query.
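For example, on the big-table sketch above, leading each index with the client column keeps per-client lookups selective even as the table grows (column names again assumed):

    -- Per-client queries filter on client first, then the search column:
    CREATE INDEX customer_client_name_idx ON customer (client, name);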

I will say that I went with separate databases for each client here (we add servers to keep the system usable, shifting client databases to new servers as needed, so we will never get to 10k databases on one server). I've regularly had to restore individual clients' data from backups for debugging or because of user error, something that would be an absolute nightmare with the "one big table" design. Also, if you intend to sell customization of your product to your clients, the "one big table" design might end up hobbling your ability to customize the data model.