PostgreSQL – 100 terabyte database on PostgreSQL without sharding

database, postgresql

Is it realistic to set up a 100 TB database (about 90 TB actually) on PostgreSQL without sharding the data across a number of nodes? Are there any success stories / examples of similar setups?

Best Answer

Absorbing 50K writes per second is usually more than a challenge. Even in synthetic benchmarks with quite simple inserts, PostgreSQL tends to max out at around 10K writes per second, and there you don't even have such a large beast in terms of database size.
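As a rough way to sanity-check that ceiling on your own hardware, a minimal insert-throughput sketch like the one below can help. The connection string, table name, row shape, and batch size are all assumptions for illustration, not anything from the question:

```python
# Minimal insert-throughput sketch using psycopg2 (hypothetical DSN/table).
import time
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=bench user=postgres")  # assumed DSN
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS bench_t (id bigint, payload text)")
conn.commit()

N = 100_000       # total rows to insert
BATCH = 1_000     # rows per multi-row INSERT
rows = [(i, "x" * 100) for i in range(BATCH)]

start = time.perf_counter()
for _ in range(N // BATCH):
    # execute_values expands the batch into a single multi-row INSERT
    execute_values(cur, "INSERT INTO bench_t (id, payload) VALUES %s", rows)
    conn.commit()  # one commit per batch; per-row commits would be far slower
elapsed = time.perf_counter() - start
print(f"{N / elapsed:,.0f} rows/s")
```

Real workloads with indexes, triggers, and concurrent readers will land well below whatever number this reports.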

The I/O system for that single PostgreSQL node is also going to be interesting. Even with RAID 10, and assuming that 50K inserts translate to just 50K IOPS (which is probably wrong, but depends on your database schema and indices), you are going to need roughly a hundred disks paired with a very good array whose cache saves you from buying several hundred disks to service those writes in a timely manner.
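To make that disk math concrete, here is the back-of-the-envelope arithmetic as a small sketch. The per-disk IOPS figure, the RAID 10 write penalty, and the cache-absorption ratio are all assumptions; the real numbers depend entirely on the hardware:

```python
# Back-of-the-envelope spindle count (all figures are assumptions).
write_iops_needed = 50_000  # from the stated write load
raid10_penalty = 2          # each logical write hits two mirrored disks
iops_per_disk = 200         # typical 15K RPM SAS drive, random writes

raw_disks = write_iops_needed * raid10_penalty / iops_per_disk
print(f"without controller cache: ~{raw_disks:.0f} disks")    # ~500

# A good array controller with a large write-back cache coalesces writes;
# assume, optimistically, that it absorbs ~80% of them.
cache_absorption = 0.8
cached_disks = raw_disks * (1 - cache_absorption)
print(f"with write-back cache:    ~{cached_disks:.0f} disks")  # ~100
```

That is where "roughly a hundred disks" versus "several hundred disks" comes from: the controller cache is doing a lot of the work.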

If sharding is easy and you expect such a huge write load, then go for sharding. Writes can be very difficult to scale.
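If you do shard, the write path can stay quite simple. Below is a minimal hash-routing sketch; the shard DSNs and the hash-on-key scheme are hypothetical, chosen only to show the idea:

```python
# Minimal hash-based shard routing sketch (shard list and key scheme assumed).
import hashlib

SHARDS = [
    "dbname=app host=shard0",  # hypothetical DSNs
    "dbname=app host=shard1",
    "dbname=app host=shard2",
    "dbname=app host=shard3",
]

def shard_for(key: str) -> str:
    """Map a key to a shard deterministically."""
    h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    return SHARDS[h % len(SHARDS)]

# Writers route each row by key, so every node only has to absorb
# its share of the total write load.
print(shard_for("user:12345"))
```

With four shards, each node sees roughly a quarter of the 50K writes per second, which is back inside the range a single PostgreSQL instance can realistically handle.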
