Architecture – Saving High-Frequency Events to a Connection-Limit Constrained Database

architecture · event-sourcing · performance · scalability

We've got a situation where we have to deal with a massive influx of events coming into our server: about 1000 events per second on average (peaks could reach ~2000).

The problem

Our system is hosted on Heroku and uses a relatively expensive Heroku Postgres DB, which allows a maximum of 500 DB connections. We use connection pooling to connect from the server to the DB.

Events come in faster than the DB connection pool can handle

The problem we have is that events come in faster than the connection pool can handle. By the time one connection has finished the network round trip from the server to the DB and can be released back to the pool, more events have arrived than the pool can keep up with.

Eventually the events stack up waiting to be saved; because there are no available connections in the pool, they time out and the whole system becomes non-operational.

We've resolved the immediate emergency by having the clients emit the offending high-frequency events at a slower pace, but we still want to know how to handle this scenario in case we need to handle such high-frequency events again.

Constraints

Other clients might want to read events concurrently

Other clients continuously request to read all the events with a particular key, even if they are not saved in the DB yet.

A client can query `GET api/v1/events?clientId=1` and get all the events sent by client 1, even if those events haven't finished saving to the DB yet.

Are there any "classroom" examples on how to deal with this?

Possible solutions

Enqueue the events on our server

We could enqueue the events on the server (with the queue having a maximum concurrency of 400 so the connection pool doesn't run out).
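For concreteness, here is roughly what that in-process queue could look like — a minimal sketch assuming Node/TypeScript, the `pg` driver and a hypothetical `events(client_id, payload)` table; the 400 cap comes from the number above. The unbounded `backlog` array is exactly where the problems listed below come from:

```typescript
import { Pool } from "pg";

// Pool capped below the plan's 500-connection limit (numbers from the text above).
const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 400 });

type IncomingEvent = { clientId: number; payload: unknown };

const backlog: IncomingEvent[] = []; // grows without bound -> eats RAM
let inFlight = 0;
const MAX_CONCURRENCY = 400;

export function enqueue(event: IncomingEvent): void {
  backlog.push(event);
  drain();
}

function drain(): void {
  while (inFlight < MAX_CONCURRENCY && backlog.length > 0) {
    const event = backlog.shift()!;
    inFlight++;
    pool
      .query("INSERT INTO events (client_id, payload) VALUES ($1, $2)", [
        event.clientId,
        JSON.stringify(event.payload),
      ])
      .catch((err) => console.error("insert failed", err)) // event is lost on failure or dyno restart
      .finally(() => {
        inFlight--;
        drain();
      });
  }
}
```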

This is a bad idea because:

  • It will eat up available server memory. The stacked-up enqueued events will consume massive amounts of RAM.
  • Our servers restart once every 24 hours. This is a hard limit imposed by Heroku. The server can restart while events are enqueued causing us to lose the enqueued events.
  • It introduces state on the server, thus hurting scalability. If we have a multi-server setup and a client wants to read all the enqueued + saved events, we won't know on which server the enqueued events live.

Use a separate message queue

I assume we could use a message queue (like RabbitMQ?) into which we pump the messages, with another server on the other end whose only job is saving the events to the DB.

I'm not sure whether message queues allow querying enqueued events (those that haven't been saved yet). If they do, then when a client wants to read the messages of another client, I can fetch the saved messages from the DB and the pending messages from the queue, concatenate them, and send them back to the client that made the read request.
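The consumer side of that idea might look roughly like this — a sketch assuming Node/TypeScript with `amqplib` and `pg`; the queue name, table and prefetch value are illustrative. The web server would just `sendToQueue` with `{ persistent: true }` and return immediately. Note that a plain work queue like this can't be queried by key, which is exactly the uncertainty above:

```typescript
import amqp from "amqplib";
import { Pool } from "pg";

const QUEUE = "events"; // hypothetical queue name
const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 50 });

// Writer process: drain the queue into Postgres at a pace the pool can sustain.
async function runDbWriter() {
  const conn = await amqp.connect(process.env.CLOUDAMQP_URL!);
  const ch = await conn.createChannel();
  await ch.assertQueue(QUEUE, { durable: true });
  ch.prefetch(50); // at most 50 unacked messages in flight

  await ch.consume(QUEUE, async (msg) => {
    if (!msg) return;
    const evt = JSON.parse(msg.content.toString());
    await pool.query(
      "INSERT INTO events (client_id, payload) VALUES ($1, $2)",
      [evt.clientId, JSON.stringify(evt.payload)]
    );
    ch.ack(msg); // only ack after the row is safely stored
  });
}

runDbWriter().catch(console.error);
```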

Use multiple databases, each saving a portion of the messages with a central DB-coordinator server to manage them

Another solution we've thought of is to use multiple databases with a central "DB coordinator/load balancer". Upon receiving an event, this coordinator would choose one of the databases to write the message to. This would allow us to use multiple Heroku databases, raising the connection limit to 500 × the number of databases.

Upon a read query, this coordinator could issue SELECT queries to each database, merge all the results and send them back to the client that requested the read.
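The read path of such a coordinator could be a simple scatter-gather, sketched here with two shards; the env var names and the `events` table are assumptions:

```typescript
import { Pool } from "pg";

// One pool per Heroku Postgres instance (connection strings are placeholders).
const shards = [
  new Pool({ connectionString: process.env.DATABASE_URL_1, max: 400 }),
  new Pool({ connectionString: process.env.DATABASE_URL_2, max: 400 }),
];

// Fan the read out to every shard, then merge and re-sort the partial results.
export async function readEvents(clientId: number) {
  const results = await Promise.all(
    shards.map((pool) =>
      pool.query(
        "SELECT client_id, payload, created_at FROM events WHERE client_id = $1",
        [clientId]
      )
    )
  );
  return results
    .flatMap((r) => r.rows)
    .sort((a, b) => a.created_at - b.created_at);
}
```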

This is a bad idea because:

  • This idea sounds like… ahem… over-engineering. It would also be a nightmare to manage (backups etc.). It's complicated to build and maintain, and unless it's absolutely necessary it sounds like a KISS violation.
  • It sacrifices consistency. Transactions across multiple DBs are a no-go if we go with this idea.

Best Answer

Input stream

It is not clear if your 1000 events/second represent peaks or if it's a continuous load:

  • if it's a peak, you could use a message queue as a buffer to spread the load on the DB server over a longer time;
  • if it's a constant load, the message queue alone is not sufficient, because the DB server will never be able to catch up. Then you'd need to think about a distributed database.
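A rough way to decide which case you're in (the latency figure below is an illustrative assumption, not a measurement): the sustainable write rate is roughly the pool size divided by the average insert round trip, and a buffer only helps if the long-run input rate stays below that.

```typescript
// Back-of-the-envelope capacity check (illustrative numbers only).
const poolSize = 400;             // connections the writer is allowed to use
const avgInsertSeconds = 0.05;    // measured average INSERT round trip, e.g. 50 ms
const maxSustainedWritesPerSec = poolSize / avgInsertSeconds; // = 8000 here

const avgInputRate = 1000;        // events/s from the question
// A queue only absorbs peaks; if avgInputRate exceeds the sustainable rate,
// the backlog grows forever and you need batching or a distributed DB.
console.log(avgInputRate < maxSustainedWritesPerSec ? "buffer is enough" : "must scale the DB");
```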

Proposed solution

Intuitively, in both cases I'd go for a Kafka-based event stream:

  • All events are systematically published to a Kafka topic.
  • A consumer subscribes to the events and stores them in the database.
  • A query processor handles the requests from clients and queries the DB.
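A minimal sketch of the first two pieces, assuming Node/TypeScript with the `kafkajs` client and `pg`; the broker address, topic name and table are placeholders:

```typescript
import { Kafka } from "kafkajs";
import { Pool } from "pg";

const kafka = new Kafka({ clientId: "event-api", brokers: ["kafka:9092"] });
const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 50 });

// 1. The web server publishes every incoming event to the topic and returns.
//    Keying by clientId keeps one client's events in order on one partition.
const producer = kafka.producer();

export async function startProducer() {
  await producer.connect();
}

export async function publishEvent(evt: { clientId: number; payload: unknown }) {
  await producer.send({
    topic: "events",
    messages: [{ key: String(evt.clientId), value: JSON.stringify(evt) }],
  });
}

// 2. A separate consumer process drains the topic into Postgres at its own pace;
//    the connection pool never has to keep up with the raw input rate.
export async function runDbWriter() {
  const consumer = kafka.consumer({ groupId: "db-writer" });
  await consumer.connect();
  await consumer.subscribe({ topic: "events", fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const evt = JSON.parse(message.value!.toString());
      await pool.query(
        "INSERT INTO events (client_id, payload) VALUES ($1, $2)",
        [evt.clientId, JSON.stringify(evt.payload)]
      );
    },
  });
}

// 3. The query processor is an ordinary HTTP handler that reads from the DB
//    (the options below cover events that haven't reached the DB yet).
```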

This is highly scalable at all levels:

  • If the DB server is the bottleneck, just add several consumers. Each could subscribe to the topic and write to a different DB server. However, if events are distributed randomly across the DB servers, the query processor cannot predict which DB server holds a given event and has to query all of them. This could create a new bottleneck on the query side.
  • The DB distribution scheme can therefore be anticipated by organising the event stream into several topics (for example, using groups of keys or properties to partition the DB according to a predictable logic); a sketch of this key-based routing follows the list.
  • If one message server is not sufficient to handle a growing flood of input events, you can add Kafka partitions to distribute a topic across several physical brokers.
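Here is the key-based routing mentioned above, sketched under the same assumptions (hypothetical env vars, shard count and `events` table): the same `clientId → shard` function is used by the writer and by the query processor, so a read hits exactly one DB instead of all of them. If the same key also drives the Kafka partition, each consumer can own a set of partitions and its matching shard.

```typescript
import { Pool } from "pg";

// Deterministic routing: the same clientId always maps to the same DB,
// so the query processor knows exactly which shard to ask.
const shards = [
  new Pool({ connectionString: process.env.DATABASE_URL_1, max: 400 }),
  new Pool({ connectionString: process.env.DATABASE_URL_2, max: 400 }),
];

export function shardFor(clientId: number): Pool {
  return shards[clientId % shards.length];
}

// DB-writer consumer: insert into the shard the key routes to.
export async function storeEvent(evt: { clientId: number; payload: unknown }) {
  await shardFor(evt.clientId).query(
    "INSERT INTO events (client_id, payload) VALUES ($1, $2)",
    [evt.clientId, JSON.stringify(evt.payload)]
  );
}

// Query processor: a single-shard read instead of a scatter-gather.
export async function readEvents(clientId: number) {
  const { rows } = await shardFor(clientId).query(
    "SELECT client_id, payload, created_at FROM events WHERE client_id = $1 ORDER BY created_at",
    [clientId]
  );
  return rows;
}
```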

Offering events not yet written in the DB to clients

You want your clients to also be able to access information that is still in the pipe and not yet written to the DB. This is a little more delicate.

Option 1: Using a cache to complement DB queries

I have not analysed this in depth, but the first idea that comes to my mind would be to make the query processor(s) consumers of the Kafka topics, but in a different Kafka consumer group. Each query processor would then receive all the messages that the DB writer receives, but independently, and could keep them in a local cache. Queries would then run against the DB plus the cache (with elimination of duplicates).
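A rough sketch of that query processor, assuming `kafkajs`, the same hypothetical `events` topic and table as above, and that every event carries a unique `eventId` so duplicates can be eliminated:

```typescript
import { Kafka } from "kafkajs";
import { Pool } from "pg";

const kafka = new Kafka({ clientId: "query-processor", brokers: ["kafka:9092"] });
const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 20 });

// In-memory cache of recent events per client, filled from the topic.
// (A real implementation would evict entries once they are known to be in the DB.)
const recentByClient = new Map<string, { eventId: string; payload: unknown }[]>();

export async function startCacheConsumer() {
  // A different groupId than the DB writer, so both receive every message.
  const consumer = kafka.consumer({ groupId: "query-cache" });
  await consumer.connect();
  await consumer.subscribe({ topic: "events" });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const evt = JSON.parse(message.value!.toString());
      const list = recentByClient.get(String(evt.clientId)) ?? [];
      list.push({ eventId: evt.eventId, payload: evt.payload });
      recentByClient.set(String(evt.clientId), list);
    },
  });
}

// Query path: DB rows plus cached rows, de-duplicated on the event id.
export async function readEvents(clientId: number) {
  const { rows } = await pool.query(
    "SELECT event_id, payload FROM events WHERE client_id = $1",
    [clientId]
  );
  const saved = rows.map((r) => ({ eventId: r.event_id, payload: r.payload }));
  const seen = new Set(saved.map((e) => e.eventId));
  const pending = (recentByClient.get(String(clientId)) ?? []).filter(
    (e) => !seen.has(e.eventId)
  );
  return [...saved, ...pending];
}
```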

The design would then look like:

(architecture diagram: producers → Kafka topic → DB writer and query processors in separate consumer groups)

The scalability of this query layer could be achieved by adding more query processors (each in its own consumer group).

Option 2: design a dual API

A better approach IMHO would be to offer a dual API (using the separate-consumer-group mechanism described above):

  • a query API for accessing events in the DB and/or making analytics
  • a streaming API that just forwards messages directly from the topic

The advantage is that you let the client decide what is interesting. This avoids systematically merging DB data with freshly cached data when the client is only interested in new incoming events. If the delicate merge between fresh and archived events is really needed, then the client has to organise it.
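The streaming half could be as simple as a server-sent-events endpoint fed by its own consumer group; this sketch assumes Express and `kafkajs`, with an illustrative endpoint path and the same `events` topic keyed by clientId as above:

```typescript
import express from "express";
import { Kafka } from "kafkajs";

const app = express();
const kafka = new Kafka({ clientId: "streaming-api", brokers: ["kafka:9092"] });

// Currently connected SSE clients, keyed by the clientId they asked for.
const subscribers = new Map<string, Set<express.Response>>();

// Streaming API: forward new events straight from the topic, no DB involved.
app.get("/api/v1/events/stream", (req, res) => {
  const clientId = String(req.query.clientId);
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.flushHeaders();

  const set = subscribers.get(clientId) ?? new Set<express.Response>();
  set.add(res);
  subscribers.set(clientId, set);
  req.on("close", () => set.delete(res));
});

async function startForwarder() {
  const consumer = kafka.consumer({ groupId: "streaming-api" });
  await consumer.connect();
  await consumer.subscribe({ topic: "events" });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const clientId = message.key?.toString() ?? "";
      for (const res of subscribers.get(clientId) ?? []) {
        res.write(`data: ${message.value?.toString()}\n\n`);
      }
    },
  });
}

startForwarder().catch(console.error);
app.listen(Number(process.env.PORT) || 3000);
```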

Variants

I proposed Kafka because it's designed for very high volumes and persists messages, so you can restart the servers if needed.

You could build a similar architecture with RabbitMQ. However, if you need persistent queues, it might decrease performance. Also, as far as I know, the only way to achieve parallel consumption of the same messages by several readers (e.g. writer + cache) with RabbitMQ is to clone the queues. So higher scalability might come at a higher price.
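For reference, the usual way to get that duplication with RabbitMQ is a fanout exchange bound to one durable queue per reader; a sketch with `amqplib`, where the exchange and queue names are illustrative:

```typescript
import amqp from "amqplib";

// "Cloning the queues": a fanout exchange copies every event into one durable
// queue per independent reader (DB writer, query cache, ...).
async function setupTopology() {
  const conn = await amqp.connect(process.env.CLOUDAMQP_URL!);
  const ch = await conn.createChannel();

  await ch.assertExchange("events", "fanout", { durable: true });
  for (const queue of ["events.db-writer", "events.query-cache"]) {
    await ch.assertQueue(queue, { durable: true });
    await ch.bindQueue(queue, "events", "");
  }

  // Publishers send to the exchange; every bound queue receives a copy.
  ch.publish("events", "", Buffer.from(JSON.stringify({ clientId: 1, payload: {} })), {
    persistent: true,
  });
}

setupTopology().catch(console.error);
```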
