How to Handle Backpressure in Message Queues – Techniques

aws · design · serverless

I am designing an AWS web service that will receive 1000 TPS from devices (Android), and it depends on multiple downstream services. The use case is that a device hits this service periodically, gets a piece of data, and caches it in device memory. Since the device does not require the data immediately, I designed it this way:

Device: puts a request on a queue (SQS)
Service: polls messages from SQS, processes the requests, and publishes the results to devices via FCM
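As a rough sketch, the service's polling side might look like this (boto3 and the queue URL are assumed; `handler` is a placeholder for the downstream calls and the FCM publish):

```python
def poll_and_process(sqs, queue_url, handler):
    """Long-poll SQS once, process each message, and delete it on success."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # receive up to 10 messages per poll
        WaitTimeSeconds=20,      # long polling reduces empty receives
    )
    for msg in resp.get("Messages", []):
        handler(msg["Body"])     # call downstream services, publish via FCM
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])

# In the service this would run in a loop, e.g.:
#   sqs = boto3.client("sqs")
#   while True:
#       poll_and_process(sqs, QUEUE_URL, publish_result_via_fcm)
```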

The problem is that the service takes at most 2 seconds to process a request, and the downstream services cannot scale as much as the service itself. In short, I can only process a fraction of the incoming requests per second (say 200 requests, 20% of the actual TPS). This leads to backpressure building up in the request queue. Reading around the Internet, I found that the general strategies for handling backpressure are:

  1. Make the queue bounded, throttle the producer when the queue exceeds its size limit, and have the producer retry after some delay
  2. Increase the number of consumers (not an option here due to the downstream bottleneck)
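Strategy 1 can be sketched with an in-process bounded queue standing in for SQS (SQS itself has no hard size cap, so in practice you would enforce the bound yourself, for example by checking the queue's ApproximateNumberOfMessages attribute before sending). The producer backs off and retries, and eventually gives up:

```python
import queue
import time

def produce_with_backoff(q, item, max_retries=3, base_delay=0.05):
    """Try to enqueue; when the bounded queue is full, back off and retry.
    Returns False once retries are exhausted -- the caller must decide
    what to do with the dropped request."""
    for attempt in range(max_retries + 1):
        try:
            q.put_nowait(item)
            return True
        except queue.Full:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return False

q = queue.Queue(maxsize=2)
results = [produce_with_backoff(q, i, max_retries=1, base_delay=0.001)
           for i in range(4)]
# With no consumer draining the queue, only the first two puts succeed:
# results == [True, True, False, False]
```

Note that this only shifts the failure to the producer side, which is exactly the point of the first question below.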

Questions:

  1. How does throttling help solve the backpressure problem? If the TPS is consistent, wouldn't it create the same problem even when producers retry after a delay? In the end, some producers will exhaust their retries and those requests will go unprocessed.

  2. Initially I wasn't aware of backpressure and thought that storing messages in a queue would aid asynchronous processing, but now I am starting to feel the queue is creating more problems than it solves. Is a queue even relevant for this use case?

  3. What are the real benefits of having a queue in front of the service?

Appreciate any help!!

Best Answer

A good analogy here is to think of a dam on a river. The river corresponds to the incoming data, and the dam to your consumer. There are three possibilities at any given point:

  1. The incoming river's flow is greater than the dam's outflow
  2. The incoming river's flow is less than the dam's outflow
  3. The incoming river's flow is the same as the dam's outflow

In scenario 1, a lake grows behind the dam. This corresponds to your queue. In scenario 2, the lake shrinks. In scenario 3, the lake's volume doesn't change. A big part of why we have dams is to make the downstream flow more consistent. That is, when there's a heavy rain, the lake will get larger but the flow out can be limited. When there is a drought, the lake's reserves are drawn down to keep the outflow higher than it would be otherwise.

So the volume of the lake equals the total inflow minus the total outflow. There is a limit, however, to how much water the lake can hold. When the lake is full, you either need to release more water or somehow divert the incoming water.

This corresponds to the iron law of queuing: the depth of the queue is the number of messages received minus the number of messages processed (or removed). There's no magic. If you don't, on average, pull as many messages off the queue as you put on it, the queue will grow and eventually hit some sort of size limitation. Queues don't allow you to process more messages; they act as a buffer to help even out the flow and prevent failures when the incoming volume spikes. They also help with distributing messages to multiple consumers efficiently. Alternately, they can be used as a 'holding area' for batch processing, as noted by 'supercat' in the comments. But the law still holds: your overall processing rate must accommodate the incoming rate or your queue will grow.
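The iron law is simple enough to check with a few lines of arithmetic. A sketch using the question's numbers (1000 TPS in, 200 TPS out):

```python
def queue_depth(arrival_rate, service_rate, seconds):
    """Depth grows each second by (arrivals - processed); it never goes
    below zero because you can't process messages that aren't there."""
    depth = 0
    for _ in range(seconds):
        processed = min(service_rate, depth + arrival_rate)
        depth = depth + arrival_rate - processed
    return depth

# 1000 TPS in, 200 TPS out: the backlog after one minute is
# (1000 - 200) * 60 = 48,000 messages -- and it keeps growing.
print(queue_depth(1000, 200, 60))   # → 48000
# Matched rates hold the depth steady:
print(queue_depth(200, 200, 60))    # → 0
```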

The upshot: to resolve this issue, you need to either send fewer messages to the queue or process them faster. There is no other solution. Backpressure is actually a good thing in many scenarios: it lets producers know when the queue is filling up so they can react.
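One way producers can react is graduated load shedding: read the queue depth (with SQS, the ApproximateNumberOfMessages attribute), accept everything below a low watermark, reject everything above a high watermark, and shed a growing fraction in between. A sketch, with hypothetical watermark values:

```python
import random

def admit(depth, low=1000, high=10000):
    """Decide whether a producer should send, given the current queue depth.
    Below `low`: always send. Above `high`: always shed. In between:
    shed a fraction that grows linearly with the depth."""
    if depth <= low:
        return True
    if depth >= high:
        return False
    reject_prob = (depth - low) / (high - low)
    return random.random() >= reject_prob

# A shallow queue admits everything; a saturated queue sheds everything:
print(admit(0))        # → True
print(admit(50000))    # → False
```

Shedding at the producer fits this use case well: since the devices only refresh a cache, a rejected request can simply be retried on the next periodic poll.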

It sounds to me like your issue is the 'downstream bottleneck'. You will never be able to process the volume you have coming in until you resolve that. A queue will simply delay the point at which you can no longer accept incoming data.
