AWS – SQS Trigger to Lambda Disabled When Lambda Fails

amazon-lambda amazon-web-services

We have some Lambdas triggered by SQS queues. The Lambdas do intensive inserts into DynamoDB tables. The DynamoDB tables have autoscaling write capacity.

At peak load, a large number of messages arrive and the Lambdas start to fail with ProvisionedThroughputExceededException. DynamoDB needs minutes to scale up.

We expect that when a Lambda invocation fails, the messages return to SQS and are processed again after the visibility timeout. This seems correct, because by then DynamoDB has scaled up and should be able to handle the increased write load.

However, we see something strange. When the number of Lambda execution errors grows, the SQS trigger is automatically disabled. The Lambda stops executing and messages accumulate in the queue.

Manually re-enabling the trigger causes even more failures, because DynamoDB has still not scaled up while the number of messages waiting in the queue has dramatically increased.

Only manually increasing the DynamoDB write capacity helps.

Why does the SQS trigger get disabled? This behavior is not documented.

How can we avoid the trigger being disabled?

In general, what is the recommended way to apply "backpressure", i.e. to limit the speed at which a Lambda polls messages from SQS?

Best Answer

I'm not sure why the Lambda stops being invoked. I suspect the Lambda service notices that it keeps failing and temporarily suspends the event source, but I can't say for certain.

You can try a number of workarounds:

  • Use DynamoDB on-demand capacity; AWS says it scales instantly.
  • Alternatively, if you use provisioned capacity and hit a ProvisionedThroughputExceededException, don't abort the Lambda execution; instead, re-insert the message into the SQS queue and exit successfully. That way the Lambda service won't see any failures, and no SQS messages will be lost either.
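The second workaround could be sketched roughly like this. This is a hedged example, not production code: the queue URL and table name are placeholders, the DynamoDB clients are injectable so the logic can be exercised without AWS, and the throttling check inspects the error code that botocore attaches to its ClientError:

```python
import json

# Hypothetical names -- substitute your own queue URL and table.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"
TABLE_NAME = "my-table"


def _is_throughput_error(exc):
    # botocore's ClientError carries the service error code in exc.response.
    code = getattr(exc, "response", {}).get("Error", {}).get("Code", "")
    return code == "ProvisionedThroughputExceededException"


def handler(event, context, sqs=None, dynamodb=None):
    # Clients are created lazily so the module imports without AWS config;
    # in a real Lambda you would just use boto3 clients directly.
    if sqs is None or dynamodb is None:
        import boto3
        sqs = sqs or boto3.client("sqs")
        dynamodb = dynamodb or boto3.client("dynamodb")

    for record in event["Records"]:
        try:
            dynamodb.put_item(TableName=TABLE_NAME,
                              Item=json.loads(record["body"]))
        except Exception as exc:
            if _is_throughput_error(exc):
                # Put the message back with a delay and keep going, so the
                # invocation itself succeeds and the trigger stays enabled.
                sqs.send_message(QueueUrl=QUEUE_URL,
                                 MessageBody=record["body"],
                                 DelaySeconds=60)
            else:
                raise
    return {"status": "ok"}
```

The DelaySeconds on the re-sent message gives DynamoDB a little time to scale before the message is seen again.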

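The first workaround, switching the table to on-demand capacity, is a single UpdateTable call. A minimal sketch, assuming boto3 and a hypothetical table name (the client is injectable only so the call can be demonstrated without AWS credentials):

```python
def switch_to_on_demand(table_name, client=None):
    # Flip a DynamoDB table from provisioned to on-demand billing.
    if client is None:
        # Real AWS call path; requires boto3 and credentials.
        import boto3
        client = boto3.client("dynamodb")
    return client.update_table(TableName=table_name,
                               BillingMode="PAY_PER_REQUEST")
```

The same change can also be made in the console under the table's capacity settings.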
Something along these lines could help :)