Should you retry 500 API errors

api, api-design, error handling

My team and I are integrating with a third-party company and using their API to perform different CRUD operations. Their API isn't always reliable, though. Maybe 0.1% of the time an API call just fails with a 500 error, and then when you try again, it works fine. Occasionally there are periods where the API call fails with a 500 error more than 90% of the time, and it only works after ~10 attempts. These periods usually last about 4 hours and occur every few months. We went through one yesterday.

The third party told us that we should implement retry logic when receiving a 500 error. For idempotent operations, I could see this being useful. However, for non-idempotent operations it seems like it could be dangerous. For example, one API call might be to send an email. We've noticed at times that the API might return a 500 error even though the email was sent. If I retry the API call 10 times, I might get ten 500 errors, but the recipient might still get 10 emails. This could be very bad.

Should we actually be adding retry logic for 500 errors on operations that are not idempotent?

My two thoughts on when we should add retry:

  1. When we are okay with the operation running more than once
  2. When the operation failing is worse than the operation succeeding 10 times (or whatever the retry limit is).

I've never had to add any retry logic to API calls, as the chance of getting a 500 error is usually very slim. Is it reasonable to have to do this?

Best Answer

This boils down to what is worse for the specific non-idempotent operation:

  • when it is executed more than once, or

  • when it is not executed at all.

This is not a technical design decision; it depends on the specific operation and on the consequences of such failures in the domain.
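Once that per-operation decision is made, it can be encoded explicitly at each call site. The following is only a minimal sketch: `ServerError`, `call_with_retry`, and the `safe_to_repeat` flag are hypothetical names that stand in for however your client surfaces a 500 response, and the limit of 10 attempts is simply the figure from the question.

```python
import random
import time


class ServerError(Exception):
    """Hypothetical stand-in for an HTTP 500 response from the third-party API."""


def call_with_retry(operation, *, safe_to_repeat, max_attempts=10,
                    base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying on ServerError only if repeating it is acceptable.

    `safe_to_repeat` is the domain decision: True when the operation is
    idempotent, or when running it several times is better than not at all.
    """
    attempts = max_attempts if safe_to_repeat else 1
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ServerError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff with jitter, so retries do not hammer an
            # API that is already struggling.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

With this shape, an idempotent read would be called with `safe_to_repeat=True`, while something like sending an email would get `safe_to_repeat=False` unless duplicates are judged to be the lesser evil.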

As a technical measure, you could check whether there are additional options for validating that a certain operation has happened (at least with a certain probability). Let us take the email example: if the API allows you to send email requests with an included blind-copy request, you may utilize this feature to send your own system such a blind copy. That gives you an additional way to check whether the original email was sent or not. Of course, the blind copy may not reach your system in time because of network lag, but you can still reduce the frequency of duplicate email operations through the API.

Or maybe the API itself will allow you to inspect a certain status or logging information for certain operations. Though these calls can fail as well, at least you may be able to lower the failure rate to an acceptable degree by using these features.
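A verify-before-retry loop along these lines might look like the sketch below. It assumes two hypothetical callables supplied by you: `send`, which performs the non-idempotent call, and `was_delivered`, which is backed by whatever evidence you have (a blind copy arriving in your own inbox, or a status call, if the API offers one). Nothing here reflects the third party's actual API.

```python
from enum import Enum


class ServerError(Exception):
    """Hypothetical stand-in for an HTTP 500 response from the third-party API."""


class Outcome(Enum):
    CONFIRMED = "confirmed"  # the operation is known to have happened
    UNKNOWN = "unknown"      # it may or may not have happened
    FAILED = "failed"        # every attempt failed and none could be confirmed


def send_with_verification(send, was_delivered, max_attempts=10):
    """Retry `send` on ServerError, but only after checking it did not succeed."""
    for _ in range(max_attempts):
        try:
            send()
            return Outcome.CONFIRMED
        except ServerError:
            # The 500 may be a false negative, so look for evidence first.
            try:
                delivered = was_delivered()
            except ServerError:
                # The check itself failed. Stop here rather than risk a
                # duplicate; the caller decides what to do with UNKNOWN.
                return Outcome.UNKNOWN
            if delivered:
                return Outcome.CONFIRMED
    return Outcome.FAILED
```

Stopping on `UNKNOWN` is the conservative choice for operations where duplicates are worse than a miss; if the opposite holds in your domain, retrying in that branch would be the natural change. Because a blind copy or status record may lag behind the original call, the check lowers the duplicate rate rather than eliminating it.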

As a general engineering principle: if you have an unreliable component in your system, create enough redundancy to make the system "reliable enough".
