REST – How to Determine Transient Exceptions?

Tags: architectural-patterns, architecture, exceptions, http, rest

As I try to deal with the possibility of failure when invoking RESTful endpoints, or any HTTP endpoint in general, I've been wondering whether the HTTP specification or the industry offers a standard or pattern for determining when an exception is transient and when it is not, particularly for retry purposes.

Since a remote service might fail at any moment, determining when we ought to retry an operation should be a fundamental feature of our distributed architecture.

  • I started by considering method idempotency (i.e. GET, PUT, and DELETE are safe to retry according to our service contracts).
  • Then I considered the status code: for example, 4xx errors are not retried, but 5xx errors may be retried depending on the type of error, i.e. whether it is transient or not. If the remote service uses a database and I get a 5xx error caused by a query timeout, then it's safe to retry, since the condition is transient and the service invocation will most likely succeed on a later attempt. However, there are other types of errors that are not transient and that I would prefer not to retry. For example, we once had a DBA add a new constraint to a table, and a service that had been working started failing with a non-transient error (a constraint violation).
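The two checks in the list above (method idempotency first, then status code) could be sketched roughly as follows. This is only an illustration: the function name and the set of status codes treated as transient are assumptions made for the example, not something defined by the HTTP specification, and the set of idempotent methods should really come from your own service contracts.

```python
# Hypothetical retry policy. The names and the chosen status set are
# assumptions for illustration only.

# Methods that HTTP defines as idempotent; a service contract may narrow this.
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}

# Assumption: only these 5xx codes are treated as transient by this client.
# A plain 500 is deliberately excluded because it may hide a permanent fault
# such as a constraint violation.
TRANSIENT_STATUSES = {502, 503, 504}


def is_retryable(method: str, status: int) -> bool:
    """Return True only if the request is idempotent AND the status looks transient."""
    if method.upper() not in IDEMPOTENT_METHODS:
        return False
    return status in TRANSIENT_STATUSES
```

With this policy a PUT that fails with 503 would be retried, while a POST that fails with 503, or a GET that fails with 500 or 404, would not.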

In the past, we made the mistake of retrying a 500 error indefinitely, thinking that such errors are always transient and that the remote service would eventually recover and handle the request. This was particularly tempting in computer-to-computer interactions (orchestrations), where we'd like to avoid propagating an exception at all costs, since doing so requires complex compensating transactions and publishing partially processed requests to a DLQ for later human intervention.
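One way to avoid the infinite-retry trap described above is to cap the number of attempts and back off between them, so that a non-transient failure eventually surfaces (for example, to the DLQ) instead of looping forever. The sketch below is a minimal illustration under that assumption; `TransientError`, `call_with_retries`, and all default values are hypothetical names and numbers chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures the caller believes are temporary."""


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on TransientError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Retry budget exhausted: let the caller compensate or dead-letter.
                raise
            # Delay doubles each time (0.5, 1.0, 2.0, ...) up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")
```

Any exception other than `TransientError` propagates immediately, so a permanent fault is never retried. The `sleep` parameter is injectable only to make the function easy to test.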

In general, I'd like to know how people in the industry usually deal with this issue: how is this property of "transientness" conveyed from the server to the client over HTTP?

Should I resort to custom HTTP status codes, or should I communicate this property in a header or in a property of the body?

Is there a standardized solution for this?

That would be awesome because, if there is one, I could expect third parties to consume my services knowing they will behave correctly, and I could consume theirs without having to reconfigure my retry protocol for their particular implementation.

I welcome any suggestions on retryability protocols that would help me design good service contracts and build services that are good citizens of our ecosystem, since they may need to integrate with third-party services in the future.

So far I am particularly surprised to discover that a distributed architecture pattern like this does not make retryability a first-class citizen, but it could be that I am mistaken in my interpretation of how the implementation should work. Can anybody please point me in the right direction?

Best Answer

For your transient failures, don't return 500. Return 503 Service Unavailable. The query timeout is being caused by "a temporary overload", not by an underlying error in the code. In your response, include a Retry-After header indicating how long the client should wait before retrying the request.
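On the client side, honoring this could look something like the sketch below. It assumes the response headers are available as a plain dict keyed by the exact name "Retry-After" (real HTTP libraries usually offer case-insensitive lookup), and the function name is hypothetical. Retry-After may carry either a number of seconds or an HTTP date, so both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(
    status: int,
    headers: Mapping[str, str],
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Return the wait in seconds for a 503 with a usable Retry-After, else None."""
    if status != 503:
        return None
    value = headers.get("Retry-After")
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "Retry-After: 120".
        return float(value)
    try:
        # HTTP-date form, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # Unparseable header: treat as "no guidance given".
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means the client may retry immediately.
    return max(0.0, (when - now).total_seconds())
```

A caller might treat a `None` result as "not declared transient by the server" and fall back to its own policy, rather than assuming the error will clear up.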