Microservices – Are retries enough?

architectural-patterns microservices

I have been asked to look at moving our current architecture to microservices. I am aware of the warning to always assume a request could fail.

So I am aware we should always be prepared to retry the request. However, when designing this, I am also assuming that the retry can fail.

So with that in mind, we have been looking at a pattern where either all the processing is committed or it is all rolled back. This is achieved via the message Outbox (and Inbox) pattern. The service stores its functional changes in its database and, within the same transaction, stores the corresponding event messages in an Outbox table in that same database. A separate dispatcher service then reads the messages from the Outbox and sends them to a messaging system. It is detailed in this series of articles: Life Beyond Distributed Transactions: An Apostate's Implementation
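
Roughly what I have in mind for the write side (a minimal sketch using sqlite3 just to keep it self-contained; the `orders`/`outbox` tables and the `OrderPlaced` event are only illustrative, not our real schema):

```python
import json
import sqlite3
import uuid

# Illustrative schema: the business table and the outbox table live in the same database.
conn = sqlite3.connect("orders.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE IF NOT EXISTS outbox ("
    "id TEXT PRIMARY KEY, topic TEXT, payload TEXT, dispatched INTEGER DEFAULT 0)"
)

def place_order(order_id: str) -> None:
    """Write the functional change and its event message in ONE local transaction."""
    with conn:  # sqlite3 commits on success and rolls back on an exception
        conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)", (order_id, "PLACED"))
        event = {"type": "OrderPlaced", "order_id": order_id}
        conn.execute(
            "INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "orders", json.dumps(event)),
        )

place_order(str(uuid.uuid4()))
```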

To me this is the safest option because if the dispatcher fails to send the message, it is available for a retry.
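
And the dispatcher side, continuing the same sketch (`publish` is a stand-in for a real broker client, not a real API):

```python
import sqlite3

conn = sqlite3.connect("orders.db")  # same database as the outbox table above

def publish(topic: str, payload: str) -> None:
    """Placeholder for the real broker client (Kafka, RabbitMQ, ...)."""
    print(f"sent to {topic}: {payload}")

def dispatch_pending(batch_size: int = 100) -> None:
    """Send undispatched outbox rows. A row is only marked as dispatched after a
    successful send, so a broker outage or a crash simply leaves it in place
    for the next run."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE dispatched = 0 LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, topic, payload in rows:
        try:
            publish(topic, payload)
        except Exception:
            continue  # leave the row for a later retry
        with conn:
            conn.execute("UPDATE outbox SET dispatched = 1 WHERE id = ?", (row_id,))

# Run this periodically (scheduler, cron, or a simple polling loop).
dispatch_pending()
```

Note that this gives at-least-once delivery: if the dispatcher dies between the send and the UPDATE, the same message goes out again on the next run, which is the kind of duplicate the Inbox on the consuming side would filter out.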

However, one of my colleagues thinks that although we do need retries, the solution will be resilient enough that the message will always, eventually, be successfully sent to the messaging system. That is, the issue that causes the need for a retry will always be transient and will clear in time for one of our retries to succeed.
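
As I understand it, that position amounts to relying on a plain in-process retry, something like this (a sketch; `request` stands in for whatever remote call we are making):

```python
import random
import time

def call_with_retries(request, attempts: int = 5, base_delay: float = 0.5):
    """Plain in-process retry with exponential backoff and jitter.
    The pending retry lives only in this process's memory, so it is lost
    if the host shuts down part-way through (my concern below)."""
    for attempt in range(attempts):
        try:
            return request()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```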

I'm looking for opinions on whether I'm being over-cautious and retries alone should be enough, in which case I do not need the dispatcher or the Outbox pattern.

I guess the main problem is not that a service I am calling cannot be reached, but that the server my own service is running on shuts down, taking any in-memory retries with it.

Best Answer

The outbox-plus-retries strategy will work, but it adds a lot of complexity to each microservice. Because this strategy by necessity forces you into an asynchronous approach, consider using an event-driven architecture.

The microservices communicate through events published on highly available event channels (Kafka, RabbitMQ, ...). Each service pulls events from the channel, does whatever it should in response to them, and produces new events as a consequence (a consumer sketch follows the list below).

  • These are events, not messages: they describe only what a service did, not what it wants another service to do. Correlation IDs are used to observe which actions happened as a consequence of broadcasting an event.
  • The CQRS pattern is used to build read views of the necessary data from the broadcast events. In this way each service can handle all read requests without any synchronous call to another service. The consequences of a write on other services are not handled by the writing microservice itself, but by the other services that subscribe to the event it publishes when the write happens.
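
A minimal sketch of such a pull-based consumer, assuming Kafka via the kafka-python package (the topic, group and event names are only illustrative, and a broker has to be running):

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Pull events from the channel instead of having other services push requests at us.
consumer = KafkaConsumer(
    "order-events",                       # events this service subscribes to
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    enable_auto_commit=False,             # commit only after the event has been handled
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    if event.get("type") == "OrderPlaced":
        # Do whatever this service should in response, then announce what *we* did
        # (an event), not what anyone else should do (a command).
        invoice_event = {
            "type": "InvoiceCreated",
            "order_id": event["order_id"],
            "correlation_id": event.get("correlation_id"),
        }
        producer.send("billing-events", invoice_event)
        producer.flush()
    consumer.commit()  # acknowledge only after successful processing
```

Note the manual commit: the offset is only advanced after the event has been handled, so a crash mid-processing just means the event is redelivered on restart.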

The benefits of this approach:

  • Microservices need not have any awareness of other microservices. They are only aware of which events they subscribe to.
  • The event channel is separate from the microservices, allowing a simpler microservice design while still guaranteeing event delivery.
  • The pull-instead-of-push design eliminates the need for retries. Either a service is up and pulls events from the channel, or it is down and does not pull them; the events simply remain in the channel until the service comes back.

The downsides are that you will have to redesign everything in the Event Sourcing style (which can be difficult), and that the explosion of events can be hard to manage (solutions: schema stores, AsyncAPI / CloudEvents.io, correlation and causation IDs, and a logging service that observes all events and records them).
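
For the correlation and causation IDs mentioned above, a small envelope type is usually enough; a sketch (the field names are my own convention, not a standard):

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class EventEnvelope:
    type: str
    payload: dict
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    correlation_id: Optional[str] = None  # ties together everything triggered by one initial action
    causation_id: Optional[str] = None    # the event that directly caused this one

def caused_by(trigger: EventEnvelope, event_type: str, payload: dict) -> EventEnvelope:
    """Derive a new event from the one that triggered it, carrying the
    correlation id forward and recording the trigger as the cause."""
    return EventEnvelope(
        type=event_type,
        payload=payload,
        correlation_id=trigger.correlation_id or trigger.event_id,
        causation_id=trigger.event_id,
    )
```

A logging service that subscribes to everything can then reconstruct the whole causal chain from these two fields.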
