I have been asked to look at moving our current architecture to microservices. I am aware of the warning to always assume a request could fail, and so that we should always be prepared to retry it. However, when designing this, I am also assuming that the retry itself can fail.
With that in mind, we have been looking at a pattern where either all the processing is committed or it is all rolled back. This is achieved via the Outbox (and Inbox) pattern. Each service stores its functional changes in its database and, within the same transaction, stores the corresponding event messages in an Outbox in that database. A separate dispatcher service then reads the messages from the Outbox and sends them to a messaging system. It is detailed in the article series Life Beyond Distributed Transactions: An Apostate's Implementation.
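To make the idea concrete, here is a minimal sketch of the Outbox pattern using sqlite3 as a stand-in database. The table names, the `place_order` operation, and the `send` callable are my own illustrations, not taken from the article series:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, sent INTEGER DEFAULT 0)")

def place_order(order_id: int) -> None:
    # The business change and the event are written in ONE local
    # transaction, so either both are committed or neither is.
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (f'{{"order_placed": {order_id}}}',))

def dispatch_outbox(send) -> None:
    # The dispatcher runs separately. If send() raises, the row stays
    # unsent and is simply picked up again on the next pass.
    rows = conn.execute("SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, payload in rows:
        send(payload)  # may fail; row is only marked sent on success
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
        conn.commit()

place_order(1)
sent = []
dispatch_outbox(sent.append)
print(sent)  # ['{"order_placed": 1}']
```

The key property is that a crash between the commit and the dispatch loses nothing: the unsent row survives in the database and the dispatcher picks it up on restart.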
To me this is the safest option because if the dispatcher fails to send the message, it is available for a retry.
However, one of my colleagues thinks that, although we need to retry, the solution will be resilient enough that the message will always eventually be sent to the messaging system. I.e. the issue that causes the need for the retry will always be transient, and will clear in time for one of our retries to succeed.
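My colleague's position amounts to an in-memory retry loop, sketched below with exponential backoff. The `flaky_send` function and its failure pattern are hypothetical; the point is that this only helps while the process stays alive:

```python
import time

def send_with_retries(send, message, attempts=5, base_delay=0.01):
    # Plain in-memory retry with exponential backoff. If the host shuts
    # down mid-retry, the message is lost with the process -- which is
    # exactly the gap a persisted outbox closes.
    for attempt in range(attempts):
        try:
            send(message)
            return True
        except ConnectionError:
            time.sleep(base_delay * (2 ** attempt))
    return False

# Simulate a transient failure that clears on the third attempt.
calls = {"n": 0}
def flaky_send(msg):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("broker unavailable")

ok = send_with_retries(flaky_send, "event")
print(ok)  # True
```

This covers my colleague's transient-failure case, but the retry state lives only in memory.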
I'm looking for opinions on whether I'm being over-cautious and retries alone should be enough, meaning I would not need the dispatcher or the Outbox pattern.
I guess the main problem is not that a service I am calling cannot be reached, but that the server my own service is running on shuts down.
Best Answer
The outbox-and-retries strategy will work, but it adds a lot of complexity to each microservice. Since this strategy by necessity forces you into an asynchronous approach, consider going all the way to an event-driven architecture.
The microservices communicate through events published on highly available event channels (Kafka, RabbitMQ, ...). Each service pulls events from the channel, does whatever it should in response to them, and produces new events as a consequence.
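The consume/react/publish loop can be sketched as follows. Plain Python lists stand in for the event channels, and the `billing_service` handler and event shapes are my own illustrative assumptions:

```python
# In-memory stand-ins for two event channels (e.g. Kafka topics).
orders_topic = [{"type": "order_placed", "order_id": 1}]
invoices_topic = []

def billing_service(inbound, outbound):
    # Pull events from the inbound channel, react to each one, and
    # publish any new events onto the outbound channel.
    while inbound:
        event = inbound.pop(0)
        if event["type"] == "order_placed":
            outbound.append({"type": "invoice_created",
                             "order_id": event["order_id"]})

billing_service(orders_topic, invoices_topic)
print(invoices_topic)  # [{'type': 'invoice_created', 'order_id': 1}]
```

With a real broker, the channel's own durability and consumer offsets replace the hand-rolled dispatcher: a service that crashes re-reads the events it has not yet acknowledged.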
The benefits of this approach: the channel itself is durable and highly available, so delivery and redelivery become the broker's responsibility rather than each service's, and the services are decoupled from one another, needing to know only about events rather than about each other.
The downsides are that you will have to redesign everything in an Event Sourcing style (which can be difficult), and that the explosion of events can be hard to manage (possible solutions: schema stores, AsyncAPI / Cloudevents.io, correlation and causation IDs, a logging service that observes and records all events).
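Correlation and causation IDs, one of the mitigations mentioned above, can be sketched in a few lines. The envelope fields here are a common convention, not a fixed standard: the correlation ID groups every event in one business flow, while the causation ID points at the specific event that triggered this one.

```python
import uuid

def new_event(event_type, cause=None):
    # A root event starts a new correlation chain; a caused event
    # inherits the correlation id and records its direct cause.
    eid = str(uuid.uuid4())
    return {
        "id": eid,
        "type": event_type,
        "correlation_id": cause["correlation_id"] if cause else eid,
        "causation_id": cause["id"] if cause else None,
    }

placed = new_event("order_placed")
invoiced = new_event("invoice_created", cause=placed)
assert invoiced["correlation_id"] == placed["correlation_id"]
assert invoiced["causation_id"] == placed["id"]
```

A logging service that records these fields can then reconstruct the full causal chain of any business flow across services.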