Pub-Sub – Handling Failing Subscribers in C#

asp.net-core c# cqrs domain-driven-design pubsub

I'm working on a microservices application that is built from the ground up with MediatR and CQRS. We have a list of domain events that are published via MediatR (a simple pub-sub library that implements a mediator for commands and events) as soon as a Command (CQRS) commits to the database.

Let's imagine a new user was created in our application through a Command (CQRS). After the resource was created, an event announcing it was published through MediatR, and we have 5 subscribers to this event.

What should happen if 1 of those 5 subscribers fails and the whole transaction needs to be undone? Please assume that I'm talking about domain failures, not things like unhandled exceptions. The failing subscriber can't proceed because its domain knowledge says so (e.g. a payment transaction in a subscriber of a domain event could fail because the user doesn't have enough money).

Currently, in this project we simply call MediatR.Publish (an async method) in a fire-and-forget manner, without awaiting the task or checking its result. I'm concerned that the application can end up in an inconsistent state if we have failing subscribers.

Best Answer

Having 5 subscribers for one event seems appropriate only if all 5 actions are independent, in that the failure of any one doesn't affect the others.

If the failure of one subscriber, A, affects another, B, then delay the execution of B until the success of A is known. Instead of 5 subscribers to one event, you now have 4 subscribers to that event, and A generates a new event that B subscribes to in place of the original event. If A fails, B never gets an event.
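A minimal sketch of that chaining, under some assumptions: the event names (`UserCreated`, `PaymentCompleted`), the sign-up fee, and the `InMemoryBus` class are all hypothetical and only stand in for MediatR so the example runs without any package. With MediatR, each subscriber would instead be an `INotificationHandler<T>`, and A would publish the follow-up event through `IMediator.Publish`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical events, named for illustration only.
public record UserCreated(Guid UserId, decimal Balance);
public record PaymentCompleted(Guid UserId);

// Tiny in-memory stand-in for a mediator; this is NOT MediatR's API.
public sealed class InMemoryBus
{
    private readonly Dictionary<Type, List<Func<object, Task>>> _handlers = new();

    public void Subscribe<TEvent>(Func<TEvent, Task> handler) where TEvent : notnull
    {
        if (!_handlers.TryGetValue(typeof(TEvent), out var list))
            _handlers[typeof(TEvent)] = list = new List<Func<object, Task>>();
        list.Add(e => handler((TEvent)e));
    }

    // Awaits every handler in subscription order, so the caller sees the outcome.
    public async Task PublishAsync<TEvent>(TEvent @event) where TEvent : notnull
    {
        if (!_handlers.TryGetValue(typeof(TEvent), out var list)) return;
        foreach (var handler in list.ToArray())
            await handler(@event);
    }
}

public static class ChainedEventsDemo
{
    public static async Task<List<string>> RunAsync(decimal balance)
    {
        var log = new List<string>();
        var bus = new InMemoryBus();
        const decimal signupFee = 10m; // assumed threshold for the domain rule

        // Subscriber A: charges the user and raises a follow-up event only on success.
        bus.Subscribe<UserCreated>(async e =>
        {
            if (e.Balance < signupFee) { log.Add("payment declined"); return; }
            log.Add("payment completed");
            await bus.PublishAsync(new PaymentCompleted(e.UserId));
        });

        // An independent subscriber stays on the original event.
        bus.Subscribe<UserCreated>(e => { log.Add("welcome email sent"); return Task.CompletedTask; });

        // Subscriber B depends on A, so it listens to A's event instead of UserCreated.
        bus.Subscribe<PaymentCompleted>(e => { log.Add("premium features enabled"); return Task.CompletedTask; });

        await bus.PublishAsync(new UserCreated(Guid.NewGuid(), balance));
        return log;
    }
}
```

Calling `ChainedEventsDemo.RunAsync` with a balance below the assumed fee leaves "premium features enabled" out of the log, while the independent welcome-email subscriber still runs. Note that the publish is awaited rather than fired and forgotten, which is what lets the caller observe the result at all.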

Otherwise, if there is a circularity in failures, yet subscribers can tolerate temporary inconsistency (e.g. all that's needed is to reclaim resources such as a reservation), an undo or error event can be generated.
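As a rough illustration of the undo-event idea, here is a sketch in which a billing subscriber reports its domain failure as an event and a reservation subscriber reacts by releasing what it reserved. Everything named here (`UserCreated`, `UserCreationRejected`, the synchronous `InMemoryBus`, the balance threshold of 10) is an assumption made for the example, not something from MediatR or from the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical events, named for illustration only.
public record UserCreated(Guid UserId, decimal Balance);
public record UserCreationRejected(Guid UserId, string Reason);

// Tiny synchronous stand-in for a mediator; this is NOT MediatR's API.
public sealed class InMemoryBus
{
    private readonly List<Action<object>> _subscribers = new();

    public void Subscribe<TEvent>(Action<TEvent> handler) =>
        _subscribers.Add(e => { if (e is TEvent typed) handler(typed); });

    public void Publish(object @event)
    {
        // Iterate over a copy so handlers may publish further events safely.
        foreach (var subscriber in _subscribers.ToArray())
            subscriber(@event);
    }
}

public static class CompensationDemo
{
    // Returns true if the reservation is still held after all handlers have run.
    public static bool Run(decimal balance)
    {
        var bus = new InMemoryBus();
        var reservations = new HashSet<Guid>();

        // Reservation subscriber: reserves on creation, releases on the undo event.
        bus.Subscribe<UserCreated>(e => reservations.Add(e.UserId));
        bus.Subscribe<UserCreationRejected>(e => reservations.Remove(e.UserId));

        // Billing subscriber: reports a domain failure as an event, not an exception.
        bus.Subscribe<UserCreated>(e =>
        {
            if (e.Balance < 10m) // assumed domain rule
                bus.Publish(new UserCreationRejected(e.UserId, "insufficient funds"));
        });

        var userId = Guid.NewGuid();
        bus.Publish(new UserCreated(userId, balance));
        return reservations.Contains(userId);
    }
}
```

Between the reservation being made and the undo event arriving, the system is briefly inconsistent, which is exactly the condition this option requires subscribers to tolerate.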

And finally, if we cannot tolerate temporary inconsistencies and there is also a circularity in failures affecting others, then you will need to set up a kind of commit/rollback capability, where each subscriber tentatively does its operation and reports by event whether it can succeed or fail, and some other handler listens for either total success or partial failure and informs the others to commit or roll back. (If failure in the commit process itself is possible even after reporting success, then we are in two-phase commit territory.)
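The shape of such a coordinator might look like the sketch below. It is deliberately simplified: participants are called directly instead of reporting through events, and commit is assumed never to fail, so it is not a real two-phase commit. The `IParticipant` interface, `Coordinator`, and `RecordingParticipant` are hypothetical names introduced only for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical contract for a subscriber that can work tentatively.
public interface IParticipant
{
    bool Prepare(Guid userId);   // do the work tentatively; false means a domain failure
    void Commit(Guid userId);    // make the tentative work permanent
    void Rollback(Guid userId);  // discard the tentative work
}

// Collects every participant's answer, then tells them all to commit or roll back.
public sealed class Coordinator
{
    private readonly IReadOnlyList<IParticipant> _participants;

    public Coordinator(IReadOnlyList<IParticipant> participants) => _participants = participants;

    public bool Execute(Guid userId)
    {
        var prepared = new List<IParticipant>();
        foreach (var participant in _participants)
        {
            if (participant.Prepare(userId))
            {
                prepared.Add(participant);
                continue;
            }

            // Partial failure: undo everyone who already prepared, then stop.
            foreach (var done in prepared) done.Rollback(userId);
            return false;
        }

        // Total success: everyone prepared, so everyone commits.
        foreach (var participant in prepared) participant.Commit(userId);
        return true;
    }
}

// Test double that records which phase it ended in.
public sealed class RecordingParticipant : IParticipant
{
    private readonly bool _canPrepare;
    public string State { get; private set; } = "idle";

    public RecordingParticipant(bool canPrepare) => _canPrepare = canPrepare;

    public bool Prepare(Guid userId)
    {
        if (!_canPrepare) { State = "refused"; return false; }
        State = "prepared";
        return true;
    }

    public void Commit(Guid userId) => State = "committed";
    public void Rollback(Guid userId) => State = "rolled back";
}
```

With one participant built as `new RecordingParticipant(false)`, `Execute` returns `false` and any participant that prepared earlier ends in the "rolled back" state. Even this stripped-down version needs three operations per subscriber plus a coordinator, which hints at why this route is the costly one.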

It would probably be best to avoid the latter.