REST – Best practice for handling asynchronous intercommunication

distributed computing, rest, web services

I recently completed a project for handling credit card processing. One of the difficulties I faced was handling the delay or possible failure of notification messages. The most complex example was:

  • an external system sending the request for payment
  • my system turning that request into a request to the payment gateway
  • sending the user to the gateway
  • waiting for the user to perform payment
  • the user returning back to my system but being held until the system receives notification of success/failure
  • sending the user back to the external system, depending on success or failure

Even more difficult was the fact that, when delivery of the notification fails, the gateway retries it every 15 minutes for a number of hours.

I solved it using a database record of pending transactions, detecting success and failure from the user's return, plus a timed-delay listener for the notification and transaction handling…

Reasonably difficult!

But this must have been solved a gazillion times before so what is the best practice?

I can see my future is going to be writing the handling between all of these systems and managing the time delays and possible network failures so I want to be following the best practices.

Book / article recommendations would be great.

Thanks in advance!

Best Answer

When building distributed systems, the difference between a 'synchronous' system and an 'asynchronous' system is this: a synchronous system has known upper bounds on computation and message delivery times, and an asynchronous system does not. So you have an asynchronous system in which certain events have no known upper bounds. How do you handle it?

  1. If these asynchronous processes have probabilistic upper bounds, then you can use timeouts to make your system act like a partially synchronous system. If the payment gateway's 98th percentile response time is 5 seconds, then with a 5 second timeout about 98% of your requests will complete in time and the other 2% will simply be treated as failures. This means that you now have a known upper bound on how long this process will take to either succeed or fail. This probabilistic failure detection is a critical tool for making asynchronous systems behave like synchronous ones.

  2. Keep a durable record of these events so that you can recover your system state in the event of system failure. If your payment gateway handler is keeping these events in volatile memory and it crashes then you're screwed.

  3. Each complex transaction is essentially a series of state transformations based on the sending and receiving of messages (events) within the system. It sounds like you are informally modeling this using your "record of pending transactions" but I would suggest that you go further: For each transaction you need to manage, create a formal state machine that describes it and keep a durable record of its current state. You will find that these state machines are easy to understand, easy to test, and give much-needed visibility into these processes both for you and your users.
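Points 2 and 3 can be sketched together. This is only a minimal illustration, not the asker's actual design: the state names, event names, table layout, and the helpers `open_store`, `create_txn`, and `apply_event` are all hypothetical, and SQLite stands in for whatever durable store you use. Treating any event that arrives after a terminal state as a no-op is also an assumption; it is one way to absorb the gateway's repeated notifications.

```python
import sqlite3

# Hypothetical states and events for one payment transaction.
TRANSITIONS = {
    ("CREATED", "sent_to_gateway"): "AWAITING_NOTIFICATION",
    ("AWAITING_NOTIFICATION", "notified_success"): "SUCCEEDED",
    ("AWAITING_NOTIFICATION", "notified_failure"): "FAILED",
    ("AWAITING_NOTIFICATION", "timed_out"): "FAILED",
}
TERMINAL = {"SUCCEEDED", "FAILED"}


def open_store(path):
    """Open (or create) the durable store for states and the event log."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS txn (id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS txn_event ("
        "txn_id TEXT NOT NULL, event TEXT NOT NULL, new_state TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def create_txn(conn, txn_id):
    """Record a new transaction in its initial state."""
    with conn:
        conn.execute("INSERT INTO txn (id, state) VALUES (?, 'CREATED')", (txn_id,))


def apply_event(conn, txn_id, event):
    """Apply one event, persist the new state, and return the resulting state."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT state FROM txn WHERE id = ?", (txn_id,)
        ).fetchone()
        if row is None:
            raise KeyError(txn_id)
        state = row[0]
        if state in TERMINAL:
            # Assumption: late or repeated events after completion are ignored.
            return state
        new_state = TRANSITIONS.get((state, event))
        if new_state is None:
            raise ValueError(f"event {event!r} not allowed in state {state!r}")
        conn.execute("UPDATE txn SET state = ? WHERE id = ?", (new_state, txn_id))
        conn.execute(
            "INSERT INTO txn_event VALUES (?, ?, ?)", (txn_id, event, new_state)
        )
        return new_state
```

Because the current state lives in the database rather than in memory, a restarted handler can pick up each transaction where it left off, and the `txn_event` table doubles as an audit trail of what happened and in what order.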

The more asynchronous your system, the more formal and explicit you need to be when managing these complex evented state transformations. Timeouts, durable event logging, and state machines are the best practice here. This is why Erlang OTP bases much of its application behavior on the state machine model, for instance.
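As a rough sketch of the timeout idea from point 1: pick a timeout from observed response times, then treat anything slower as a failure. The helper names `nearest_rank_percentile` and `call_with_timeout` are hypothetical, and the nearest-rank percentile is just one simple choice of estimator.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def nearest_rank_percentile(samples, pct):
    """Return the pct-th percentile (integer pct, 1..100) by nearest rank."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def call_with_timeout(fn, timeout_s):
    """Run fn in a worker thread; report failure if it exceeds timeout_s."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return ("succeeded", future.result(timeout=timeout_s))
    except FutureTimeout:
        # The underlying call is not cancelled and may still finish later,
        # so the caller must tolerate a late result arriving anyway.
        return ("failed", None)
    finally:
        executor.shutdown(wait=False)  # do not block on the slow call


def slow_gateway():
    """Hypothetical stand-in for a gateway request that takes too long."""
    time.sleep(0.3)
    return "paid"
```

With this, `call_with_timeout(slow_gateway, 0.05)` gives up after roughly 50 ms instead of waiting indefinitely. The important caveat is in the comment: a timeout bounds how long you wait, not whether the remote side eventually acts, which is why it pairs with a durable record of the transaction's state.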

For reference, I haven't found anything better than Introduction to Reliable and Secure Distributed Programming. It will give you a strong algorithmic basis for understanding both synchronous and asynchronous systems from first principles.
