C# – Designing a scalable and robust retry mechanism

Tags: c#, design, sql-server

Backstory

I have a Messaging Server Application that is responsible for brokering/proxying calls made from an Application Tier to numerous external services. The purpose of this application is to abstract the technicalities of invoking these services away from the main Application Tier.

This works well, as the Application Server does not have to worry about protocols and transports such as HTTP, LDAP, WCF, FTP, SOAP, ebXML, and so on. It simply sends a "payload" with a few identifiers, and the Messaging Server handles the rest. It also means that if a service definition changes, the Application Server does not need to change. Additionally, the Messaging Server is backed by a SQL Server 2008 database that stores an audit of all messages sent, their associated responses, and so on.

The general data flow architecture of this is as follows:

Application Server(s) => Load Balancer => Messaging Server(s) => [X] => External Services

The Requirement

I need to implement a retry mechanism in the Messaging Application tier. The intention is to gracefully recover from situations where the Messaging Server isn't able to forward to the target service (e.g. service down, network issues, timeouts), i.e. issues at point [X] in the architecture above.

The high-level design requirement is:

The Application Server sends a request to the Messaging Server, which then attempts to forward it to the external service. If the first attempt fails, the Messaging Server responds synchronously to the Application Server, stating that the message is "in retry".

The Messaging Server then proceeds to retry the forwarding per the contract (i.e. X retries with Y seconds between each).

One of two things will happen next: either all contracted retries will have been performed unsuccessfully, or one of the retries will succeed. In either case, a message is sent back to the Application Tier to report the state of the messaging request.
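As a rough illustration of that flow, a first forwarding attempt with a synchronous "in retry" response might look something like the sketch below. All type and member names here are hypothetical and not part of the existing framework; the real implementation would sit behind the existing WCF service.

```csharp
using System;

// Hypothetical retry contract: X attempts with Y between each.
public class RetryContract
{
    public int MaxAttempts { get; set; }      // e.g. 5
    public TimeSpan Interval { get; set; }    // e.g. 12 hours, or 5 seconds
}

public enum ForwardStatus { Delivered, InRetry }

public class ForwardResult
{
    public ForwardStatus Status { get; set; }
    public string MessageId { get; set; }
}

// Hypothetical abstraction over the external service call (HTTP, FTP, SOAP, ...).
public interface IExternalServiceClient { void Send(string payload); }

// Hypothetical persistent store (e.g. a table in the existing SQL 2008 database).
public interface IRetryStore
{
    void Schedule(string messageId, string payload, RetryContract contract, DateTime nextAttemptUtc);
}

public class MessageForwarder
{
    private readonly IExternalServiceClient _client;
    private readonly IRetryStore _retryStore;

    public MessageForwarder(IExternalServiceClient client, IRetryStore retryStore)
    {
        _client = client;
        _retryStore = retryStore;
    }

    public ForwardResult Forward(string messageId, string payload, RetryContract contract)
    {
        try
        {
            // First attempt is made synchronously, inside the original request.
            _client.Send(payload);
            return new ForwardResult { Status = ForwardStatus.Delivered, MessageId = messageId };
        }
        catch (Exception)
        {
            // Persist the message and its contract so any server in the tier can
            // pick it up later, then answer the Application Server synchronously.
            _retryStore.Schedule(messageId, payload, contract, DateTime.UtcNow.Add(contract.Interval));
            return new ForwardResult { Status = ForwardStatus.InRetry, MessageId = messageId };
        }
    }
}
```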

A few gotchas

The message to retry can't be held in memory, because if that Messaging Server goes down the message is lost. Additionally, a retry contract could be 5 retries, one every 12 hours; holding data in memory for that length of time isn't feasible. That said, some retry contracts could be 5 retries, one every 5 seconds.

If the forwarding network goes down and then recovers, the retry load should be spread across the entire Messaging Tier rather than falling on a single server.

The Question

The communication between the Application and Messaging tiers isn't a concern, as that framework is already in place. However, the architecture of the retry framework within the Messaging Tier is still up in the air. How would you implement this?

Options we have/are considering

On failure, store the retry data in a database, then have a polling service that checks the database every second. If a message scheduled for retry is found, it is pushed back onto the Messaging Tier via the Load Balancer (a rough sketch of this approach follows below).

On failure, store the retry data in a database, and use a SQL CLR job to poll the database and push messages scheduled for retry back onto the Load Balancer.
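A minimal sketch of the first option, assuming a hypothetical RetryMessages table and a resubmission interface that posts back through the Load Balancer (none of these names come from the existing system):

```csharp
using System;
using System.Data.SqlClient;

// Hypothetical resubmitter: posts a stored message back through the Load Balancer.
public interface IMessageResubmitter
{
    void Resubmit(string messageId, string payload);
}

// Polls the retry table for due messages and pushes them back onto the Messaging Tier.
public class RetryPoller
{
    private readonly string _connectionString;
    private readonly IMessageResubmitter _resubmitter;

    public RetryPoller(string connectionString, IMessageResubmitter resubmitter)
    {
        _connectionString = connectionString;
        _resubmitter = resubmitter;
    }

    public void PollOnce()
    {
        using (var connection = new SqlConnection(_connectionString))
        {
            connection.Open();

            // Pick up messages whose next attempt is due. With multiple pollers you
            // would also need to claim rows (e.g. an UPDATE with OUTPUT) so two
            // servers don't grab the same message.
            using (var command = new SqlCommand(
                @"SELECT MessageId, Payload FROM RetryMessages
                  WHERE NextAttemptUtc <= @now AND Status = 'Pending'", connection))
            {
                command.Parameters.AddWithValue("@now", DateTime.UtcNow);

                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        _resubmitter.Resubmit(reader.GetString(0), reader.GetString(1));
                    }
                }
            }
        }
    }
}

// Hosted in a Windows service, a simple loop could call PollOnce on an interval:
// while (true) { poller.PollOnce(); Thread.Sleep(TimeSpan.FromSeconds(1)); }
```

Pushing the due messages back through the Load Balancer, rather than retrying on the server that originally failed, is what spreads the post-outage load across the whole tier.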

Other Information

May or may not be relevant:

  • All code is C#
  • Databases are SQL 2008
  • Comms from the Application to the Messaging tier are performed via WCF with BasicHttpBinding.
  • We have complete control over all aspects of the Messaging Server tier and no control over the Application Tier.
  • The Messaging Tier currently handles around 500k transactions per hour, so you can imagine how quickly things will back up if there is a failure on one of the external services.

Best Answer

Consider an exponential retry timeout

In the solution you've provided above, having a retry mechanism fire every second is a naive approach.

You should consider an exponential increase in the retry interval, up to a maximum (decided by the business).

This avoids situations where you spend valuable cycles polling an ever-increasing backlog of messages that will only fail again, slowing down the processing of messages that can be handled immediately.
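As a sketch of what that might look like (the base delay and cap here are placeholders; the real values belong to the business):

```csharp
using System;

public static class RetryBackoff
{
    // Delay before a given attempt: base * 2^(attempt - 1), capped at a maximum.
    // Attempt numbers start at 1.
    public static TimeSpan GetDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        double candidateMs = baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1);
        return TimeSpan.FromMilliseconds(Math.Min(candidateMs, maxDelay.TotalMilliseconds));
    }
}

// Example: base of 5 seconds, capped at 1 hour.
// Attempt 1 -> 5s, 2 -> 10s, 3 -> 20s, ... until the 1 hour cap is reached.
// var delay = RetryBackoff.GetDelay(attempt, TimeSpan.FromSeconds(5), TimeSpan.FromHours(1));
```

Adding a small random jitter to the computed delay can also help spread the post-outage retry load across the Messaging Tier, per the earlier gotcha.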

Poisoned Messages

It is possible for poisoned messages to show up: messages that, for one reason or another, may never be processable. You should have a process in place to identify and handle these messages.
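One way to do this (again a sketch, with a hypothetical retry record and status values) is to cap the attempt count on the persisted record and park anything that exceeds it for alerting and manual handling:

```csharp
using System;

// Hypothetical persisted retry record.
public class RetryRecord
{
    public string MessageId { get; set; }
    public int AttemptCount { get; set; }
    public int MaxAttempts { get; set; }
    public string Status { get; set; }          // e.g. "Pending", "Delivered", "Poisoned"
    public DateTime NextAttemptUtc { get; set; }
}

public static class PoisonHandling
{
    // Called after a failed attempt: either reschedule or park the message.
    public static void RecordFailure(RetryRecord record, TimeSpan nextDelay)
    {
        record.AttemptCount++;

        if (record.AttemptCount >= record.MaxAttempts)
        {
            // All contracted retries exhausted: flag for alerting / manual
            // processing rather than letting it churn in the queue forever.
            record.Status = "Poisoned";
        }
        else
        {
            record.NextAttemptUtc = DateTime.UtcNow.Add(nextDelay);
        }
    }
}
```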

This is a business decision, not an implementation detail

I think the most pertinent question is not what is asked here, but what the business wants to do in this situation. How to actually create a retry polling mechanism is trivial. What the business wants to do when retries fail is not, and is purely a business decision.

Real world example

I wrote a distributed system for my employer a number of years ago (MSMQ, C#, etc.). It included a retry mechanism that retried using an exponential function until it hit a maximum of once per hour.

I had a NAGIOS monitor in place which would poll the queue, detect the number of failed messages, and push an alert if it reached a certain threshold. This would instantly alert the business that a vendor was offline and that clients expecting a turnaround of, let's say, an hour would be affected.

What to do next was a business decision (in this case, the business decided to cancel the queued backlog of messages and handle them manually via the helpdesk), so the application had to be written to allow manual processing of these messages.

In other instances, where the vendor was only offline for a few minutes, the retry mechanism picked the messages up and handled them as normal. Having the exponential timeout allowed the system to gracefully process both regular transactions and the backlog without snowing the server under when the vendor came back online.