Java – Retry design for high volume


I have a Java system that uses ActiveMQ for messaging. It processes about 400 to 600 transactions a second, and we have no issues when everything runs smoothly. The system also has to forward these transactions to an external system.

When the external system is down for a prolonged time (say an hour or two), we drop the messages that could not be sent during the outage onto a dedicated queue (which we call the Retry queue).

We need to reprocess these messages in a timely manner while still giving the external system ample time to recover.

We tried several approaches, and none works perfectly. Most of them hold up only when we deal with a smaller number of messages.

Approach #1:
We used ActiveMQ's scheduled delivery, where we set the delay in the message's JMS headers (see http://activemq.apache.org/delay-and-schedule-message-delivery.html for details), and it worked when there were a few hundred or a few thousand messages in the queue.
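
For reference, a minimal sketch of how we scheduled the delay. The broker URL, queue name, and payload handling are simplified placeholders, and the feature requires schedulerSupport="true" on the broker in activemq.xml:

```java
import javax.jms.BytesMessage;
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Session;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.ScheduledMessage;

public class RetryQueuePublisher {

    private static final long RETRY_DELAY_MS = 5 * 60 * 1000L; // 5 minutes

    public void scheduleRetry(byte[] payload) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer =
                    session.createProducer(session.createQueue("Retry"));

            BytesMessage message = session.createBytesMessage();
            message.writeBytes(payload);

            // The broker holds the message for the given delay before
            // making it visible to consumers again.
            message.setLongProperty(ScheduledMessage.AMQ_SCHEDULED_DELAY, RETRY_DELAY_MS);
            producer.send(message);
        } finally {
            connection.close();
        }
    }
}
```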

We found message loss when there were 500k or more messages: messages disappeared mysteriously, without leaving us any clue.

For example, we saw messages disappear even with just 20k messages in the queue.

We set the delay to 5 minutes so that messages are retried up to 12 times in an hour.
When the external system was down for an hour, we expected all 20k messages to be retried at least 12 times.

What we observed, consuming every 5 minutes, was:

Attempt 1: 20k messages
Attempt 2: 20k messages

Attempt 7: 19987 messages
Attempt 10: 19960 messages
Attempt 12: 19957 messages

Sometimes all 20k messages were processed, but the test results were inconsistent.

Approach #2:

We used ActiveMQ's redelivery policy: we set the policy at the connection-factory level, made the session transacted, and threw an exception when the external system was down, so that the broker kept redelivering messages according to the redelivery policy configuration.
This approach also didn't work well when the outage lasted longer, and we needed non-blocking consumers. Redelivery works at the dispatch-queue level itself, straining the queue when there are a lot of incoming transactions.
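
A sketch of the kind of connection-factory configuration we used, with plain ActiveMQ APIs (the URL and the delay/retry values are illustrative):

```java
import javax.jms.ConnectionFactory;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.RedeliveryPolicy;

public class RedeliveryConfig {

    public static ConnectionFactory retryingConnectionFactory() {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");

        RedeliveryPolicy policy = factory.getRedeliveryPolicy();
        policy.setInitialRedeliveryDelay(5 * 60 * 1000L); // first retry after 5 minutes
        policy.setRedeliveryDelay(5 * 60 * 1000L);        // then every 5 minutes
        policy.setUseExponentialBackOff(false);
        policy.setMaximumRedeliveries(12);                // give up after ~1 hour

        // Consumers must use transacted sessions; a rollback (e.g. an
        // exception thrown from the listener) triggers redelivery per this
        // policy, blocking that consumer in the meantime.
        return factory;
    }
}
```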

Approach #3:

We used a Quartz scheduler that wakes up every X minutes, creates connections and consumers to fetch messages from the Retry queue, tries to process them, and, if the external system is still down, puts the failed messages back at the end of the queue. This approach has a lot of issues, chief among them that it forced us to manage connections, consumers, and so on ourselves.

For example, when there were only a couple of messages in the queue and more consumers than messages, a message would be picked up by one consumer, dropped back into the Retry queue by that same consumer (as the external system was still down), and immediately picked up by another consumer, so messages just traveled back and forth between consumers and the broker.
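
To illustrate, a simplified sketch of the kind of Quartz job we ran. The broker URL and queue name are placeholders, and trySend stands in for the real call to the external system:

```java
import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

public class RetryQueueDrainJob implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        try {
            Connection connection =
                    new ActiveMQConnectionFactory("tcp://localhost:61616").createConnection();
            connection.start();
            try {
                Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
                Queue retryQueue = session.createQueue("Retry");
                MessageConsumer consumer = session.createConsumer(retryQueue);
                MessageProducer producer = session.createProducer(retryQueue);

                // Bound one run so messages we requeue are not immediately
                // re-read by this same job execution.
                for (int i = 0; i < 10_000; i++) {
                    Message message = consumer.receive(1000);
                    if (message == null) {
                        break; // queue drained
                    }
                    if (!trySend(message)) {
                        producer.send(message); // still down: back of the queue
                    }
                    session.commit(); // consume + (re)send atomically
                }
            } finally {
                connection.close();
            }
        } catch (Exception e) {
            throw new JobExecutionException(e);
        }
    }

    // Placeholder for the real external-system call.
    private boolean trySend(Message message) {
        return false;
    }
}
```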

Approach #4:

We also tried storing the failed messages in a database and having the Quartz scheduler run every X minutes to pick the messages up from the DB.

This is not well optimized either, as it involves a lot of transaction coordination between the DB consumer(s) running across multiple nodes and the database.
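
A sketch of what this looked like, assuming an illustrative retry_message table with id, payload, and next_attempt_at columns; the row locking is what generates the cross-node transaction traffic:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DbRetryPoller {

    public void pollOnce() throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/retrydb", "user", "password")) {
            conn.setAutoCommit(false);

            // FOR UPDATE locks the claimed rows until commit, so consumers on
            // other nodes block on or skip them; this locking is the overhead.
            try (PreparedStatement select = conn.prepareStatement(
                    "SELECT id, payload FROM retry_message"
                  + " WHERE next_attempt_at <= NOW()"
                  + " ORDER BY next_attempt_at LIMIT 500 FOR UPDATE");
                 ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    if (trySend(rs.getBytes("payload"))) {
                        execute(conn, "DELETE FROM retry_message WHERE id = ?", id);
                    } else {
                        execute(conn, "UPDATE retry_message SET next_attempt_at ="
                                + " NOW() + INTERVAL 5 MINUTE WHERE id = ?", id);
                    }
                }
            }
            conn.commit();
        }
    }

    private void execute(Connection conn, String sql, long id) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, id);
            ps.executeUpdate();
        }
    }

    // Placeholder for the real external-system call.
    private boolean trySend(byte[] payload) {
        return false;
    }
}
```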

My environment is Java, JBoss, ActiveMQ 5.9, MySQL 5.6 and Spring 3.2.

I went through several other approaches as well, such as RetryTemplate (from Spring) and the asynchronous retry pattern with Java 7/8.
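
For completeness, a minimal RetryTemplate sketch, assuming the standalone spring-retry library (values are illustrative). Its drawback at this volume is that all the attempts happen inside one blocking call, holding the thread through every backoff period:

```java
import org.springframework.retry.RetryCallback;
import org.springframework.retry.RetryContext;
import org.springframework.retry.backoff.FixedBackOffPolicy;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

public class SpringRetrySender {

    public void sendWithRetry(final byte[] payload) throws Exception {
        RetryTemplate template = new RetryTemplate();

        SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy();
        retryPolicy.setMaxAttempts(12);
        template.setRetryPolicy(retryPolicy);

        FixedBackOffPolicy backOff = new FixedBackOffPolicy();
        backOff.setBackOffPeriod(5 * 60 * 1000L); // 5 minutes between attempts
        template.setBackOffPolicy(backOff);

        // All 12 attempts happen inside this call, blocking the thread.
        template.execute(new RetryCallback<Void, Exception>() {
            @Override
            public Void doWithRetry(RetryContext context) throws Exception {
                trySendOrThrow(payload);
                return null;
            }
        });
    }

    // Placeholder for the real external-system call; throws while it is down.
    private void trySendOrThrow(byte[] payload) throws Exception {
    }
}
```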

My take on the issue is that most solutions work under minimal load and seem to break when the outage lasts longer or when the message volume is really high.

I am looking for something that lets me store and forward the failed messages.
For a 400 TPS system, an hour-long outage means 400 × 3600 = 1.44 million messages.

If the external system is down, how do I process these 1.44 million messages, giving each message an equal chance to be retried, without losing messages or performance?

I am looking for a solution within the scope of the environment I have.

Best Answer

The issue here is throttling. When the external system comes back up, the application needs to be designed so that neither the publisher nor the consumer is overwhelmed.

You could get clever with your algorithm. If you have the ability to classify messages by priority, the failed messages could be saved with a lower priority. Then, after the publisher has published a new message, it can look into the lower-priority queue, check whether any failed messages need republishing, and republish them.

This is one well-known approach to throttling messages. I am sure there are other throttling algorithms that could be applied here, based on your specific needs.
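
A sketch of this interleaving idea using plain JMS (the queue names and the one-to-one ratio are assumptions; the ratio is the throttle you would tune):

```java
import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

import org.apache.activemq.ActiveMQConnectionFactory;

public class ThrottledPublisher {

    private final MessageProducer mainProducer;
    private final MessageConsumer retryConsumer;

    public ThrottledPublisher(String brokerUrl) throws Exception {
        Connection connection =
                new ActiveMQConnectionFactory(brokerUrl).createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        mainProducer = session.createProducer(session.createQueue("Outbound"));
        retryConsumer = session.createConsumer(session.createQueue("Retry"));
    }

    public void publish(Message newMessage) throws Exception {
        mainProducer.send(newMessage);

        // Interleave at most one retried message per new message, so the
        // backlog drains without flooding the recovering external system.
        Message retried = retryConsumer.receiveNoWait();
        if (retried != null) {
            mainProducer.send(retried);
        }
    }
}
```

With a 1:1 ratio like this, the recovering system never sees more than roughly twice the live publish rate, and the retry backlog drains only as fast as new traffic allows.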
