“Message queue service not available” in Windows Failover Cluster

failovercluster msmq windows-cluster

I am debugging on a site where our application runs on a 3-node failover cluster with an MSMQ cluster group for message queueing.
We are seeing that the system works on some combinations of nodes, but not all, so the failover protection is weaker than intended.

The issue is with receiving messages from the clustered queue.

When our application runs on cluster node B or C, it works regardless of which node MSMQ is running on (works = our application receives messages). When our application runs on node A, it fails with "Message queue service not available", regardless of where MSMQ is running.

To confuse things even more, I created a small WCF-MQ-proxy service with a GUI client, that allows me to send a command to the service, which will then send to or receive from a message queue as specified by the client – and give as much feedback as possible in the process. The pattern is the same with this app, except the node where it fails is node C – regardless of where MSMQ is running.

Here are some of the things I have checked:

  • The service (our app) runs under the same domain user accounts on all 3 nodes.
  • The app config file contains the same path to the message queue.
  • The queue access rights: everyone has full control.
  • The local MSMQ service is running on all nodes and I made sure the local queues are not named the same as the clustered ones.
  • Firewall is disabled on all nodes.
  • Node A is different from B and C in that it has an extra network connection on the same subnet as the cluster network. So when I ping it from node B, it responds on the "wrong" interface. Not sure if it matters, but it's a bit strange.
  • The service option "Use network name for computer name" does not seem to change anything. My proxy service reports its perceived machine name: on node A it always returns the cluster group name, and on nodes B and C it always returns the node name.
  • The MSMQ cluster group uses a shared iSCSI drive for storage.
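The "perceived machine name" check from the list above can be approximated with a few lines of Python run on each node, ideally under the same account and service context as the application. This is only a rough sketch: the function name `name_report` and the names `NODEA` and `MSMQGROUP` in the note below are made up for illustration, and it makes no claim about which of these sources MSMQ itself consults.

```python
import os
import platform
import socket


def name_report(environ, hostname, node):
    """Collect the machine names a process sees and flag disagreement."""
    names = {
        "COMPUTERNAME": environ.get("COMPUTERNAME"),
        "socket.gethostname": hostname,
        "platform.node": node,
    }
    # Compare case-insensitively on the short (non-FQDN) part,
    # ignoring sources that returned nothing.
    short = {value.split(".")[0].lower() for value in names.values() if value}
    names["consistent"] = len(short) <= 1
    return names


if __name__ == "__main__":
    report = name_report(os.environ, socket.gethostname(), platform.node())
    for key, value in report.items():
        print(f"{key}: {value}")
```

If one source reports a hypothetical node name such as `NODEA` while another reports a cluster group name such as `MSMQGROUP`, the `consistent` flag comes back `False`, which is a hint that something is overriding the machine name for that process.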

I am just a developer, not a Microsoft infrastructure expert by any means, so I'd like to ask: what are the recommended steps to take when debugging a clustered MSMQ setup like this?

Best Answer

OK, so after several weeks of debugging this on my own and together with Microsoft's Message Queue support team, a solution has been found.

TL;DR: the solution is to remove or rename the registry key

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\<SERVICENAME>\Environment

The error occurs because the MQ client cannot find an MQ service on the local system, which it needs in order to communicate with a remote MQ (much like a local SMTP service that forwards your emails to remote systems). In this case, however, the "local system" is not the cluster node but the cluster group, and there is no MQ service running on the cluster group, because it is not a real system, just an alias.

The MQ client looks for a service on the cluster group because the "Use network name for computer name" checkbox had been checked in the cluster service settings. Checking it adds a new value to the cluster nodes' registry that sets the environment for the service. The real issue is that unchecking the checkbox does not remove the value from the registry, which makes it impossible to clear the setting properly from the GUI once it has been set. So the fix is to delete the value manually with regedit or regedt32.
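To see whether a node still carries the leftover setting before editing anything, the registry can be read (not modified) with Python's standard `winreg` module. This is a minimal sketch under two assumptions that are not confirmed above: that `Environment` is stored as a multi-string value under the service's key, and that the cluster-related entries start with the prefix `_CLUSTER_NETWORK_`. The function names are hypothetical, and the service name is taken from the command line.

```python
import sys

# Assumption: entries written by the cluster setting start with this prefix.
CLUSTER_PREFIX = "_CLUSTER_NETWORK_"


def cluster_overrides(entries):
    """Return the NAME=value entries whose name starts with CLUSTER_PREFIX."""
    found = {}
    for entry in entries:
        name, sep, value = entry.partition("=")
        if sep and name.upper().startswith(CLUSTER_PREFIX):
            found[name] = value
    return found


def read_service_environment(service_name):
    """Read the service's Environment value; returns [] if it is absent."""
    import winreg  # Windows only; imported here so the parser works anywhere

    path = rf"SYSTEM\CurrentControlSet\Services\{service_name}"
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
            value, _value_type = winreg.QueryValueEx(key, "Environment")
    except FileNotFoundError:
        return []
    # A multi-string value comes back as a list of strings.
    return list(value) if isinstance(value, list) else [str(value)]


if __name__ == "__main__" and sys.platform == "win32" and len(sys.argv) > 1:
    overrides = cluster_overrides(read_service_environment(sys.argv[1]))
    print(overrides or "no cluster network-name entries found")
```

Running it on each node with the application's service name as the argument shows which nodes still have the entries. The actual removal is still best done by hand as described above, after exporting the key as a backup.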