Broken RabbitMQ cluster won't restart

rabbitmq

I run RabbitMQ on 3 servers, all with the same versions: RabbitMQ 3.4.1, Erlang 17.3.

One node crashed on server 2. The two remaining nodes did not stay connected to each other:

server 1:

[CentOS-62-64-minimal ~]$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@CentOS-62-64-minimal' ...
[{nodes,[{disc,['rabbit@CentOS-62-64-minimal',rabbit@de3,rabbit@mysql]}]},
 {running_nodes,['rabbit@CentOS-62-64-minimal']},
 {cluster_name,<<"rabbit@CentOS-62-64-minimal">>},
 {partitions,[]}]

server 3:

[de3 ~]$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@de3 ...
[{nodes,[{disc,['rabbit@CentOS-62-64-minimal',rabbit@de3,rabbit@mysql]}]},
 {running_nodes,[rabbit@de3]},
 {cluster_name,<<"rabbit@CentOS-62-64-minimal">>},
 {partitions,[]}]

After restarting and resetting RabbitMQ on server 3, it finally connected to server 1:

[CentOS-62-64-minimal ~]$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@CentOS-62-64-minimal' ...
[{nodes,[{disc,['rabbit@CentOS-62-64-minimal',rabbit@de3,rabbit@mysql]}]},
 {running_nodes,['rabbit@CentOS-62-64-minimal']},
 {cluster_name,<<"rabbit@CentOS-62-64-minimal">>},
 {partitions,[]}]

Why did the cluster "break" with just 1 node down? Server 3 was working fine, but server 1 was not; it reported "Queue is located on a server that is down".

As for server 2, it did not restart on its own. After a manual restart, I cannot make it rejoin the cluster, even after multiple resets and after removing /var/lib/rabbitmq/mnesia/:

[root@mysql ~]# rabbitmqctl cluster_status
Cluster status of node rabbit@mysql ...
[{nodes,[{disc,[rabbit@mysql]}]},
 {running_nodes,[rabbit@mysql]},
 {cluster_name,<<"rabbit@mysql.domain.com">>},
 {partitions,[]}]

[root@mysql ~]# rabbitmqctl stop_app
Stopping node rabbit@mysql ...
[root@mysql ~]# rabbitmqctl force_reset
Forcefully resetting node rabbit@mysql ...
[root@mysql ~]# rabbitmqctl join_cluster rabbit@CentOS-62-64-minimal
Clustering node rabbit@mysql with 'rabbit@CentOS-62-64-minimal' ...
Error: {ok,already_member}
[root@mysql ~]# rabbitmqctl start_app
Starting node rabbit@mysql ...
[root@mysql ~]# rabbitmqctl cluster_status
Cluster status of node rabbit@mysql ...
[{nodes,[{disc,[rabbit@mysql]}]},
 {running_nodes,[rabbit@mysql]},
 {cluster_name,<<"rabbit@mysql.domain.com">>},
 {partitions,[]}]

I have no idea what went wrong. Last time this happened, I upgraded RabbitMQ and Erlang to the latest version.

Best Answer

I hit this issue today while writing an intentional-break document for a break-fix event meant to teach our operations team how to fix things. I intentionally unclustered a node and was then unable to run rabbitmqctl join_cluster successfully, because the cluster believed the node was already a member.

Clustering node 'rabbit@node-1' with 'rabbit@node-0' ... ...done (already_member).

Ultimately what worked for me was running rabbitmqctl forget_cluster_node rabbit@node-1 from a working clustered node. Once I did that, I was able to successfully run rabbitmqctl join_cluster rabbit@node-0 on node-1.
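The sequence from this answer can be sketched as a pair of shell helpers. This is only a sketch, not the answerer's exact procedure: the function names, the run wrapper, and the DRY_RUN variable are hypothetical, and the reset step is an assumption (the answer itself only mentions forget_cluster_node and join_cluster; the question used force_reset instead). By default the helpers print the commands rather than running them.

```shell
#!/bin/sh
# run: print the command when DRY_RUN is 1 (the default), execute it otherwise.
# Hypothetical helper, not part of RabbitMQ.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on a healthy cluster member: drop the stale record of the broken node.
# $1 = name of the node to forget, e.g. rabbit@node-1
forget_stale_node() {
    run rabbitmqctl forget_cluster_node "$1"
}

# Run on the node that fails with already_member.
# $1 = name of a healthy cluster member, e.g. rabbit@node-0
# Assumption: the local cluster state is wiped with reset before joining.
rejoin_cluster() {
    run rabbitmqctl stop_app &&
    run rabbitmqctl reset &&
    run rabbitmqctl join_cluster "$1" &&
    run rabbitmqctl start_app
}
```

For example, forget_stale_node rabbit@node-1 would be called on node-0, then rejoin_cluster rabbit@node-0 on node-1. Setting DRY_RUN=0 makes the helpers actually execute the commands.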

Related Solutions

Make RabbitMQ Listen Only to Localhost

Putting the following in /etc/rabbitmq/rabbitmq-env.conf will make RabbitMQ and epmd listen on only localhost:

export RABBITMQ_NODENAME=rabbit@localhost
export RABBITMQ_NODE_IP_ADDRESS=127.0.0.1
export ERL_EPMD_ADDRESS=127.0.0.1

It takes a bit more work to configure Erlang to only use localhost for the higher numbered port (which is used for clustering nodes as far as I can tell). If you don't care about clustering and just want Rabbit to be run fully locally then you can pass Erlang a kernel option for it to only use the loopback interface.

To do so, create a new file in /etc/rabbitmq/; I'll call it rabbit.config. In this file we'll put the Erlang option that needs to be loaded at startup.

[{kernel,[{inet_dist_use_interface,{127,0,0,1}}]}].

If you're using the management plugin and also want to limit that to localhost, you'll need to configure its ports separately, making the rabbit.config include this:

[
  {rabbitmq_management, [
    {listener, [{port, 15672}, {ip, "127.0.0.1"}]}
  ]},
  {kernel, [
    {inet_dist_use_interface, {127,0,0,1}}
  ]}
].

(Note: RabbitMQ leaves epmd running when it shuts down, so if you want to block off Erlang's clustering port, you will need to restart epmd separately from Rabbit.)

Next we need to have RabbitMQ load this at startup. Open up /etc/rabbitmq/rabbitmq-env.conf again and put the following at the top:

export RABBITMQ_CONFIG_FILE="/etc/rabbitmq/rabbit"

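This loads that config file when the RabbitMQ server starts and passes the options to Erlang. Note that the path is given without the .config extension; RabbitMQ versions of this era append it automatically.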
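For the new settings to take effect the broker has to be restarted, and since epmd outlives the broker it has to be stopped in between. A rough sketch, assuming a SysV-style service script named rabbitmq-server and epmd on the PATH (both are assumptions; adjust for your init system and Erlang install). The run wrapper and DRY_RUN variable are hypothetical and only print the commands by default.

```shell
#!/bin/sh
# run: print the command when DRY_RUN is 1 (the default), execute it otherwise.
# Hypothetical helper, not part of RabbitMQ or Erlang.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop the broker, kill the leftover epmd, then start the broker again
# so that both pick up the loopback-only settings.
restart_broker_and_epmd() {
    run service rabbitmq-server stop &&
    run epmd -kill &&
    run service rabbitmq-server start
}
```

Calling restart_broker_and_epmd with DRY_RUN=0 as root would run the three commands in order and stop at the first failure.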

You should now have all Erlang/RabbitMQ processes listening only on localhost! This can be checked with netstat -ntlap
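To make the netstat check less error-prone, the output can be filtered for Erlang-related listeners that are still bound to a non-loopback address. A minimal sketch, assuming the usual Linux net-tools column layout (local address in column 4, state in column 6, PID/program in column 7, which requires running netstat as root); the function name is made up for illustration.

```shell
#!/bin/sh
# Reads `netstat -ntlp` style output on stdin and prints every listening
# socket owned by beam or epmd that is NOT bound to a loopback address.
non_local_erlang_listeners() {
    awk '$6 == "LISTEN" && $7 ~ /beam|epmd/ &&
         $4 !~ /^127\./ && $4 !~ /^::1:/ { print $4, $7 }'
}
```

Usage would look like sudo netstat -ntlp | non_local_erlang_listeners. Empty output suggests every Erlang/RabbitMQ listener is bound to localhost; any line printed names a socket that is still reachable from outside.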

EDIT: In older versions of RabbitMQ, the configuration file is /etc/rabbitmq/rabbitmq.conf. However, this file has been replaced by the rabbitmq-env.conf file.

How to restart rabbitmq after switching machines

I got some very good help from the rabbitmq-discuss list:

The database RabbitMQ uses is bound to the machine's hostname, so if you copied the database dir to another machine, it won't work. If this is the case, you have to set up a machine with the same hostname as before and transfer any outstanding messages to the new machine. If there's nothing important in rabbit, you could just clear everything by removing the RabbitMQ files in /var/lib/rabbitmq.

I deleted everything in /var/lib/rabbitmq/mnesia/rabbit/ and it started up without trouble. Hooray!
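Since deleting the database directory discards every queue and message, a more cautious variant is to move it aside so it can be restored later. A minimal sketch, assuming the broker is stopped first; the function name and the backup naming scheme are made up for illustration.

```shell
#!/bin/sh
# Move a RabbitMQ data directory aside instead of deleting it.
# $1 = directory to move (defaults to /var/lib/rabbitmq/mnesia)
# Prints the path of the backup on success.
set_aside_mnesia_dir() {
    dir=${1:-/var/lib/rabbitmq/mnesia}
    dir=${dir%/}                      # drop a trailing slash, if any
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    backup="$dir.bak.$(date +%Y%m%d%H%M%S)"
    mv "$dir" "$backup" && echo "$backup"
}
```

After running set_aside_mnesia_dir as root, starting the broker again creates a fresh, empty database, while the old one stays available under the printed backup path.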
