I am exploring RabbitMQ quorum queues to improve HA for some services in a Kubernetes cluster. From what I have read, they are designed with data safety in mind.
However, the chapter "Managing Replicas" states:
Replicas of a quorum queue are explicitly managed by the operator.
When a new node is added to the cluster, it will host no quorum queue
replicas unless the operator explicitly adds it to a member (replica)
list of a quorum queue or a set of quorum queues.
It therefore seems that, in case of disruptions (especially involuntary ones), the following situation could arise in a 3-node cluster:
- after a disruption, one node goes down: the other two nodes still form a majority and will "keep the queue alive", possibly electing a new leader;
- Kubernetes will provide a new node (pod) to replace the failed one; the new node will automatically rejoin the RabbitMQ cluster, but
- unless the operator manually intervenes, the new node will not host a replica of any existing quorum queue;
- for a 3-node cluster, this means that there is no HA anymore: if, sometime in the future, one of the other two nodes fails, the queue is left without a majority and is effectively lost.
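The arithmetic behind the last point can be sketched as follows. This is only an illustration, assuming the usual Raft-style majority rule (more than half of the queue's configured members must be online); the function names are made up for this example and are not part of any RabbitMQ API.

```python
def majority(members: int) -> int:
    """Smallest number of online replicas that is more than half of `members`."""
    return members // 2 + 1


def has_quorum(members: int, online: int) -> bool:
    """True if enough replicas are online for the queue to make progress."""
    return online >= majority(members)


# The queue was declared with 3 members. A replacement node that has not been
# added to the member list does not count as an online replica.
print(has_quorum(members=3, online=3))  # all replicas up
print(has_quorum(members=3, online=2))  # one replica lost, majority still holds
print(has_quorum(members=3, online=1))  # second replica lost, no majority
```

The member count stays at 3 even while the replacement pod is running, so a second failure leaves 1 of 3 replicas online, which is below the majority of 2.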
Is there any way to mitigate this scenario? Is it, for example, possible to have nodes automatically rejoin all existing quorum queue clusters? Maybe by maintaining a list of "startup commands" (which run after RabbitMQ starts) to which we could add the rejoin commands?
Best Answer
The RabbitMQ team highly recommends using the official Kubernetes operator: https://www.rabbitmq.com/kubernetes/operator/operator-overview.html
Aside from that, here's what the local k8s expert has to say:
As long as the same name and data are used, the "new" node will join just as if it were the old one.
There are probably scenarios that require manual intervention but they aren't as frequent as you'd think.
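For the cases that do need intervention (for example, a replacement that comes up under a different node name), the `rabbitmq-queues grow` command can add a replica of existing quorum queues to a node. Below is a minimal sketch of wrapping it in a script that could run after startup. The node name `rabbit@rabbitmq-server-2` and the helper names are hypothetical, and whether such a script is appropriate depends on your deployment, so treat this as a starting point rather than a recommended procedure.

```python
import shutil
import subprocess
from typing import List


def grow_command(node: str, strategy: str = "all") -> List[str]:
    """Build the CLI call that adds a quorum queue replica on `node`.

    `strategy` is "all" (every matching quorum queue) or "even"
    (only queues that currently have an even number of members).
    """
    if strategy not in ("all", "even"):
        raise ValueError("strategy must be 'all' or 'even'")
    return ["rabbitmq-queues", "grow", node, strategy]


def grow_onto(node: str, strategy: str = "all", dry_run: bool = True) -> List[str]:
    """Run the command, or only return it when dry_run is set or the CLI is missing."""
    cmd = grow_command(node, strategy)
    if dry_run or shutil.which("rabbitmq-queues") is None:
        return cmd
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # Hypothetical node name; replace with the name of the node that rejoined.
    print(" ".join(grow_onto("rabbit@rabbitmq-server-2", dry_run=True)))
```

With `dry_run=True` the script only prints the command, which makes it easy to check what would run before letting it touch the cluster.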
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.