Did I beat the CAP Theorem with this master-slaves distributed system (with picture)

data-replicationdistributed computingdistributed-system

I was watching this video about the CAP theorem, where the author explains well the trade-offs of distributed systems. However I disagree with the CAP theorem in the following aspect. Given the picture below:

enter image description here

Whenever there is a partition, in other words, whenever a slave loses its connection to the master, this slave immediately becomes unavailable. So you will say: You are choosing consistency over availability. And I will say NO!. My distributed system is still highly available because there are many other backup/redundant slave nodes that the client can failover to. So I'm keeping my consistency and I'm keeping my availability in the system. A failing slave node is immediately (and automatically) taking offline and the client is redirected to another slave node for reads.

Then you might say: now what happens if the master node dies, or if you have a partition where two master nodes are active? And the answer is simple: Your system must NEVER allow two master nodes to be active. Your system must always have one and only one master node with as many backup master nodes as you want, however all the backup master nodes will be inactive (i.e. not accepting writes and just building redundant state).

The only trade-off of such a system, because nothing is perfect: It will need human intervention for the case of a dying / bad state master, so that the active master can be shutdown by a human and guaranteed to be dead while the operator turns on (manually) one of the backup masters to take over write requests.

I've been thinking for a long time on how to eliminate this human intervention, but I don't think it is possible due to the fact that a machine cannot reliably determine the state of another machine in a distributed system. A human needs to make this decision and manually pull the plug to kill it.

Wouldn't this simple trade-off (human operator for the rare cases when the master is dying) beat the CAP theorem?

Best Answer

whenever a slave loses its connection to the master, this slave immediately becomes unavailable

That is not necessarily true. The CAP argument assumes that when the network is partitioned, there may be clients on both sides of the partition.

...So I'm keeping my consistency and I'm keeping my availability in the system.

The CAP argument also assumes that clients on both sides of the partition want to update the database. If you don't allow them to do that while the partition exists, then the database is not available to all clients for writing. If you do allow them to do that while the network is partitioned, then the database is not consistent because nodes on opposite sides of the partition now have different data.

It's not rocket science.

Your system must NEVER allow two master nodes to be active.

How shall nodes that can't communicate with one another agree upon which one is the master?

If you don't allow updates to a node that can't talk to a master, then you have given up availability again.

It will need human intervention for the case of a dying / bad state master

That would be unacceptable in many of today's large-scale enterprise systems.

A human needs to make this decision and manually pull the plug to kill it.

In the fully general case, maybe so, but if there are any rules that you would write down to guide a new employee in how to make that decision, then you could write those same rules in a computer program that would react much faster than a human operator ever could do.

Related Topic