Linux – DRBD Dual Primary Revisited

drbd linux

It is long-established wisdom that you cannot use a non-cluster-aware file system like ext4 on Linux with DRBD in dual-primary mode.

For example, as stated by Linbit in their manual "Dual Primary – think twice":

"DRBD replicates the changes from node A to node B and the other way around.
It changes the contents of the physical storage device. However - as DRBD resides 
under the mentioned Ext4 filesystems, the filesystem on the physical disk of 
node A does not notice the changes coming from node B (and vice versa). 
This process is called a concurrent write. Starting from now, the actual content 
of the storage device differs from what the filesystem there thinks it should be. 
The filesystem is corrupt."

My question is: why is this?

Because if the file system's metadata is stored on the same DRBD device, any change like the one described above would be replicated between the two DRBD nodes as well, so the file systems on both ends (which consist of data plus metadata, don't they?) should be fully in sync.
True, what node 1 wrote has been overwritten by node 2, but if I issue a "dir" command on node 1, I would simply see a different file from the one node 1 just copied. The same thing happens on simple shared folders such as Windows CIFS shares, and it does not render the file system corrupt.

So where is the problem? Why is everyone saying the file system will be corrupt? Does it mean that ext4 does NOT store its metadata on the device itself but somewhere else, such as in the root file system? From what I have read about ext4 internals, this is not the case (though I admit I haven't studied ext4 in great depth).

But it should be more or less like this:

Node1 writes a new file to block 34098 (and updates the directory entry as well):

Node 1
 - Directory Entry: /data/myfile1.txt  34098
 -----> block 34098 contains: myfile1.txt

At the "same time", Node2 writes the following to block 34098. It can never be truly simultaneous, so assume it happens just after DRBD has completed the sync above.

Node2
 - Directory Entry: /data/other.txt  34098
 -----> block 34098 contains: other.txt

DRBD should now sync both the directory entry and block 34098 back to node1.

Along with writing the file "other.txt" to block 34098, the file system on node2 will also update the block containing the directory entry (which is just another file) pointing to block 34098. So everything should always stay in sync, shouldn't it?

Best Answer

The kernel keeps an in-memory image of the state it believes the file system is in, and it does not re-check the disk to see whether that state has changed. It has no reason to: with a non-cluster file system, only the local kernel is allowed to modify the file system, so it already knows about every change. If you make changes from the second node, the on-disk structures no longer match what the first kernel expects, and data loss is nearly guaranteed.
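
To make this concrete, here is a toy simulation of the scenario from the question. It is only a sketch: `SharedDisk`, `Node`, the block numbers, and the three-entry free list are all made up for illustration and have nothing to do with how ext4 actually lays out its metadata. The one property it tries to model is that each node trusts its own cached metadata and never re-reads it from the replicated device.

```python
class SharedDisk:
    """Stand-in for the replicated DRBD device: both nodes see the same blocks."""
    def __init__(self):
        self.data = {}        # block number -> contents
        self.dir_block = {}   # on-disk directory: file name -> block number


class Node:
    """Toy file system driver that caches metadata in memory (hypothetical)."""
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk
        # Loaded once at "mount" time and never re-read from the disk.
        self.free_blocks = [100, 101, 102]
        self.directory = {}   # cached directory: file name -> block number

    def create(self, filename, contents):
        block = self.free_blocks.pop(0)        # trusts the cached free list
        self.directory[filename] = block       # updates the cached directory
        self.disk.data[block] = contents       # data write, replicated by DRBD
        # Metadata write-back: flushes this node's cached view over the disk.
        self.disk.dir_block = dict(self.directory)

    def read(self, filename):
        return self.disk.data[self.directory[filename]]


disk = SharedDisk()
node1 = Node("node1", disk)
node2 = Node("node2", disk)

node1.create("myfile1.txt", "contents of myfile1")
node2.create("other.txt", "contents of other")

# node1 still believes myfile1.txt lives in block 100, but node2 reused it.
print(node1.read("myfile1.txt"))   # contents of other
# node2's write-back replaced the on-disk directory with its own view.
print(disk.dir_block)              # {'other.txt': 100}
```

DRBD did its job perfectly here: every write reached both sides. The damage comes from the two cached views, each of which was correct when loaded and each of which silently overwrote the other.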

Cluster-aware file systems add a lot of synchronization and checking to avoid all kinds of problems, so making ext4 cluster-capable, for example, is not as simple as having the kernel re-read the file system before every operation.
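
A rough sketch of why re-reading alone is not enough: even if a node fetched fresh metadata before every operation, another node could still write between that read and the following write. The classes and the `plan`/`commit` split below are hypothetical, and `threading.Lock` is only a local stand-in for the cluster-wide locking that a cluster file system would have to provide across machines.

```python
import threading


class SharedDisk:
    """Toy replicated device holding both metadata and data (hypothetical)."""
    def __init__(self):
        self.free_blocks = [100, 101, 102]   # on-disk allocation metadata
        self.directory = {}                  # on-disk directory: name -> block
        self.data = {}                       # block number -> contents


class RereadingNode:
    """Hypothetical node that re-reads metadata before every operation."""
    def __init__(self, disk):
        self.disk = disk

    def plan(self, filename):
        # Step 1: read fresh metadata and pick the first free block.
        return filename, self.disk.free_blocks[0]

    def commit(self, plan, contents):
        # Step 2: write data and metadata based on the earlier read.
        filename, block = plan
        if block in self.disk.free_blocks:
            self.disk.free_blocks.remove(block)
        self.disk.directory[filename] = block
        self.disk.data[block] = contents


# Without locking: both nodes read before either one writes.
disk = SharedDisk()
n1, n2 = RereadingNode(disk), RereadingNode(disk)
p1 = n1.plan("myfile1.txt")
p2 = n2.plan("other.txt")
n1.commit(p1, "contents of myfile1")
n2.commit(p2, "contents of other")
unlocked_directory = dict(disk.directory)

# With a cluster-wide lock, read and write become one atomic step.
cluster_lock = threading.Lock()   # stand-in for a distributed lock


def locked_create(node, filename, contents):
    with cluster_lock:
        node.commit(node.plan(filename), contents)


disk2 = SharedDisk()
m1, m2 = RereadingNode(disk2), RereadingNode(disk2)
locked_create(m1, "myfile1.txt", "contents of myfile1")
locked_create(m2, "other.txt", "contents of other")
locked_directory = dict(disk2.directory)

print(unlocked_directory)   # {'myfile1.txt': 100, 'other.txt': 100}
print(locked_directory)     # {'myfile1.txt': 100, 'other.txt': 101}
```

In the unlocked run both files end up pointing at block 100 even though neither node used stale cached data. That coordination between read and write is the part ext4 has no mechanism for.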