Debian – How to get DRBD nodes out of Connection State StandAlone (and WFConnection)

debian drbd linux-ha

My Debian 8.9 setup with DRBD 8.4.3 has somehow got into a state where the two nodes can no longer connect over the network. They should replicate a single resource r1, but immediately after drbdadm down r1; drbdadm up r1 on both nodes, /proc/drbd describes the situation as follows:

On the 1st node (the connection state is either WFConnection or StandAlone):

1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
   ns:0 nr:0 dw:0 dr:912 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:20

On the 2nd node:

1: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown   r-----
   ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:48

The two nodes can ping each other at the IP addresses given in /etc/drbd.d/r1.res, and netstat shows that both are listening on the configured port.

How can I (further diagnose and) get out of this situation so that the two nodes can become Connected and replicate over DRBD again?

BTW, at a higher level this problem currently manifests itself as systemctl start drbd never returning, apparently because it gets stuck in drbdadm wait-connect all (as suggested by /lib/systemd/system/drbd.service).

Best Answer

The situation was apparently caused by a case of split-brain.

I had not noticed this because I had only inspected the recent journal entries for drbd.service (sudo journalctl -u drbd), whereas the problem had been reported slightly earlier, in the kernel log messages (sudo journalctl | grep Split-Brain).
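The grep above is case-sensitive and scans every journal entry. A minimal sketch of a slightly more forgiving filter follows; find_split_brain is a hypothetical helper name, and the pattern is an assumption about how the message is spelled rather than an exhaustive list of DRBD log formats.

```shell
# Hypothetical helper: keep only lines on stdin that mention a split brain,
# ignoring case and accepting "split-brain", "split brain" or "splitbrain".
find_split_brain() {
  grep -i -E 'split[- ]?brain'
}

# Possible usage on a systemd host (kernel messages only):
#   sudo journalctl -k --no-pager | find_split_brain
```

Restricting the search to kernel messages with journalctl -k helps here because DRBD reports connection problems from its kernel module, not from the drbd.service unit.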

With that knowledge, manually resolving the split-brain (as described here or here) got the two nodes connected again. The steps were as follows.

On the split-brain victim, i.e. the node whose changes since the split will be discarded (assuming the DRBD resource is r1):

drbdadm disconnect r1
drbdadm secondary r1
drbdadm connect --discard-my-data r1

On the split-brain survivor:

drbdadm primary r1
drbdadm connect r1
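
To confirm that the nodes have left StandAlone/WFConnection afterwards, the device lines of /proc/drbd can be condensed to the fields that matter. This is only a sketch, assuming the DRBD 8.x /proc/drbd layout shown in the question; drbd_states is a hypothetical helper name.

```shell
# Hypothetical helper: summarise /proc/drbd-style text read from stdin.
# Prints "<minor> <connection state> <roles> <disk states>" per device line.
drbd_states() {
  awk '
    /^ *[0-9]+: cs:/ {
      minor = $1
      sub(/:$/, "", minor)            # "1:" -> "1"
      cs = ""; ro = ""; ds = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^cs:/) cs = substr($i, 4)
        if ($i ~ /^ro:/) ro = substr($i, 4)
        if ($i ~ /^ds:/) ds = substr($i, 4)
      }
      print minor, cs, ro, ds
    }'
}

# Possible usage on a DRBD 8.x node:
#   drbd_states < /proc/drbd
```

Once the resynchronisation has finished, both nodes would be expected to report Connected as the connection state and UpToDate/UpToDate as the disk states.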