Extremely slow DRBD resync rate on a dedicated gigabit link

drbd

I've set up DRBD on 2 nodes and started using it yesterday. After about an hour, it had resynced 50% of the partition. Another 12 hours have passed, it's up to 79%, and it's moving VERY slowly.

Here's what cat /proc/drbd shows:

 1: cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate C r-----
    ns:464931976 nr:191087032 dw:656013660 dr:214780588 al:100703 bm:21100 lo:7 pe:0 ua:0 ap:7 ep:1 wo:f oos:92241852
        [==============>.....] sync'ed: 79.2% (90076/431396)M
        finish: 76:13:38 speed: 332 (8,680) want: 19,480 K/sec

I looked at network traffic, and I am using between 1M and 20M on a 1G interface. I tried running iperf while all this is going on and got a 930M reading. I tried adjusting the syncer rate to 10M, 50M, and 500M to no avail, and tweaked the packet size without luck too.
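For reference, since this is 8.4 (see EDIT 3), the old syncer { rate; } setting has moved to the disk section as resync-rate, and the dynamic resync controller can override a fixed rate unless it is disabled. A minimal sketch, assuming the resource is named r0 and that a fixed rate is actually what's wanted:

# /etc/drbd.d/r0.res -- resource name r0 is an assumption
resource r0 {
  disk {
    c-plan-ahead 0;      # disable the dynamic resync controller so the fixed rate applies
    resync-rate  100M;   # 8.4 replacement for syncer { rate 100M; }
  }
}

The same can be tried temporarily at runtime with drbdadm disk-options --c-plan-ahead=0 --resync-rate=100M r0.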

Now, the caveat, as you can see from the status, is that my primary node is inconsistent. So I assume the OS is essentially working off the secondary node while the resync is going on. But given how low the throughput is, I don't understand why the sync isn't faster.

Any ideas on what I can try next? An estimated finish of 76 hours is not something I look forward to 🙁 Especially without knowing the reason: if an outage of some sort hit, I wouldn't know how to bring the array back to consistency quickly.

Thanks!

EDIT: I tried the following settings in the net section to no avail:

sndbuf-size       512k;
max-buffers      20480;
max-epoch-size   16384;
unplug-watermark 20480;
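
If these are edited in the config file rather than set at runtime, they still need to be pushed to the running resource; a sketch, again assuming the resource is named r0:

# re-read the configuration and apply the changed net options to the running resource
drbdadm adjust r0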

EDIT 2: For no apparent reason, speed jumped to 10~30M, after I stopped tweaking all the configs. Got up to 98.8% synced, and dropped back to ~300K. No messages in the logs on neither of the servers. Coincidentally, I see a spike in INSERT activity on MySQL database that runs off of this partition. Any ideas?
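To see whether the drop lines up with the backing disks being saturated by that INSERT traffic, watching per-device utilization alongside /proc/drbd is a cheap check. A sketch, assuming the sysstat package is installed:

# print extended per-device statistics every 5 seconds;
# %util near 100 on the DRBD backing device means the disks, not the network, are the bottleneck
iostat -x 5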

EDIT 3: Version: 8.4.2 (api:1/proto:86-101)

Best Answer

After @Nils's comment, I started looking into how utilized the disks are, and noticed that I am getting a lot more reads than I did before the reconfiguration to DRBD. Further research showed disk utilization near 100%, and a slowdown of the batch processes that were running at that time. Adjusting the MySQL config to increase the buffer pool size, which eliminated most of the reads, looks to have fixed the issue.
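For completeness, the change amounts to raising InnoDB's buffer pool so the working set is served from memory instead of hitting the DRBD-backed disks. A sketch of the my.cnf change, assuming InnoDB, with the size as a placeholder to be matched to available RAM and the data set:

# /etc/my.cnf -- the 8G figure is an assumption, size it to the workload
[mysqld]
innodb_buffer_pool_size = 8G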

So the problem was that the drives were so busy that they couldn't handle the resync work DRBD wanted to throw at them.
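
As a related knob, when the application I/O can't be reduced, DRBD 8.4's disk section also has a c-min-rate option, which is the rate the resync gets throttled down to when competing application I/O is detected on the backing device; raising it keeps the resync moving at the expense of the application. A sketch, assuming the resource is named r0 (the 10M figure is an arbitrary example, not a recommendation):

resource r0 {
  disk {
    c-min-rate 10M;   # don't throttle resync below this even while application I/O is hitting the disks
  }
}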