DRBD IOWait – high disk I/O time and iowait, but low network traffic and disk reads/writes

drbd

I need your help.

I have a DRBD cluster (DRBD 9.6.0, kernel 3.10.0-957.21.3, CentOS 7). In this cluster I have two DRBD devices:

  • drbd0 for SSD
  • drbd1 for HDD

drbd0 (SSD, sda) is fine – it is UpToDate. But for drbd1 (HDD, sdb) I see the following:

# drbdadm status
drbd0 role:Primary
  disk:UpToDate
  slave role:Secondary
    peer-disk:UpToDate

drbd1 role:Primary
  disk:UpToDate
  slave role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:0.17

It is VERY slow – only 0.17% after 6 hours. I know an HDD is slower than an SSD, but this is really bad.
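
If I understand the DRBD 9 docs right, the resync speed is governed by the dynamic sync-rate controller (the c-* options below), and a misbehaving controller can throttle the resync almost to zero. As a quick test – just a sketch, and 100M is only an example value – the rate can be pinned temporarily on the SyncSource and the progress watched:

# drbdadm disk-options --c-plan-ahead=0 --resync-rate=100M drbd1    # disable the controller, pin a fixed rate
# drbdsetup status drbd1 --verbose --statistics                     # shows done:% and current statistics
# drbdadm adjust drbd1                                              # revert to the values from the .res file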

Info:

Here is my configuration (the commented-out lines are my experiments):

# cat /etc/drbd.d/global_common.conf 
global {
 usage-count  yes;
}
common {
 net {
  protocol B;
  }
}

# cat /etc/drbd.d/drbd0.res 
resource drbd0 {
        on master {
                device /dev/drbd0;
                disk /dev/mapper/vg_ssd_drbd-lv_ssd_drbd;
                meta-disk internal;    
                address 192.168.100.15:7788;
        }
        on slave  {
                device /dev/drbd0;
                disk /dev/mapper/vg_ssd_drbd-lv_ssd_drbd;
                meta-disk internal;
                address 192.168.100.17:7788;
        }
        net {
                sndbuf-size 10M;
                rcvbuf-size 10M;
                ping-int 2;
                ping-timeout 2;
                connect-int 2;
                timeout 5;
                ko-count 5;
                max-buffers 128k;
                max-epoch-size 8192;
                verify-alg md5;
        }
        disk {
                c-plan-ahead 20;
                c-min-rate 1M;
                c-max-rate 600M;
                c-fill-target 2M;
                al-extents 3389;
        }
}

# cat /etc/drbd.d/drbd1.res 
resource drbd1 {
        on master {
                device /dev/drbd1;
                disk /dev/mapper/vg_hdd_drbd-lv_hdd_drbd;
                meta-disk internal;    
                address 192.168.100.15:7789;
        }
        on slave  {
                device /dev/drbd1;
                disk /dev/mapper/vg_hdd_drbd-lv_hdd_drbd;
                meta-disk internal;
                address 192.168.100.17:7789;
        }
        net {
                #sndbuf-size 1M;
                #rcvbuf-size 1M;
                ping-int 2;
                ping-timeout 2;
                connect-int 2;
                timeout 5;
                ko-count 5;
                #max-buffers 12k;
                #max-epoch-size 8192;
                #verify-alg md5;
        }
        disk {
                #c-plan-ahead 20;
                c-min-rate 1K;
                c-max-rate 600M;
                #c-fill-target 2M;
                al-extents 919;
        }
}
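
One note on those experiments: as far as I know, editing the .res file alone does not change the running resource. Something like this (just a sketch) shows what DRBD actually parsed and applies the file to the kernel:

# drbdadm dump drbd1      # configuration as DRBD parses it
# drbdadm adjust drbd1    # apply the current .res file to the running resource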

The servers are connected with a direct 10 Gbps link – both are in the same room.
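
Because protocol B waits for the peer's network acknowledgement, the link itself is worth ruling out. A rough check, assuming iperf3 is installed on both nodes (the addresses are the ones from the .res files):

# iperf3 -s                              # on the slave (192.168.100.17)
# iperf3 -c 192.168.100.17 -t 30         # on the master; should come close to 10 Gbit/s
# ping -c 100 -i 0.2 192.168.100.17      # latency and packet loss on the replication link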

I can show you my monitoring:

[image: monitoring graphs]

At night I synced the SSD – all good.
During the day I tried to sync the HDD, and it is painful.

Disk I/O time shoots up instantly, but there are practically no read or write operations on either server. The network traffic shows the same picture.

[image: disk I/O and network traffic graphs]

When I connect to the server, I see this:

top - 12:52:35 up 1 day, 10:44,  1 user,  load average: 1.01, 1.06, 1.26
Tasks: 492 total,   1 running, 491 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.3 sy,  0.0 ni,  0.0 id, 99.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

drbd1 (I can see it in iostat) keeps one CPU at 100% iowait, but its ReadKB and WriteKB values are close to zero.
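
For reference, this is the kind of picture I mean – a sketch with iostat from sysstat, run on the master; the odd part is that await and %util stay high while rkB/s and wkB/s stay near zero for both the backing disk and the DRBD device:

# iostat -x sdb drbd1 2      # compare rkB/s and wkB/s against await and %util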

I googled this problem and was advised to check the TCP buffers, but they look fine. I also reset all DRBD settings for drbd1 to their defaults, with no result.

[image: TCP buffer graph]
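
One way to check the buffers (a sketch; the address is the peer from the .res files) is the system-wide limits via sysctl plus the per-socket values of the replication connection via ss:

# sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem
# ss -tmi dst 192.168.100.17      # skmem and window details of the DRBD sockets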

I tried to diagnose the problem myself and found two anomalies:

First, I see "Time Spent Doing I/Os" reaching 1 second. I think I am running into a timeout here.

[image: Time Spent Doing I/Os graph]

Second, in the directory where the HDD device is mounted, I see a big difference between the df/du and ls numbers. Maybe this is a KVM feature, but I am not sure (see the check after the listing below).

du -sh /data/hdd-drbd/*
170M    /data/hdd-drbd/awx-add.qcow2
7.7G    /data/hdd-drbd/awx.qcow2
2.0G    /data/hdd-drbd/template-DISABLE.qcow2
ls -lah /data/hdd-drbd/
total 9.8G
drwxr-xr-x  2 root root   74 Aug 16 17:37 .
drwxr-xr-x. 8 root root   91 Aug 14 22:11 ..
-rw-------  1 qemu qemu 201G Aug 15 19:41 awx-add.qcow2
-rw-------  1 qemu qemu 7.7G Aug 18 17:26 awx.qcow2
-rw-------  1 root root  46G Aug 15 13:48 template-DISABLE.qcow2
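
This difference is most likely just sparse files: ls -l reports the apparent file size, while du reports the blocks actually allocated, and qcow2 images are commonly thin-provisioned. A sketch to confirm (the paths are the files from the listing above):

# du -h --apparent-size /data/hdd-drbd/awx-add.qcow2    # apparent size, matches ls
# du -h /data/hdd-drbd/awx-add.qcow2                    # allocated blocks, matches du above
# qemu-img info /data/hdd-drbd/awx-add.qcow2            # "virtual size" vs "disk size"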

For now I am going to move all the data to the SSD and try to resync an empty disk – maybe that will work. But I still need help with this problem – do you have any ideas for this situation?

EDIT:

One more thing – why am I resyncing the storage at all? I added a PV to the LVM behind drbd0/drbd1 and resized the DRBD devices. Maybe this is important information… Before those operations DRBD worked fine.
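
For context, the usual online grow procedure is roughly the following (a sketch; +100G is only an example size). The backing LV has to be grown on both nodes first, and drbdadm resize then resynchronizes only the newly added area from the node it is run on, which would explain why a resync started after the resize:

# lvextend -L +100G /dev/vg_hdd_drbd/lv_hdd_drbd    # on both nodes
# drbdadm resize drbd1                              # on the primary; the new area is then synced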

EDIT2:

Resyncing the empty disk behaves exactly the same…

Best Answer

I ended up with a crutch-style (workaround) solution.

First, I moved all the data from the HDD DRBD device to the SSD one and recreated the DRBD device. Since then, the sync works fine.

Second, I (maybe) found one performance problem. See the graph:

[image: performance graph]

I had about two hours of nice performance, but then I started my KVM VMs. And magic: performance dropped (around 13:10 on the graph). When I stopped the VMs, performance recovered.

I think this means you should not put even a minimal load on DRBD during synchronization. However, I really hope this problem will disappear once synchronization is finished.
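
If the resync really does collapse as soon as the VMs generate I/O, the dynamic resync controller can be told to keep a guaranteed minimum rate even under application load – as far as I understand, c-min-rate is exactly that knob. A sketch for drbd1, with example values, in the same style as the existing disk section:

        disk {
                c-plan-ahead 20;
                c-fill-target 2M;
                # minimum resync rate kept even when application I/O is detected (example value)
                c-min-rate 10M;
                c-max-rate 600M;
        }

followed by drbdadm adjust drbd1 to apply it to the running resource.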