Ubuntu – Why does mdadm write unusably slowly when mounted synchronously

debian, disk-cache, mdadm, raid, Ubuntu

I have a 6-disk RAID 6 mdadm array I'd like to benchmark writes to:

root@ubuntu:~# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 sda[0] sdf[5] sde[4] sdd[3] sdc[2] sdb[1]
      1953545984 blocks level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]

Benchmarks can be inaccurate because of caching. For example, notice that the write speed here is higher than it should be:

root@ubuntu:/mnt/raid6# dd if=/dev/zero of=delme bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.276026 s, 380 MB/s

Now we can disable each disk's write cache easily enough:

root@ubuntu:~# hdparm -W0 /dev/sd*

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

/dev/sdb:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

/dev/sdc:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

/dev/sdd:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

/dev/sde:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

/dev/sdf:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)
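
(Remember to turn write caching back on with -W1 once the benchmark is done:)

root@ubuntu:~# hdparm -W1 /dev/sd*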

But the Linux page cache is still absorbing the writes:

root@ubuntu:/mnt/raid6# dd if=/dev/zero of=delme bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.00566339 s, 1.9 GB/s
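
(Between runs we can also flush dirty pages and empty the clean page cache by hand; writing 3 to drop_caches frees the page cache plus dentries and inodes, after a sync so dirty data isn't pinned:)

root@ubuntu:~# sync; echo 3 > /proc/sys/vm/drop_caches

But that only empties the cache; it doesn't stop the next write from landing in it.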

To disable Linux caching, we can mount the filesystem synchronously:

mount -o remount,sync /mnt/raid6

But after this, writes become far slower than they should be:

root@ubuntu:/mnt/raid6# dd if=/dev/zero of=delme bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 23.3311 s, 449 kB/s

It's as if mdadm requires async mounts in order to function. What's going on here?

Best Answer

Quoting the questioner:

But there is still Linux caching:

root@ubuntu:/mnt/raid6# dd if=/dev/zero of=delme bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.00566339 s, 1.9 GB/s

To disable Linux caching, we can mount the filesystem synchronously:

mount -o remount,sync /mnt/raid6

That's not quite right... sync doesn't simply disable caching the way you'd want for a benchmark. It makes every write synchronous, meaning the cache is flushed all the way down to the disk before the write call returns.

Here is a run from one of my servers to illustrate:

$ dd if=/dev/zero of=testfile bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 0.183744 s, 2.9 GB/s

$ dd if=/dev/zero of=testfile bs=1M count=500 conv=fdatasync
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 5.22062 s, 100 MB/s

conv=fdatasync simply means one flush after the writes are done, and the time dd reports includes that flush. Alternatively, you can do:

$ time ( dd if=/dev/zero of=testfile bs=1M count=500 ; sync )
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 0.202687 s, 2.6 GB/s

real    0m2.950s
user    0m0.007s
sys     0m0.339s

And then calculate MB/s from the 2.95 s real time rather than the 0.2 s above. But that is uglier and more work, since the statistics dd prints do not include the sync.
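In this case that works out to 524288000 bytes / 2.95 s ≈ 178 MB/s, a far cry from the 2.6 GB/s figure dd printed.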

If you used "sync", every write would be flushed individually, possibly every block, which runs very slowly. "sync" should only be used on very strict systems, e.g. databases where the loss of a single transaction to a crash is unacceptable (say I transfer a billion bucks from my account to yours, the system crashes, and suddenly you have the money but so do I).
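
You can reproduce that per-write flushing without remounting anything. GNU dd's oflag=dsync opens the output with synchronized I/O, so each block is pushed to disk before the next one is written, which is essentially what the sync mount forced on every application:

$ dd if=/dev/zero of=testfile bs=1M count=10 oflag=dsync

Expect a figure much closer to your 449 kB/s result than to the cached numbers.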

Here is another explanation with additional options, one I read long ago: http://romanrm.ru/en/dd-benchmark

And one more note: the benchmark you are doing this way is totally valid in my opinion, though not in many others'. But it is not a real-life test: it is a single-threaded sequential write. If your real-life use case looks like that, e.g. sending some big files over the network, then it may be a good benchmark. If your use case is different, e.g. an FTP server with 500 people uploading small files at the same time, then it is not very good.
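
If you want to approximate that many-clients case, a rough sketch (job count and file names invented for illustration; tune both to your workload) is to run several writers concurrently and time the whole batch:

$ time ( for i in $(seq 1 8); do
      dd if=/dev/zero of=delme.$i bs=64k count=1000 conv=fdatasync &
    done; wait )

Each of the eight jobs writes 1000 blocks of 64 kB, matching your chunk size, and flushes before exiting; wait holds until the slowest one finishes.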

And also, for best results you should use a randomly generated file held in RAM. It should be random data because some filesystems are too smart when you feed them zeros. And it should be staged in a RAM filesystem, rather than reading /dev/urandom directly during the benchmark, because /dev/random is really slow and /dev/urandom, while faster (e.g. 75 MB/s), is still slower than the disks. On Linux you can use the tmpfs mounted at /dev/shm:

dd if=/dev/urandom of=/dev/shm/randfile bs=1M count=500
dd if=/dev/shm/randfile of=testfile bs=1M count=500 conv=fdatasync
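
Mind that the random file has to fit in RAM, since /dev/shm is memory-backed; shrink count if memory is tight. And if you'd rather bypass the page cache entirely instead of flushing it, GNU dd can also open the output with O_DIRECT (the block size must be suitably aligned; 1M is fine):

dd if=/dev/shm/randfile of=testfile bs=1M count=500 oflag=direct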