Linux Readahead Settings – LVM, Device-Mapper, Software RAID, and Block Devices

Tags: block-device, linux, lvm, md, software-raid

I've been trying to find a straight answer on this one, and it has proved elusive. This question and its answer are close, but do not really give me the specifics I would like. Let's start with what I think I know.

If you have a standard block device and you run sudo blockdev --report you will get something like this:

RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0    500107862016   /dev/sda
rw   256   512  4096       2048    399999238144   /dev/sda1
rw   256   512  1024  781252606            1024   /dev/sda2

Now, if you change that default 256 to 128 using --setra on any of the partitions, the change applies to the whole block device, like so:

sudo blockdev --setra 128 /dev/sda1
sudo blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   128   512  4096          0    500107862016   /dev/sda
rw   128   512  4096       2048    399999238144   /dev/sda1
rw   128   512  1024  781252606            1024   /dev/sda2

This makes perfect sense to me: the setting lives on the block-level device, not the partition, so it all changes. The default relationship between the RA setting and the resulting readahead size also makes sense to me; it is generally:

RA * sector size (default = 512 bytes)

Hence, with the default sector size, the change I made above drops the readahead from 128 KB to 64 KB. All well and good so far.
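
A quick way to sanity-check that arithmetic on a live machine is sketched below; it assumes /dev/sda exists and that blockdev reports RA in 512-byte units, which is what the numbers above suggest:

# Readahead reported by blockdev is in 512-byte units; convert to KiB
RA=$(sudo blockdev --getra /dev/sda)
echo "readahead = $(( RA * 512 / 1024 )) KiB"

# The kernel exposes the same setting directly, in KiB, via sysfs
cat /sys/block/sda/queue/read_ahead_kb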

However, what happens when we add in a software RAID, or LVM and device-mapper? Imagine your report looks like this instead:

RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0     10737418240   /dev/xvda1
rw   256   512  4096          0    901875499008   /dev/xvdb
rw   256   512  4096          0    108447924224   /dev/xvdj
rw   256   512  4096          0    108447924224   /dev/xvdi
rw   256   512  4096          0    108447924224   /dev/xvdh
rw   256   512  4096          0    108447924224   /dev/xvdg
rw  4096   512  4096          0    433787502592   /dev/md0
rw  4096   512   512          0    429496729600   /dev/dm-0

In this case we have a device-mapped dm-0 LVM device on top of the md0 created by mdadm, which is in fact a RAID0 stripe across the four devices xvdg-j.

Both md0 and dm-0 have RA settings of 4096, far higher than the underlying block devices. So, some questions here:

  • How does the RA setting get passed down the virtual block device chain?
  • Does dm-0 trump all because that is the top level block device you are actually accessing?
  • Would lvchange -r have an impact on the dm-0 device and not show up here?

If it is as simple as "the RA setting from the virtual block device you are using gets passed on", does that mean that a read from dm-0 (or md0) will translate into 4 x 4096 RA reads (one on each underlying block device)? If so, these settings would explode the size of the readahead in the scenario above.

Then in terms of figuring out what the readahead setting is actually doing:

What do you use, equivalent to the sector size above, to determine the actual readahead value for a virtual device:

  • The stripe size of the RAID (for md0)?
  • Some other sector size equivalent?
  • Is it configurable, and how?
  • Does the FS play a part (I am primarily interested in ext4 and XFS)?
  • Or, if it is just passed on, is it simply the RA setting from the top level device multiplied by the sector size of the real block devices?

Finally, would there be any preferred relationship between stripe size and the RA setting (for example)? Here I am thinking that if the stripe is the smallest element that is going to be pulled off the RAID device, you would ideally not want two disk accesses to be needed to service that minimum unit of data, and would want to make the RA large enough to fulfil the request with a single access.
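
To make that concrete, here is a rough sketch of lining RA up with a full stripe. The 64 KiB chunk size and four-disk layout are made-up values for illustration, not figures taken from the array above:

# Hypothetical RAID0 geometry: 4 disks, 64 KiB chunk => 256 KiB full stripe
CHUNK_KB=64
DISKS=4
STRIPE_KB=$(( CHUNK_KB * DISKS ))

# RA is given in 512-byte sectors, so multiply KiB by 2
sudo blockdev --setra $(( STRIPE_KB * 2 )) /dev/md0

# Check the real chunk size of an existing array first
sudo mdadm --detail /dev/md0 | grep -i chunk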

Best Answer

How does the RA setting get passed down the virtual block device chain?

It depends. Let's assume you are inside a Xen domU and have RA=256. Your /dev/xvda1 is actually an LV on the dom0, visible there as /dev/dm-1. So you have RA(domU(/dev/xvda1)) = 256 and RA(dom0(/dev/dm-1)) = 512. The effect is that the dom0 kernel will access /dev/dm-1 with a different RA than the one the domU's kernel uses. Simple as that.
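
A sketch of what that looks like from each side, using the device names assumed in this scenario (they will differ on a real system):

# In the domU: the guest kernel consults its own RA for the virtual disk
blockdev --getra /dev/xvda1      # -> 256

# In the dom0: the host kernel has a separate RA for the backing LV
blockdev --getra /dev/dm-1       # -> 512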

A different situation occurs if we consider a /dev/md0(/dev/sda1,/dev/sda2) setup.

blockdev --report | grep sda
rw   512   512  4096          0   1500301910016   /dev/sda
rw   512   512  4096       2048      1072693248   /dev/sda1
rw   512   512  4096    2097152   1499227750400   /dev/sda2
blockdev --setra 256 /dev/sda1
blockdev --report | grep sda
rw   256   512  4096          0   1500301910016   /dev/sda
rw   256   512  4096       2048      1072693248   /dev/sda1
rw   256   512  4096    2097152   1499227750400   /dev/sda2

Setting /dev/md0 RA won't affect /dev/sdX blockdevices.

rw   256   512  4096       2048      1072693248   /dev/sda1
rw   256   512  4096    2097152   1499227750400   /dev/sda2
rw   512   512  4096          0      1072627712   /dev/md0

So, generally, in my opinion the kernel accesses each block device with whatever RA is actually set on that device. One logical volume can be accessed via the RAID it is part of or via the device-mapper device, each with a different RA, and that RA will be respected.

So the answer is: the RA setting is, IMHO, not passed down the block device chain; rather, whatever RA is set on the top-level device you access is what will be used to read from its constituent devices.
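
One way to sanity-check this on a stacked setup is to read the per-device readahead the kernel itself reports, layer by layer. A sketch, using the device names from the example report above:

# Each layer has its own readahead; read_ahead_kb reports it in KiB
for dev in xvdg xvdh xvdi xvdj md0 dm-0; do
    printf '%-5s %s KiB\n' "$dev" "$(cat /sys/block/$dev/queue/read_ahead_kb)"
done

# Changing the top-level device leaves the members untouched
sudo blockdev --setra 8192 /dev/dm-0
cat /sys/block/dm-0/queue/read_ahead_kb   # now 4096
cat /sys/block/md0/queue/read_ahead_kb    # unchanged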

Does dm-0 trump all because that is the top level block device you are actually accessing?

If by "trump all" you mean deep propagation: as per my previous comment, I think you may have different RAs for different devices in the system.

Would lvchange -r have an impact on the dm-0 device and not show up here?

Yes, but this is a particular case. Let's assume that we have /dev/dm-0, which is LVM's /dev/vg0/blockdevice. If you do:

lvchange -r 512 /dev/vg0/blockdevice

then /dev/dm-0 will also change, because /dev/dm-0 and /dev/vg0/blockdevice are exactly the same block device as far as kernel access is concerned.
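
A quick sketch to verify that, reusing the hypothetical vg0/blockdevice names; note that lvchange takes the readahead value in sectors:

lvchange -r 512 /dev/vg0/blockdevice          # readahead in 512-byte sectors

# Both names resolve to the same dm device, so both report the new value
# (depending on the LVM version, you may need to reactivate the LV first)
blockdev --getra /dev/vg0/blockdevice
blockdev --getra /dev/mapper/vg0-blockdevice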

But let's assume that /dev/vg0/blockdevice is the same as /dev/dm-0 and as /dev/xvda1 in a Xen domU that is using it. Setting the RA of /dev/xvda1 will take effect, but dom0 will still see its own RA.

What do you use, equivalent to the sector size above, to determine the actual readahead value for a virtual device:

I typically discover a good RA by experimenting with different values and testing them with hdparm.
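
For what it is worth, a minimal sketch of that kind of experiment; the device name and the RA values tried are arbitrary, and hdparm -t only measures sequential reads, so test your real workload as well:

# Try a few RA values on the top-level device and time sequential reads
for ra in 256 1024 4096 16384; do
    blockdev --setra $ra /dev/dm-0
    echo "RA=$ra"
    hdparm -t /dev/dm-0      # timed device reads, no prior caching
done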

The stripe size of the RAID (for md0)?

Same as above.

Does the FS play a part (I am primarily interested in ext4 and XFS)?

Sure, this is a very big topic. I recommend you start here: http://archives.postgresql.org/pgsql-performance/2008-09/msg00141.php
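
As a starting point, the usual filesystem-level knob is stripe alignment at mkfs time. A sketch assuming a hypothetical 64 KiB chunk and 4 data disks; these options set allocation alignment rather than readahead, but they affect how sequential I/O maps onto the array:

# ext4: stride = chunk / FS block size (64 KiB / 4 KiB = 16); stripe-width = stride * data disks
mkfs.ext4 -E stride=16,stripe-width=64 /dev/vg0/blockdevice

# XFS: su = chunk size, sw = number of data disks
mkfs.xfs -d su=64k,sw=4 /dev/vg0/blockdevice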