I've been trying to find a straight answer on this one, and it has proved elusive. This question and its answer are close, but do not really give me the specifics I would like. Let's start with what I think I know.
If you have a standard block device and you run sudo blockdev --report
you will get something like this:
RO  RA   SSZ  BSZ   StartSec   Size            Device
rw  256  512  4096  0          500107862016    /dev/sda
rw  256  512  4096  2048       399999238144    /dev/sda1
rw  256  512  1024  781252606  1024            /dev/sda2
Now, say you decide to change that default of 256 to 128 using --setra
on any of the partitions; the change applies to the whole block device, like so:
sudo blockdev --setra 128 /dev/sda1
sudo blockdev --report
RO  RA   SSZ  BSZ   StartSec   Size            Device
rw  128  512  4096  0          500107862016    /dev/sda
rw  128  512  4096  2048       399999238144    /dev/sda1
rw  128  512  1024  781252606  1024            /dev/sda2
This makes perfect sense to me: the setting lives on the block-level device, not the partition, so it all changes. The default relationship between the RA setting and the resulting readahead size also makes sense to me; it is generally:
RA * sector size (default = 512 bytes)
Hence, the change I made above, with the default sector size, will drop the readahead from 128 KiB to 64 KiB. All well and good so far.
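That arithmetic can be sketched in a couple of lines of shell (just the math, no real devices touched):

```shell
# readahead in KiB = RA (counted in 512-byte sectors) * 512 / 1024
ra_kib() { echo $(( $1 * 512 / 1024 )); }

ra_kib 256   # default RA of 256 sectors -> 128 KiB
ra_kib 128   # after --setra 128         -> 64 KiB
```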
However, what happens when we add in a software RAID, or LVM and device-mapper? Imagine your report looks like this instead:
RO  RA    SSZ  BSZ   StartSec  Size            Device
rw  256   512  4096  0         10737418240     /dev/xvda1
rw  256   512  4096  0         901875499008    /dev/xvdb
rw  256   512  4096  0         108447924224    /dev/xvdj
rw  256   512  4096  0         108447924224    /dev/xvdi
rw  256   512  4096  0         108447924224    /dev/xvdh
rw  256   512  4096  0         108447924224    /dev/xvdg
rw  4096  512  4096  0         433787502592    /dev/md0
rw  4096  512  512   0         429496729600    /dev/dm-0
In this case we have a device-mapper device dm-0 (an LVM device) on top of md0, which was created by mdadm and is in fact a RAID0 stripe across the four devices xvdg through xvdj.
Both md0 and dm-0 have RA settings of 4096, far higher than the underlying block devices. So, some questions here:
- How does the RA setting get passed down the virtual block device chain?
- Does dm-0 trump all because that is the top level block device you are actually accessing?
- Would lvchange -r have an impact on the dm-0 device and not show up here?
If it is as simple as the RA setting from the virtual block device you are using being passed on, does that mean that a read from dm-0 (or md0) translates into 4 x 4096-sector RA reads, one on each underlying block device? If so, these settings would explode the size of the readahead in the scenario above.
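To make that concern concrete, here is the arithmetic under the fan-out assumption the question is asking about (not asserting): one RA-sized read per member device.

```shell
RA=4096          # sectors, as reported for md0/dm-0
SECTOR=512       # bytes per sector
MEMBERS=4        # xvdg through xvdj

per_device=$(( RA * SECTOR / 1024 ))    # KiB read from each member
total=$(( per_device * MEMBERS ))       # KiB total if every member honors that RA
echo "${per_device} KiB per device, ${total} KiB total"
```

Under that assumption, a single top-level readahead would balloon to 8 MiB of physical reads, which is exactly why the propagation question matters.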
Then, in terms of figuring out what the readahead setting is actually doing:
What do you use, equivalent to the sector size above, to determine the actual readahead value for a virtual device:
- The stripe size of the RAID (for md0)?
- Some other sector size equivalent?
- Is it configurable, and how?
- Does the FS play a part (I am primarily interested in ext4 and XFS)?
- Or, if it is just passed on, is it simply the RA setting from the top level device multiplied by the sector size of the real block devices?
Finally, would there be any preferred relationship between stripe size and the RA setting, for example? Here I am thinking that if the stripe is the smallest element that will be pulled off the RAID device, you would ideally not want two disk accesses to be needed to service that minimum unit of data, and would want the RA large enough to fulfill the request with a single access.
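As a back-of-the-envelope sketch of that idea: assuming a hypothetical 64 KiB chunk size (the question does not state the real chunk size) on the four-disk RAID0, the RA needed to cover one full stripe works out as follows.

```shell
CHUNK_KIB=64     # per-disk chunk size (assumed, not from the report)
DISKS=4          # xvdg-j RAID0 members
SECTOR=512       # bytes per sector

stripe_kib=$(( CHUNK_KIB * DISKS ))            # one full stripe in KiB
ra_sectors=$(( stripe_kib * 1024 / SECTOR ))   # RA (in sectors) covering one stripe
echo "stripe=${stripe_kib} KiB, RA>=${ra_sectors} sectors"
```

With those assumed numbers, an RA of at least 512 sectors (256 KiB) would let a single readahead span a whole stripe; the reported 4096 covers eight stripes.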
Best Answer
It depends. Let's assume you are inside a Xen domU and have RA=256. Your /dev/xvda1 is actually an LV on the dom0, visible there as /dev/dm1. So you have RA(domU(/dev/xvda1)) = 256 and RA(dom0(/dev/dm1)) = 512. The effect is that the dom0 kernel will access /dev/dm1 with a different RA than the domU's kernel uses. Simple as that.
A different situation occurs if we consider /dev/md0 built from /dev/sda1 and /dev/sda2:
setting the RA on /dev/md0 won't affect the underlying /dev/sdX block devices.
So, generally, in my opinion the kernel accesses a block device with whatever RA is actually set on that device. One logical volume can be accessed via the RAID it is part of, or via a device-mapper device, and each path has its own RA, which will be respected.
So the answer is: the RA setting is, in my opinion, not passed down the block-device chain; whatever the top-level device's RA setting is will be used to access the constituent devices.
If by "trump all" you mean deep propagation: as per my previous point, I think you may have different RAs for different devices in the system.
Yes, but this is a particular case. Let's assume we have /dev/dm0, which is LVM's /dev/vg0/blockdevice. If you change the readahead with lvchange -r on /dev/vg0/blockdevice,
/dev/dm0 will also change, because /dev/dm0 and /dev/vg0/blockdevice are exactly the same block device as far as kernel access is concerned.
But let's assume that /dev/vg0/blockdevice is the same as /dev/dm0 and is the /dev/xvda1 of a Xen domU that is using it. Setting the RA of /dev/xvda1 will take effect, but dom0 will still have its own RA.
I typically discover a good RA by experimenting with different values and testing the result with hdparm.
Same as above.
Sure, this is a very big topic. I recommend you start here: http://archives.postgresql.org/pgsql-performance/2008-09/msg00141.php