SSD, Erase Block Size & LVM: PV on raw device, Alignment

Tags: alignment, lvm, ssd

I want to install a new SSD and use the whole device as a PV for LVM – in other words, I don't plan to put even a single partition on this device. So aligning partitions to the erase blocks is not needed.

Question(s)

Is it sufficient to set --dataalignment to the erase block size when running pvcreate, and --physicalextentsize to a multiple of the erase block size when running vgcreate?

So, assuming my SSD has an erase block size of 1024k, is it OK to

  • pvcreate --dataalignment 1024k /dev/ssd
  • vgcreate --physicalextentsize $(( x * 1024 ))k ...

Anything else to take into account?
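For reference, once the PV and VG exist I'd sanity-check the result with something like the following (/dev/ssd and "vg0" are just placeholders):

    # Show where the first physical extent starts on the device – it should be 1024k.
    pvs -o +pe_start --units k /dev/ssd
    # Show the extent size chosen for the VG – it should be a multiple of 1024k.
    vgs -o +vg_extent_size --units k vg0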

Assuming I'd put ext4 filesystems on the LVs in this VG, it would be a good idea to align the ext4 extents to the LVM PE size, right? So ext4 extents should be the same size as, or a multiple of, the LVM PE size?

Thanks for any clarification!

Best Answer

Yes, I also checked the on-disk layouts of MBR/PBR/GPT/MD/LVM and came to the same conclusion.

For your case (LVM on a raw disk), if the LVM PE (physical extent) area is 1MB-aligned with pvcreate, you can be sure all further data allocation will be aligned, as long as you keep allocation sizes at (1MB * N).
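As a rough sketch of how to check that this holds on an actual system (device, VG and LV names are placeholders):

    # First PE offset on the PV, in bytes – with --dataalignment 1024k this should be 1048576.
    pvs -o pv_name,pe_start --units b /dev/ssd
    # Starting PE of each LV segment. The LV's byte offset on the disk is
    #   pe_start + seg_start_pe * extent_size
    # which stays a multiple of 1MB as long as the extent size is (1MB * N).
    lvs -o lv_name,seg_start_pe,devices vg0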

Since both "vgcreate -s" and "lvcreate -L" handles size-without-unit as MB value by default, you probably do not need to care much about alignment once you've done your pvcreate properly. Just make sure not to give size in %/PEs (for lvcreate -l) and B(byte)/S(512B - sector is always 512B in LVM)/K(KB) (for vgcreate -s and lvcreate -L).

=== added for clarification ===

Just as a follow-up: while an SSD may have a 1024KB erase block size as a whole device, each internal flash chip's erase block size / read-write page size is probably about 32KB-128KB / 512B-8KB.

Although this depends on each SSD's controller, the I/O penalty from extra read-modify-write cycles probably won't occur as long as you keep your writes aligned to the erase block size of each internal chip, which is 32KB-128KB in the example above. It's just that you want a single write request to be big enough (= the erase block size of the SSD as a whole device), so you can expect better performance by efficiently driving all internal chips/channels.
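If you want to see what the drive itself reports to the kernel, the block queue limits are exposed in sysfs (sdX is a placeholder; many SSDs report 0 or very conservative values here, so treat them only as hints):

    cat /sys/block/sdX/queue/physical_block_size   # physical sector size, as advertised
    cat /sys/block/sdX/queue/minimum_io_size       # preferred minimum I/O granularity
    cat /sys/block/sdX/queue/optimal_io_size       # preferred size for large I/O, often 0 on SSDs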

My understanding is that 1024KB alignment is only a safety measure, as controller chip behavior varies by vendor and flash chip specs change rapidly. It's more important to have OS-level write requests issued in large bundles (1024KB, in this case).

Now, having said that, doing mkfs(8) on a 1MB-aligned LVM block will almost certainly break 1MB alignment for filesystem-level data/metadata. Most filesystems only care about 4KB alignment, so it's probably not perfect for SSDs (but, IIRC, recent filesystems like btrfs try to keep 64KB+ alignment when allocating internal contiguous blocks). Many filesystems do have a feature to bundle writes (e.g. stripe-size configuration) to get performance out of RAID, though, and that can be used to make write requests to the SSD near-optimal.
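As a sketch of that bundling feature for ext4, the RAID layout hints can be reused here. Assuming 4KB blocks and a 1024KB target, 1024KB / 4KB = 256 blocks (the LV path is a placeholder):

    # stride/stripe_width are meant for RAID, but they also make ext4 try to
    # align and batch its allocations on 256-block (1MB) boundaries.
    mkfs.ext4 -b 4096 -E stride=256,stripe_width=256 /dev/vg0/data
    # The values can be inspected or changed later:
    dumpe2fs -h /dev/vg0/data | grep -i 'stride\|stripe'
    tune2fs -E stride=256,stripe_width=256 /dev/vg0/data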

I really want to back my statement with actual data, but it was really difficult to prove, as today's SSD controllers are so intelligent that they won't show much performance degradation once both the alignment size and the write size are "big enough". Just make sure it's not ill-aligned (avoid <4KB alignment at all costs) and not too small (1024KB is big enough).

Also, if you really care about the I/O penalty, double-check by disabling the device cache and benchmarking with a synced read-write-rewrite test.
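A minimal benchmarking sketch with direct, synced I/O (the LV path and device name are placeholders, and the write test destroys whatever is on that LV):

    # Optionally disable the drive's write cache first for worst-case numbers.
    hdparm -W 0 /dev/sdX
    # Large aligned writes, bypassing the page cache and flushing at the end.
    dd if=/dev/zero of=/dev/vg0/scratch bs=1M count=1024 oflag=direct conv=fdatasync
    # Read it back the same way.
    dd if=/dev/vg0/scratch of=/dev/null bs=1M count=1024 iflag=direct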
