SSD, Erase Block Size & LVM: PV on raw device, Alignment

Tags: alignment, lvm, ssd

I want to install a new SSD and use the whole device as a PV for LVM – in other words, I don't plan to put even a single partition on this device. So aligning partitions to the erase blocks is not needed.

Question(s)

Is it sufficient to set --dataalignment to the erase block size when running pvcreate, and --physicalextentsize to a multiple of the erase block size when running vgcreate?

So, assuming my SSD has an erase block size of 1024k, is it OK to

  • pvcreate --dataalignment 1024k /dev/ssd
  • vgcreate --physicalextentsize $(( x * 1024 ))k ...

Anything else to take into account?
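For reference, once the PV and VG exist I'd sanity-check the result with something like the following (/dev/ssd and "vg0" are just placeholders):

    # Show where the first physical extent starts on the device – it should be 1024k.
    pvs -o +pe_start --units k /dev/ssd
    # Show the extent size chosen for the VG – it should be a multiple of 1024k.
    vgs -o +vg_extent_size --units k vg0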

Assuming I'd put ext4 filesystems on the LVs in this VG, it would be a good idea to align the ext4 extents to the LVM PE size, right? So ext4 extents should be the same size as, or a multiple of, the LVM PE size?

Thanks for any clarification!

Best Answer

Yes, I also checked the on-disk layouts of MBR/PBR/GPT/MD/LVM and came to the same conclusion.

For your case (LVM on a raw disk), if the LVM PE (physical extent) area is 1MB-aligned with pvcreate, you can be sure all further data allocation will be aligned, as long as you keep allocation sizes at (1MB * N).
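As a rough sketch of how to check that this holds on an actual system (device, VG and LV names are placeholders):

    # First PE offset on the PV, in bytes – with --dataalignment 1024k this should be 1048576.
    pvs -o pv_name,pe_start --units b /dev/ssd
    # Starting PE of each LV segment. The LV's byte offset on the disk is
    #   pe_start + seg_start_pe * extent_size
    # which stays a multiple of 1MB as long as the extent size is (1MB * N).
    lvs -o lv_name,seg_start_pe,devices vg0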

Since both "vgcreate -s" and "lvcreate -L" handles size-without-unit as MB value by default, you probably do not need to care much about alignment once you've done your pvcreate properly. Just make sure not to give size in %/PEs (for lvcreate -l) and B(byte)/S(512B - sector is always 512B in LVM)/K(KB) (for vgcreate -s and lvcreate -L).

=== added for clarification ===

Just as a follow-up: while an SSD may have a 1024KB erase block size as a whole device, each internal flash chip's erase block size / read-write page size is probably about 32KB-128KB / 512B-8KB.

Although this depends on each SSD's controller, the I/O penalty from extra read-modify-write cycles probably won't occur as long as you keep your writes aligned to the erase block size of each internal chip, which is 32KB-128KB in the example above. It's just that you want a single write request to be big enough (= the erase block size of the SSD as a whole device), so you can expect better performance by efficiently driving all internal chips/channels.
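If you want to see what the drive itself reports to the kernel, the block queue limits are exposed in sysfs (sdX is a placeholder; many SSDs report 0 or very conservative values here, so treat them only as hints):

    cat /sys/block/sdX/queue/physical_block_size   # physical sector size, as advertised
    cat /sys/block/sdX/queue/minimum_io_size       # preferred minimum I/O granularity
    cat /sys/block/sdX/queue/optimal_io_size       # preferred size for large I/O, often 0 on SSDs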

My understanding is that 1024KB alignment is only a safety measure, as controller chip behavior varies by vendor and flash chip specs change rapidly. It's more important to have OS-level write requests issued in large bundles (1024KB, in this case).

Now, having said that, doing mkfs(8) on a 1MB-aligned LVM block will almost certainly break 1MB alignment for filesystem-level data/metadata. Most filesystems only care about 4KB alignment, so it's probably not perfect for SSDs (but, IIRC, recent filesystems like btrfs try to keep 64KB+ alignment when allocating internal contiguous blocks). Many filesystems do have a feature to bundle writes (e.g. stripe-size configuration) to get performance out of RAID, though, and that can be used to make write requests to the SSD near-optimal.
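As a sketch of that bundling feature for ext4, the RAID layout hints can be reused here. Assuming 4KB blocks and a 1024KB target, 1024KB / 4KB = 256 blocks (the LV path is a placeholder):

    # stride/stripe_width are meant for RAID, but they also make ext4 try to
    # align and batch its allocations on 256-block (1MB) boundaries.
    mkfs.ext4 -b 4096 -E stride=256,stripe_width=256 /dev/vg0/data
    # The values can be inspected or changed later:
    dumpe2fs -h /dev/vg0/data | grep -i 'stride\|stripe'
    tune2fs -E stride=256,stripe_width=256 /dev/vg0/data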

I really want to back my statement with actual data, but it was really difficult to prove, as today's SSD controllers are so intelligent that they won't show much performance degradation once both the alignment size and the write size are "big enough". Just make sure it's not ill-aligned (avoid <4KB alignment at all costs) and not too small (1024KB is big enough).

Also, if you really care about the I/O penalty, double-check by disabling the device cache and benchmarking with a synced read-write-rewrite test.
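A minimal benchmarking sketch with direct, synced I/O (the LV path and device name are placeholders, and the write test destroys whatever is on that LV):

    # Optionally disable the drive's write cache first for worst-case numbers.
    hdparm -W 0 /dev/sdX
    # Large aligned writes, bypassing the page cache and flushing at the end.
    dd if=/dev/zero of=/dev/vg0/scratch bs=1M count=1024 oflag=direct conv=fdatasync
    # Read it back the same way.
    dd if=/dev/vg0/scratch of=/dev/null bs=1M count=1024 iflag=direct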
