Linux – Best practices for thin-provisioning Linux servers (on VMware)

linux, lvm

I have a setup of about 20 Linux machines, each with about 30–150 gigabytes of customer data. The amount of data will probably grow significantly faster on some machines than on others. These are virtual machines on a VMware vSphere cluster, and the disk images are stored on a SAN system.

I'm trying to find a solution that would use disk space sparingly, while still allowing for easy growing of individual machines.

In theory, I would just create big disks for each machine and use thin provisioning, letting each disk grow as needed. However, it seems that a 500 GB ext3 filesystem with only 50 GB of data and quite a low number of writes still easily grows the disk image to e.g. 250 GB over time. Or maybe I'm doing something wrong here? (I was surprised by how little I found on the subject with Google. By the way, there isn't even a thin-provisioning tag on serverfault.com.)

Currently I'm planning to create big, thin-provisioned disks – but with only a small LVM volume on each. For example: a 100 GB volume on a 500 GB disk. That way I could more easily grow the LVM volume and the filesystem online as needed.
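To make that concrete, the plan would look roughly like this (assuming the 500 GB thin disk shows up in the guest as /dev/sdb; device, volume group and volume names are just placeholders):

    # Use the whole disk as a PV here for brevity -- see the bonus
    # question below about whether to partition first.
    pvcreate /dev/sdb
    vgcreate vg_data /dev/sdb
    lvcreate -L 100G -n lv_data vg_data   # 100 GB volume on the 500 GB disk
    mkfs.ext3 /dev/vg_data/lv_data

    # Later, grow the volume and the filesystem online as needed:
    lvextend -L +50G /dev/vg_data/lv_data
    resize2fs /dev/vg_data/lv_data        # ext3 supports online growth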

Now for the actual question:

Are there better ways to do this? (that is, to grow data size as needed without downtime.)

Possible solutions include:

  • Using a thin-provisioning-friendly filesystem that tries to reuse the same blocks over and over again, thus not growing the image size.

  • Finding an easy method of reclaiming free space on the partition (re-thinning? – see the sketch after this list).

  • Something else?
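One semi-manual approach to reclaiming space that I've seen described is to zero out the free space inside the guest and then have VMware deallocate the zeroed blocks. I haven't verified this on every ESXi version, the vmkfstools step requires the VM to be powered off, and all paths below are placeholders:

    # Inside the guest: overwrite the free space with zeros, then delete
    # the file. dd will stop with "No space left on device" -- expected,
    # but the filesystem is briefly full, so be careful on a live system.
    dd if=/dev/zero of=/mnt/data/zerofill bs=1M
    rm /mnt/data/zerofill
    sync

    # On the ESXi host, with the VM powered off: punch out the zeroed
    # blocks (vmkfstools -K / --punchzero; availability depends on the
    # ESXi release).
    vmkfstools -K /vmfs/volumes/datastore1/myvm/myvm.vmdk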

A bonus question: if I go with my current plan, would you recommend creating partitions on the disks (pvcreate /dev/sdX1 vs. pvcreate /dev/sdX)? I think it's against convention to use raw disks without partitions, but skipping the partition would make it a bit easier to grow the disks, if that is ever needed. This is all just a matter of taste, right?
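For reference, this is the growth difference I mean (device names are placeholders): with a whole-disk PV there is no partition table to adjust after enlarging the VMDK.

    # After enlarging the VMDK in vSphere, have the guest re-read the size:
    echo 1 > /sys/class/block/sdb/device/rescan

    # Whole-disk PV (pvcreate /dev/sdb): one step.
    pvresize /dev/sdb

    # Partition-based PV (pvcreate /dev/sdb1): grow the partition first
    # (fdisk/parted), and only then:
    pvresize /dev/sdb1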

Best Answer

If I understand thin provisioning correctly, it can really cause problems if you aren't closely monitoring the growth of your VMFS volumes and you let your VMDKs fill them up. You've seen in your testing that thin-provisioned disks tend to grow quickly toward their full provisioned size, and that they cannot reclaim space that is freed inside the guest OS.
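If you want a quick check of how much a thin disk has actually grown, you can compare its apparent size with the blocks it really occupies from the ESXi shell (the paths here are just examples):

    # Provisioned (apparent) size of the thin disk:
    ls -lh /vmfs/volumes/datastore1/myvm/myvm-flat.vmdk

    # Space actually allocated on the VMFS volume:
    du -h /vmfs/volumes/datastore1/myvm/myvm-flat.vmdk

    # Remaining free space on the datastore:
    df -h /vmfs/volumes/datastore1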

The other option is to create VMDK files sized for your current usage plus expected spikes in growth, and simply add more VMDK files as your application's data usage grows. New VMDK files can be added to a VM live; you just have to rescan the SCSI bus inside the guest (echo "- - -" > /sys/class/scsi_host/host?/scan). You can then partition the new disk, add it to your LVM volume group, and extend the filesystem, all online. This way you always know exactly how much space is allocated to each VM, and you can't accidentally run your VMFS out of space from inside a guest.
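Spelled out, the hot-add procedure looks something like this, assuming the new disk appears as /dev/sdc and the existing volume group is called vg_data (adjust names to your setup):

    # Rescan all SCSI hosts so the guest sees the newly added VMDK:
    for scan in /sys/class/scsi_host/host*/scan; do
        echo "- - -" > "$scan"
    done

    # Partition the new disk (see the alignment example below), then fold
    # it into the existing volume group and grow everything online:
    pvcreate /dev/sdc1
    vgextend vg_data /dev/sdc1
    lvextend -L +100G /dev/vg_data/lv_data
    resize2fs /dev/vg_data/lv_data    # ext3 supports online growth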

As for whether to partition or not when the disk is only going to be used by LVM: I always partition. Partitioning the disk prevents warnings about a bogus partition table from coming up when the machine boots, and it makes it clear that the disk is allocated. It's a bit of voodoo, but I also make sure to start the partition at sector 64 to help ensure the partition and filesystem are block-aligned with the underlying storage. Misalignment is hard to detect and quantify, since you usually don't have anything to easily compare against, but if the OS filesystem isn't aligned properly with the underlying storage you can end up needing extra IOPS to service requests that cross block boundaries on the storage.
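For example, with parted (the device name is a placeholder; 64 sectors = 32 KiB with 512-byte sectors, so parted may print an "not properly aligned for best performance" warning, which is expected here):

    # Create a partition starting at sector 64 and flag it for LVM:
    parted -s /dev/sdb mklabel msdos
    parted -s /dev/sdb unit s mkpart primary 64 100%
    parted -s /dev/sdb set 1 lvm on

    # Verify the start sector (the Start column should read 64s):
    parted -s /dev/sdb unit s print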
