Debian Lenny – SAN – LVM Fail

data-recovery, debian, disaster-recovery, ext3, lvm

I've got a Lenny server with a SAN-attached LUN configured as the only PV for a VG named 'datavg'.

Yesterday, I updated the box with Debian patches and gave it a reboot.

After the reboot, it didn't come up, complaining that it couldn't find /dev/mapper/datavg-datalv.

This is what I did:

– booted into rescue mode and commented out the mount in /etc/fstab

– rebooted into multi-user mode (the mountpoint is /data; only postgresql could not start)

– ran vgdisplay, lvdisplay and pvdisplay to find out what happened to the volume group (datavg was missing entirely); see the sketch just below
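Those checks were roughly the following (a sketch; the exact flags are from memory):

vgdisplay -v
lvdisplay
pvdisplay
pvs --all    # also lists block devices that LVM does not recognize as PVs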

After that, I noticed that the LUN was still visible from Linux and that the LVM partition was also visible:

# ls -la /dev/mapper/mpath0*
brw-rw---- 1 root disk 254, 6 2009-11-23 15:48 /dev/mapper/mpath0
brw-rw---- 1 root disk 254, 7 2009-11-23 15:48 /dev/mapper/mpath0-part1

– Then, I tried pvscan in order to find out if it could find the PV. Unfortunately, it did not detect the partition as a PV.
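For reference, the scan itself was plain pvscan; another check worth doing at this point is whether /etc/lvm/lvm.conf is filtering out the device-mapper devices (a sketch, assuming the stock Lenny config):

pvscan -v
grep -E 'filter|types' /etc/lvm/lvm.conf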

– I ran pvck on the partition, but it did not find any label:

# pvck /dev/mapper/mpath0-part1 
  Could not find LVM label on /dev/mapper/mpath0-part1

– Then, I was wondering if the LUN was perhaps empty, so I dumped the first few MB of the partition with dd. In the dump, I could see the LVM metadata:

datavg {
id = "removed-hwEK-Pt9k-Kw4F7e"
seqno = 2
status = ["RESIZEABLE", "READ", "WRITE"]
extent_size = 8192
max_lv = 0
max_pv = 0

physical_volumes {

pv0 {
id = "removed-AfF1-2hHn-TslAdx"
device = "/dev/dm-7"

status = ["ALLOCATABLE"]
dev_size = 209712382
pe_start = 384
pe_count = 25599
}
}

logical_volumes {

datalv {
id = "removed-yUMd-RIHG-KWMP63"
status = ["READ", "WRITE", "VISIBLE"]
segment_count = 1

segment1 {
start_extent = 0
extent_count = 5120

type = "striped"
stripe_count = 1        # linear

stripes = [
"pv0", 0
]
}
}
}
}

Note that this came from the partition where pvck could not find an LVM label!
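For reference, the dump was made roughly like this (block size and count are from memory):

dd if=/dev/mapper/mpath0-part1 bs=1M count=4 | strings | less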

– I decided to write a new LVM label to the partition and restore the parameters from the backup file.

pvcreate --uuid removed-AfF1-2hHn-TslAdx --restorefile /etc/lvm/backup/datavg  /dev/mapper/mpath0-part1

– Then I ran vgcfgrestore -f /etc/lvm/backup/datavg datavg

– After that, the PV appears when I issue pvscan.

– With vgchange -ay datavg, I activated the VG and the LV became available.

– When I tried to mount the LV, mount did not find a filesystem. I tried to recover it in several ways, but did not succeed.
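The kind of non-destructive checks that apply here are along these lines (a sketch; not necessarily the exact commands I used):

file -s /dev/datavg/datalv              # reports "ext3 filesystem data" if a superblock is present
dumpe2fs -h /dev/datavg/datalv          # prints the primary superblock, if one can be found
e2fsck -n -b 32768 /dev/datavg/datalv   # read-only check against a backup superblock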

– After making a dd copy of the affected LV, I tried to recreate the superblocks with

mkfs.ext3 -S /dev/datavg/backupdatalv

– but the result of this cannot be mounted:

# mount /dev/datavg/backupdatalv /mnt/
mount: Stale NFS file handle

The fact that this can happen in the first place is not very nice, to say the least, so I want to find out everything I can about this malfunction.

My questions:

– How can it be that the LVM label disappears after patches and a reboot?

– Why is the filesystem not there after salvaging the PV? (Did the pvcreate command trash the data?)

– Is the ext3 filesystem in the LV still salvageable?

– Is there anything I could have done to prevent this issue?

Thanks in advance,
Ger.

Best Answer

I once ran into a similar problem. In our case, someone created a partition to hold the PV, but when they ran the pvcreate command, they forgot to specify the partition and instead used the whole device. The system ran fine until a reboot, when LVM could no longer find the PV.

So in your case, is it possible that someone ran "pvcreate /dev/mapper/mpath0" at the time of creation rather than "pvcreate /dev/mapper/mpath0-part1"? If so, you'll need to remove the partition table from the disk containing the PV.
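One quick way to check that theory (a sketch, using the device names from the question) is to look for an LVM label on the whole device rather than on the partition:

pvck /dev/mapper/mpath0
dd if=/dev/mapper/mpath0 bs=512 skip=1 count=1 | strings    # an LVM2 label normally sits in the second sector and contains "LABELONE"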

From the pvcreate(8) man page, the way to delete a partition table is:

dd if=/dev/zero of=PhysicalVolume bs=512 count=1

LVM will not recognize a whole-device PV if there is a partition table on the device. Once we removed the partition table, the PV was recognized and we could access our data again.
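Once the partition table is gone, something along these lines should make the VG visible again (a sketch; device and VG names are taken from the question):

kpartx -d /dev/mapper/mpath0    # drop the now-stale mpath0-part1 mapping
pvscan
vgchange -ay datavg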
