Linux not picking up new partition correctly on emc pseudo device

clariion emc linux

We have a database server running Oracle RAC. We were recently running out of space on the main LUN that it is attached to, so I created a new 100GB LUN and concatenated it onto the existing LUN, creating a new MetaLUN. After some messing about I managed to get Linux to recognise the new space, and I then created a new partition on the pseudo device to use it. When I have done this on other systems, the next step was to create an ASM disk on the new partition and add that disk to the Oracle disk group. This time, however, that step fails. I am aware of various issues with ASM and PowerPath, but I don't think they are the cause here: while investigating, I discovered that one of the underlying logical devices does not reflect the size change. See below.

**powermt displays all of the underlying logical units**

[root@XXXXX~]# powermt display dev=emcpowerd  
Pseudo name=emcpowerd  
CLARiiON ID=CKM00091500009 [VFRAC2]  
Logical device ID=6006016030312200787502866C65DE11 [LUN 30]  
state=alive; policy=CLAROpt; priority=0; queued-IOs=0  

Owner: default=SP A, current=SP A       Array failover mode: 1  
`==============================================================================`
---------------- Host ---------------   - Stor -   -- I/O Path -  -- Stats ---  
`###  HW Path                I/O Paths    Interf.   Mode    State  Q-IOs Errors`
`==============================================================================`
   3 qla2xxx                   sde       SP A0     active  alive      0      0  
   3 qla2xxx                   sdj       SP B0     active  alive      0      0  
   4 qla2xxx                   sdo       SP A1     active  alive      0      0  
   4 qla2xxx                   sdt       SP B1     active  alive      0      0  

**Fdisk on the pseudo device shows correct space.**

[root@XXXXX ~]# fdisk -l /dev/emcpowerd  

Disk /dev/emcpowerd: 429.4 GB, 429496729600 bytes  
255 heads, 63 sectors/track, 52216 cylinders  
Units = cylinders of 16065 * 512 = 8225280 bytes  

         Device Boot      Start         End      Blocks   Id  System  
/dev/emcpowerd1               1       39162   314568733+  83  Linux  
/dev/emcpowerd2           39163       52216   104856255   83  Linux  

**fdisk on one of the logical units is wrong**

[root@XXXXX~]# fdisk -l /dev/sde  

Disk /dev/sde: 322.1 GB, 322122547200 bytes    
255 heads, 63 sectors/track, 39162 cylinders  
Units = cylinders of 16065 * 512 = 8225280 bytes  

   Device Boot      Start         End      Blocks   Id  System  
/dev/sde1               1       39162   314568733+  83  Linux  
/dev/sde2           39163       52216   104856255   83  Linux  

**fdisk on the rest of the units is fine**

[root@XXXXX ~]# fdisk -l /dev/sdj  
Disk /dev/sdj: 429.4 GB, 429496729600 bytes  
255 heads, 63 sectors/track, 52216 cylinders  
Units = cylinders of 16065 * 512 = 8225280 bytes  
   Device Boot      Start         End      Blocks   Id  System  
/dev/sdj1               1       39162   314568733+  83  Linux  
/dev/sdj2           39163       52216   104856255   83  Linux

Also, when I created the partition, Linux did not create any entries in the /dev directory for the second partition, so I created these manually:

[root@XXXXX dev]# mknod sde2 b 8 66
[root@XXXXX dev]# ls -al sd[ejot]?  
brw-r----- 1 root disk  8,  65 Dec 29 14:20 sde1  
brw-r--r-- 1 root disk  8,  66 Apr  8 20:31 sde2  
brw-r----- 1 root disk  8, 145 Dec 29 14:19 sdj1  
brw-r--r-- 1 root disk  8, 146 Apr  8 20:33 sdj2  
brw-r----- 1 root disk  8, 225 Apr  6 23:12 sdo1  
brw-r--r-- 1 root disk  8, 226 Apr  8 20:33 sdo2  
brw-r----- 1 root disk 65,  49 Dec 29 14:19 sdt1  
brw-r--r-- 1 root disk 65,  50 Apr  8 20:33 sdt2
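
The major and minor numbers in the listing above follow the kernel's usual sd numbering: each disk gets a block of 16 minors, the first 16 disks use major 8, and the next 16 use major 65. A minimal sketch of that arithmetic, assuming single-letter sd names only (`sd_devnum` is a hypothetical helper, not an existing tool):

```shell
#!/bin/sh
# sd_devnum DISK PART: print "major minor" for a partition of a
# single-letter sd disk (sda..sdz). Hypothetical helper for illustration.
sd_devnum() {
    disk="$1"; part="$2"
    letter="${disk#sd}"
    # Position of the disk in the sd sequence: sda=0, sdb=1, ...
    index=$(( $(printf '%d' "'$letter") - 97 ))
    if [ "$index" -lt 16 ]; then
        major=8
    else
        major=$(( 64 + index / 16 ))   # 65 for disks 16..31
    fi
    # 16 minors per disk: the whole disk first, then partitions 1..15
    minor=$(( (index % 16) * 16 + part ))
    echo "$major $minor"
}
```

For example, `sd_devnum sde 2` gives the pair used in the mknod command above, and `sd_devnum sdt 2` lands on major 65 because sdt is the twentieth disk.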

This is a production server that we cannot easily reboot.

Any ideas would be much appreciated.

Best Answer

Apart from partprobe, try using the blockdev utility to re-read the partition table of the device:

blockdev --rereadpt /dev/sde

If that does not help, the problem may be that the LUN itself hasn't been properly updated.
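
One way to see which paths still report the old capacity is to read the size the kernel holds for each path device in sysfs. A minimal sketch, assuming the paths are the four sd devices from the powermt output; `path_sizes` and the `SYSFS_ROOT` override are hypothetical names added here so the function can be tried against a scratch directory:

```shell
#!/bin/sh
# path_sizes DEV...: print "<dev> <bytes>" using the kernel's recorded size.
# SYSFS_ROOT is a hypothetical override; it defaults to the real /sys.
path_sizes() {
    root="${SYSFS_ROOT:-/sys}"
    for dev in "$@"; do
        # /sys/block/<dev>/size is expressed in 512-byte sectors
        sectors=$(cat "$root/block/$dev/size") || continue
        echo "$dev $(( sectors * 512 ))"
    done
}
```

Running `path_sizes sde sdj sdo sdt` only reads from sysfs, so it should be safe on a live system; a path whose byte count differs from the others is the one that has not picked up the resize.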

You can try issuing a rescan command against the Fibre Channel or SCSI host through the /sys filesystem.

Some time ago I wrote this scsi_rescan_bus.sh script to cope with our EMC Clariion devices:

#!/bin/sh
host_number="$1"
echo "1" > /sys/class/fc_host/host${host_number}/issue_lip
sleep 10
echo "- - -" > /sys/class/scsi_host/host${host_number}/scan

I'm not entirely sure that it still works with modern kernels and devices. Always test it in a dedicated test environment before trying it in production!
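
Note that the wildcard scan in the script above is aimed at discovering devices. For a LUN that already exists but has changed size, each path device also exposes a per-device `rescan` attribute in sysfs, which asks the SCSI layer to re-read that device's capacity. A minimal sketch, assuming the path names from the question; `rescan_paths` and `SYSFS_ROOT` are hypothetical names, and whether PowerPath then needs its own refresh is not covered here:

```shell
#!/bin/sh
# rescan_paths DEV...: ask the SCSI layer to re-read each device's capacity.
# SYSFS_ROOT is a hypothetical override; it defaults to the real /sys.
rescan_paths() {
    root="${SYSFS_ROOT:-/sys}"
    for dev in "$@"; do
        # Writing 1 triggers a rescan of this one device only
        echo 1 > "$root/block/$dev/device/rescan"
    done
}
```

On the server in the question this would be called as `rescan_paths sde sdj sdo sdt` as root, followed by another look at `fdisk -l /dev/sde`. The same caveat applies: try it on a test host first.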

There are numerous gotchas, so make sure you read these relevant threads:

http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1454807

And the official Red Hat documentation ("Online Storage Reconfiguration Guide"): http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Online_Storage_Reconfiguration_Guide/index.html

Assuming that your /dev/md5 was never used in the LVM:

(...had you ever looked at pvscan before today?)

If you don't have backups, now is the time to start. If you do, now is the time to test them (and if they don't work, you don't have backups, see step 1).

There isn't an easy way out of this mess, and I haven't got a clue what might happen if you reboot at this point (can you unmount the filesystem?). Suppose what really happened is that sdj was added both as a raid drive and as an LVM physical volume. Since LVM wasn't using the raid driver to write to sdj, none of the data written to sdj would be on sdk. Perhaps this can be verified by comparing hex dumps of various chunks of /dev/sdj and /dev/sdk; someone smarter than me may know good places to look for things that would say "this is XFS" versus "this is random gibberish or a blank drive". If I were certain that this is what happened, then what I'd do is this:

Start by trying to get SMART data on sdk to see if it is trustworthy or on the way out.

If sdk is good, then I would thank my lucky stars for the former admin having wasted 63GB of /dev/sdj.

fdisk /dev/sdk

(doublecheck EVERYTHING before hitting return). Have fdisk create a partition table and an md partition (mdadm manpage says use 0xDA, but every walkthrough and my own experience says 0xFD for raid autodetect), then

mdadm --create /dev/md6 --level=1 --raid-devices=2 missing /dev/sdk1

(doublecheck EVERYTHING before hitting return). This will create a degraded raid1 array named md6 using the partition we made on sdk. These next steps are why that wasted space is important: we've lost some space due to the md superblock and due to the partition table, so our /dev/md6 is slightly smaller than /dev/sdj was. We're going to add /dev/md6 to the dedvol volume group and instruct LVM to move the 1.82TB of logical volume from /dev/sdj to /dev/md6. LVM can handle the filesystem being active while it does this.

pvcreate /dev/md6
vgextend dedvol /dev/md6
pvmove -v /dev/sdj

(doublecheck... you get the picture. I'd also run pvscan after pvcreate and again after vgextend to make sure things look right). This will begin the process of moving all the data allocated to /dev/sdj to /dev/md6 (specifically, the command moves everything off sdj, and md6 is the only place for it to go). Several hours later either this will complete or the system will lock up trying to read from sdj. If the system crashes, you can reboot and try pvmove without a device name to restart at the last checkpoint or just give up and reinstall from backups.

If we succeed, we remove /dev/sdj from the volume group, then remove it as a physical volume:

vgreduce dedvol /dev/sdj
pvremove /dev/sdj

Now, for the corruption-checking part. The tool for checking and fixing xfs is xfs_repair (fsck will run on an xfs filesystem but it does nothing at all). The bad news? It uses gigs of RAM per terabyte of filesystem, so hopefully you have a 64 bit server with a 64 bit kernel and the 64 bit xfs_repair binary (which might be named xfs_repair64) and at least 10GB of RAM+Swap (you should be able to use some of that leftover empty space in dedvol to create a swap volume, then mkswap that volume, then swapon that volume). The filesystem must be unmounted before running xfs_repair on it. Also, xfs_repair can detect and (attempt to) fix damage to the filesystem itself, but it may not detect damage to the data (for instance, something overwriting part of a directory inode versus something overwritten in the middle of a text file).
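
The swap and check steps described above can be sketched as follows. This is an illustration under assumptions, not a tested procedure: the logical volume name `repair_swap` and the filesystem device passed as the third argument are hypothetical, and `DRY_RUN=1` only prints the commands so they can be reviewed first. `xfs_repair -n` reports problems without modifying anything.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# add_swap_and_check VG SIZE FS_DEVICE
# Creates a temporary swap LV in VG, enables it, then runs a read-only
# xfs_repair pass on FS_DEVICE (which must be unmounted).
add_swap_and_check() {
    vg="$1"; size="$2"; fsdev="$3"
    swap_lv="repair_swap"              # hypothetical LV name
    run lvcreate -L "$size" -n "$swap_lv" "$vg" &&
    run mkswap "/dev/$vg/$swap_lv" &&
    run swapon "/dev/$vg/$swap_lv" &&
    run xfs_repair -n "$fsdev"         # -n: no-modify mode
}
```

For instance, `DRY_RUN=1 add_swap_and_check dedvol 10G /dev/dedvol/data` prints the four commands in order, where `/dev/dedvol/data` stands in for whatever the real logical volume is called.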

Finally, we need to buy a new /dev/sdj, install it, and add it to that degraded /dev/md6, keeping in mind that if we reboot the computer without sdj in it, it is possible sdk will move down to sdj and the new drive will be sdk instead (probably not, but best to be sure):

fdisk /dev/sdj

(Check to make sure that it isn't the drive we already partitioned and set up, then create a partition for md on it.)

mdadm /dev/md6 -a /dev/sdj1

(It is entirely possible that the errors could be due to raid and lvm duking it out over the content of sdj, rather than the drive actually failing (usually failing drives generate a lot of gibberish from the driver in dmesg rather than just Input/Output errors) but I'm not sure I'd risk it.)
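
The suspicion that sdj was used directly as an LVM physical volume can be checked read-only: LVM2 writes a label containing the signature `LABELONE` into one of the first four sectors of a physical volume (the second sector by default). A minimal sketch, where `has_lvm_label` is a hypothetical helper name:

```shell
#!/bin/sh
# has_lvm_label DEVICE: succeed if the LVM2 label signature "LABELONE"
# appears in the first four 512-byte sectors of DEVICE. Read-only.
has_lvm_label() {
    dd if="$1" bs=512 count=4 2>/dev/null | grep -q LABELONE
}
```

If `has_lvm_label /dev/sdj` succeeds and `has_lvm_label /dev/sdk` fails, that would be consistent with LVM having written to sdj behind the raid driver's back, though it is not proof on its own.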
