Ubuntu – Upgraded Ubuntu, all drives in one zpool marked unavailable

Ubuntuzfsonlinux

I just upgraded Ubuntu 14.04, and I had two ZFS pools on the server. There was some minor issue with me fighting with the ZFS driver and the kernel version, but that's worked out now. One pool came online, and mounted fine. The other didn't. The main difference between the tool is one was just a pool of disks (video/music storage), and the other was a raidz set (documents, etc)

I've already attempted exporting and re-importing the pool, to no avail, attempting to import gets me this:

root@kyou:/home/matt# zpool import -fFX -d /dev/disk/by-id/
   pool: storage
     id: 15855792916570596778
  state: UNAVAIL
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
   see: http://zfsonlinux.org/msg/ZFS-8000-5E
 config:

        storage                                      UNAVAIL  insufficient replicas
          raidz1-0                                   UNAVAIL  insufficient replicas
            ata-SAMSUNG_HD103SJ_S246J90B134910       UNAVAIL
            ata-WDC_WD10EARS-00Y5B1_WD-WMAV51422523  UNAVAIL
            ata-WDC_WD10EARS-00Y5B1_WD-WMAV51535969  UNAVAIL

The symlinks for those in /dev/disk/by-id also exist:

root@kyou:/home/matt# ls -l /dev/disk/by-id/ata-SAMSUNG_HD103SJ_S246J90B134910* /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51*
lrwxrwxrwx 1 root root  9 May 27 19:31 /dev/disk/by-id/ata-SAMSUNG_HD103SJ_S246J90B134910 -> ../../sdb
lrwxrwxrwx 1 root root 10 May 27 19:15 /dev/disk/by-id/ata-SAMSUNG_HD103SJ_S246J90B134910-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 May 27 19:15 /dev/disk/by-id/ata-SAMSUNG_HD103SJ_S246J90B134910-part9 -> ../../sdb9
lrwxrwxrwx 1 root root  9 May 27 19:15 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51422523 -> ../../sdd
lrwxrwxrwx 1 root root 10 May 27 19:15 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51422523-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 10 May 27 19:15 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51422523-part9 -> ../../sdd9
lrwxrwxrwx 1 root root  9 May 27 19:15 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51535969 -> ../../sde
lrwxrwxrwx 1 root root 10 May 27 19:15 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51535969-part1 -> ../../sde1
lrwxrwxrwx 1 root root 10 May 27 19:15 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51535969-part9 -> ../../sde9

Inspecting the various /dev/sd* devices listed, they appear to be the correct ones (The 3 1TB drives that were in a raidz array).

I've run zdb -l on each drive, dumping it to a file, and running a diff. The only difference on the 3 are the guid fields (Which I assume is expected). All 3 labels on each one are basically identical, and are as follows:

version: 5000
name: 'storage'
state: 0
txg: 4
pool_guid: 15855792916570596778
hostname: 'kyou'
top_guid: 1683909657511667860
guid: 8815283814047599968
vdev_children: 1
vdev_tree:
    type: 'raidz'
    id: 0
    guid: 1683909657511667860
    nparity: 1
    metaslab_array: 33
    metaslab_shift: 34
    ashift: 9
    asize: 3000569954304
    is_log: 0
    create_txg: 4
    children[0]:
        type: 'disk'
        id: 0
        guid: 8815283814047599968
        path: '/dev/disk/by-id/ata-SAMSUNG_HD103SJ_S246J90B134910-part1'
        whole_disk: 1
        create_txg: 4
    children[1]:
        type: 'disk'
        id: 1
        guid: 18036424618735999728
        path: '/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51422523-part1'
        whole_disk: 1
        create_txg: 4
    children[2]:
        type: 'disk'
        id: 2
        guid: 10307555127976192266
        path: '/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51535969-part1'
        whole_disk: 1
        create_txg: 4
features_for_read:

Stupidly, I do not have a recent backup of this pool. However, the pool was fine before reboot, and Linux sees the disks fine (I have smartctl running now to double check)

So, in summary:

  • I upgraded Ubuntu, and lost access to one of my two zpools.
  • The difference between the pools is the one that came up was JBOD, the other was zraid.
  • All drives in the unmountable zpool are marked UNAVAIL, with no notes for corrupted data
  • The pools were both created with disks referenced from /dev/disk/by-id/.
  • Symlinks from /dev/disk/by-id to the various /dev/sd devices seems to be correct
  • zdb can read the labels from the drives.
  • Pool has already been attempted to be exported/imported, and isn't able to import again.

Is there some sort of black magic I can invoke via zpool/zfs to bring these disks back into a reasonable array? Can I run zpool create zraid ... without losing my data? Is my data gone anyhow?

Best Answer

After lots and lots more Googling on this specific error message I was getting:

root@kyou:/home/matt# zpool import -f storage
cannot import 'storage': one or more devices are already in use

(Included here for posterity and search indexes) I found this:

https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/VVEwd1VFDmc

It was using the same partitions and was adding them to mdraid during any boot before ZFS was loaded.

I remembered seeing some mdadm lines in dmesg and sure enough:

root@kyou:/home/matt# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md126 : active raid5 sdd[2] sdb[0] sde[1]
      1953524992 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

These drives were, once upon a time, part of a software raid5 array. For some reason, during the upgrade, it decided to rescan the drives, and find that the drives were once part of an md array, and decided to recreate it. This was verified with:

root@kyou:/storage# mdadm --examine /dev/sd[a-z]

Those three drives showed a bunch of information. For now, stopping the array:

root@kyou:/home/matt# mdadm --stop /dev/md126
mdadm: stopped /dev/md126

And re-running import:

root@kyou:/home/matt# zpool import -f storage

has brought the array back online.

Now I make a snapshot of that pool for backup, and run mdadm --zero-superblock on them.