ZFS pool error: how to determine which drive failed in the past

hard-drive solaris zfs

I had been copying data from my pool so that I could rebuild it with a different version, moving away from Solaris 11 to one that is portable between FreeBSD/OpenIndiana etc. It was copying at 20 MB/s the other day, which is about all my desktop drive can handle writing from the network. Suddenly last night it dropped to 1.4 MB/s. I ran zpool status today and got this:

   pool: store
   state: ONLINE
   status: One or more devices has experienced an unrecoverable error.  An
          attempt was made to correct the error.  Applications are unaffected.
   action: Determine if the device needs to be replaced, and clear the errors
          using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
   scan: none requested
   config:

    NAME          STATE     READ WRITE CKSUM
    store         ONLINE       0     0     0
      raidz1-0    ONLINE       0     0     0
        c8t3d0p0  ONLINE       0     0     2
        c8t4d0p0  ONLINE       0     0    10
        c8t2d0p0  ONLINE       0     0     0

It is currently a 3 x 1 TB drive array. What tools would best be used to determine what the error was and which drive is failing?

Per the admin doc:

 The second section of the configuration output displays error statistics. These errors are divided into three categories:

READ – I/O errors occurred while issuing a read request.

WRITE – I/O errors occurred while issuing a write request.

CKSUM – Checksum errors. The device returned corrupted data as the result of a read request.

It says low counts could be anything from a power flux to a disk event, but gives no suggestions as to what tools to use to check and determine the cause.

Best Answer

Checksum errors occur when data is read from disk but doesn't match the expected checksum; a noisy SATA cable can cause this corruption either during writing (data corrupted on the way to the disk) or reading (data corrupted on the way from the disk). Although it could be a failing disk, it was more likely caused by a loose or pinched SATA data cable. Try reseating the cables on both ends or swapping in another known-good cable.
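To see what the underlying errors actually were, you can look at the fault management logs. A minimal sketch, assuming a Solaris 11 system with the standard FMA tools available:

    # Summarize logged error reports (ereports) since boot
    fmdump -e

    # Full detail for each ereport, including the device path and error class
    fmdump -eV | less

    # Any diagnosed faults (i.e. devices FMA has already decided are bad)
    fmadm faulty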

As for determining which disk, that depends somewhat on what hardware you're using. For Sun-branded hardware, cfgadm -alv should give you hard drive serial numbers to match to their logical names. If you're using SATA ports on the motherboard, the port numbers correspond to the target IDs (2, 3, 4), so the first port is probably t0. Most disks have the WWN printed on the label; you can discover it by enabling multipathing with pfexec stmsboot -e (see: this question), which will use the c8tWWNxxxxxxxxd0p0 format instead of c8tNd0p0, but probably only if you're using a SAS controller.
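A quick way to map the cXtYdZ names to physical drives and to see which device the controller has been complaining about (a sketch, assuming the standard Solaris utilities; the device names are taken from your zpool status output):

    # List attachment points verbosely; on Sun hardware this includes
    # drive model/serial information for each logical device name
    cfgadm -alv

    # Per-device error counters plus vendor, product and serial number --
    # a disk with climbing Soft/Hard/Transport error counts is the suspect
    iostat -En

    # Narrow the report to the two devices showing CKSUM errors
    iostat -En c8t3d0 c8t4d0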

Your output shows ZFS was able to correct the error by reconstructing the data from the other two disks, restoring redundancy. It's just letting you know something bad happened; at this point the fault management system has not yet decided the disk has had sufficient errors to warrant offlining it (which would result in a 'degraded' pool status). I'd give it a scrub to make sure every byte reads cleanly. More info on error ZFS-8000-9P here.
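If the cabling checks out, running the scrub and then resetting the counters might look like this (a sketch; store is your pool name):

    # Read and verify every block in the pool; anything found bad is
    # repaired from the remaining redundancy
    pfexec zpool scrub store

    # Watch progress and see whether new READ/WRITE/CKSUM errors appear
    zpool status -v store

    # Once the scrub completes cleanly, reset the error counters
    pfexec zpool clear store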