Proper procedure for replacing failing drive in Xserve RAID RAID5 set w/hot spare

appleraid

I've got a five-drive RAID-5 set (with a sixth drive as a hot spare) in an Xserve RAID running the 1.5/1.50f firmware. One of the drives in the RAID-5 set is showing an amber/orange status light and has been logging occasional errors like the following:

Timestamp:  11/10/10 10:34:53 AM
Priority:   Warning
Controller: Upper Controller
Type:   112
Event ID:   1000
Event:  Disk 5 Reported An Error. COMMAND:0x35 ERROR:0x10 STATUS:0x51 LBA:0x19B80
Description:    The drive reported an ATA error. This is a failure in the communication from the RAID Controller to the drive.
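
In case it helps, here's my rough decoding of those hex values against the generic ATA register definitions; this is my reading of the ATA spec, not anything from the Xserve RAID documentation, so treat it as a sketch:

```python
# Rough decoder for the COMMAND/ERROR/STATUS bytes in the RAID Admin event.
# Bit meanings are taken from the generic ATA register definitions, not
# from anything Xserve-specific.

ATA_COMMANDS = {0x35: "WRITE DMA EXT"}
ERROR_BITS = {0x10: "IDNF (sector ID not found)",
              0x40: "UNC (uncorrectable data)",
              0x04: "ABRT (command aborted)"}
STATUS_BITS = {0x40: "DRDY (drive ready)",
               0x10: "DSC (seek complete)",
               0x01: "ERR (error occurred)"}

def decode(command, error, status, lba):
    cmd = ATA_COMMANDS.get(command, hex(command))
    err = [name for bit, name in ERROR_BITS.items() if error & bit]
    sta = [name for bit, name in STATUS_BITS.items() if status & bit]
    print(f"Command: {cmd}")
    print(f"Error:   {', '.join(err) or hex(error)}")
    print(f"Status:  {', '.join(sta) or hex(status)}")
    print(f"LBA:     {lba:#x} ({lba})")

# Values from the event above
decode(command=0x35, error=0x10, status=0x51, lba=0x19B80)
```

If that decoding is right, the drive is reporting errors on writes around a specific LBA, which fits the amber status light.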

I have double-checked the drives in RAID Admin and, since the drive is only in a warning state, the hot spare has not been pulled into the RAID set yet. As this is an old drive, I'd like to replace it before it fails outright. I have a current, full backup of the data, but I want to make sure I understand the process correctly.

I understand the "Installing or Replacing an Apple Drive Module" section of http://manuals.info.apple.com/en/XserveRAID_UserGuide.PDF, but neither it nor RAID Admin's built-in help describes what happens when you replace a drive in a RAID set that has a hot spare. When I pull the drive and replace it, will the set rebuild onto the newly inserted drive, or onto the hot spare? If it uses the hot spare, will the spare revert to being a hot spare once the new drive is inserted, or will it permanently become a member of the RAID set and need to be moved to the original drive's slot? Or should I just pull out the hot spare, pull out the failing drive, and pop the hot spare into the failing drive's slot?

Best Answer

According to the manual at http://manuals.info.apple.com/en_US/RAIDAdmin1.2_121406.pdf, any drives not part of a disk group or array are treated as global hot spares (see the "Creating RAID Array" section), and will be used automatically to rebuild when a member drive is lost or fails.

It sounds like your drive hasn't actually failed yet, but as others have mentioned, if you pull it, that should force the Xserve RAID to start rebuilding onto the spare drive. During the rebuild you can't afford to lose any of the other drives, or you'll lose the data. I'm not familiar with RAID Admin specifically, but it should give you some kind of monitoring view so you can see how far along the rebuild is.

In my Dell MD3000i system, when a drive fails or is pulled, the hot spare kicks in immediately; when a replacement drive is inserted, the controller performs what is known as a "copy-back" after the rebuild, replicating the hot spare's contents onto the replacement, at which point the spare goes back to being a spare (the two behaviors are sketched below). Based on what I've read in the manual, though, it looks like the Xserve RAID makes the spare drive a permanent member of the array, so my best guess is that your replacement drive will end up becoming the new hot spare, since it isn't part of the array:

"The RAID controller that controls the affected array will automatically attempt to reconstruct the data in order to return the system to a protected state. For example, if a hot spare drive is available when a drive fails in an array, the controller takes the available drive and integrates it into the array. The controller then rebuilds the RAID array using the new drive."