HP Storageworks MSA60, one failed disk replaced, all 12 bays flashing rapidly

Tags: hp, hp-proliant, hp-smart-array, raid6

I replaced a faulty disk yesterday afternoon, and the rebuild process started (RAID6). All disks were previously 750GB, but the only disk I could get is 1TB. Right now, almost 18 hours after replacing the disk, all 12 disks are flashing rapidly (yellow) on the lower light indicator. Does this mean it's still rebuilding, or is something terribly wrong?

Best Answer

I haven't seen that. When our MSA60 rebuilds, I think only the new disk flashes, but I might be wrong.

I suggest you start by checking the array status in the HP Array Configuration Utility, and if that doesn't make sense, call HP for assistance.

Short version

Leveling is the process that runs after rebuilding. If your array is leveling, you are just as safe as you were before the disk failed.

Long version

When you lose a disk, the EVA will automatically try to use any free space on the remaining healthy disks to create a redundant copy of the data that used to be on that disk. If you had one volume group with one big virtual disk with Vraid5 parity and you lost a single disk, the EVA will regenerate the data that used to be on the failed disk in the free space on the first disk. If there isn't enough space there, it will use 2, 3 or more disks, but you will get a redundant copy of your data in the shortest time possible. How long that takes, I cannot tell you. But you will be back to the "you can lose a disk and not lose your data" state in a very short time. That is, of course, if you have enough free space on your disks.

You mentioned sparing. I am not familiar with this term, but I hope you are talking about "failure protection level", which is the space that the EVA will reserve for an emergency like the one you are describing. Single protection level means that it will reserve the size of two of your largest disks, and double means the size of four disks. The EVA will not report this space as free. So if you have single protection level and are using 95% with 16 1TB disks, you will have 2TB reserved, and are only using 95% of the remaining 14TB. That is 13.3TB used, and 2.7TB not holding data (counting the 2TB reserve). And if you take the Vraid5 into account, those 13.3TB are 10.64TB of usable space and 2.66TB spent on parity.
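The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the numbers in this answer, not anything the EVA exposes: the function and parameter names are made up for illustration, and the 4 data + 1 parity split is an assumption that matches the 10.64TB / 2.66TB figures.

```python
def eva_capacity(disk_count, disk_tb, protection_disks, fill_ratio,
                 data_disks=4, parity_disks=1):
    """Reproduce the capacity figures from the answer (all values in TB).

    protection_disks: 2 for single protection level, 4 for double.
    data_disks/parity_disks: assumed Vraid5 layout (4 data + 1 parity).
    """
    raw = disk_count * disk_tb                 # total raw space
    reserved = protection_disks * disk_tb      # held back for failure protection
    allocatable = raw - reserved               # space the EVA reports as available
    used = allocatable * fill_ratio            # space occupied by Vraid5 data + parity
    not_holding_data = raw - used              # includes the protection reserve
    usable = used * data_disks / (data_disks + parity_disks)
    parity = used - usable
    return {
        "raw_tb": round(raw, 2),
        "reserved_tb": round(reserved, 2),
        "used_tb": round(used, 2),
        "not_holding_data_tb": round(not_holding_data, 2),
        "usable_tb": round(usable, 2),
        "parity_tb": round(parity, 2),
    }


# 16 x 1TB disks, single protection level (2 disks), 95% used
figures = eva_capacity(disk_count=16, disk_tb=1.0, protection_disks=2, fill_ratio=0.95)
print(figures)
```

With the values from the example (16 disks of 1TB, single protection, 95% used) this gives 13.3TB used, 10.64TB usable and 2.66TB of parity.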

Once the EVA has made a redundant copy on as few disks as possible, it will start leveling (I personally prefer to call it "balancing") the data. This process moves the data around so that all your disks end up with approximately the same amount of data. It takes an awfully long time, especially if your usage is quite high, but you are safe if you have another failure during this time.

Go into Command View and check the status of the volume group. If it says that it is leveling, you are just as safe as you were before the failure.

You are now down to 15TB of raw disk space and you are using 13.3TB. The EVA wants to maintain a single protection level but it cannot reserve 2TB (you only have 1.7TB unused), so it is probably reporting the requested protection level as single, and the actual protection level as none. It may also be reporting your usage as going over 100%, since you are using 13.3TB and to satisfy the single protection requirement you should be under 13TB (15TB total - 2TB reserved for single protection).
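The same check can be written out as a small Python sketch. Again, the function and its parameter names are hypothetical and only mirror the reasoning in this answer; the EVA's real accounting may differ in detail.

```python
def protection_after_failure(disk_count, disk_tb, failed_disks, used_tb,
                             protection_disks):
    """Check whether the protection reserve still fits after disk failures.

    All sizes are in TB. protection_disks is 2 for single protection level
    and 4 for double, as described in the answer.
    """
    raw = (disk_count - failed_disks) * disk_tb   # raw space left after the failure
    unused = raw - used_tb                        # space not occupied by data
    reserve_needed = protection_disks * disk_tb   # what the EVA wants to hold back
    max_used = raw - reserve_needed               # usage ceiling that keeps protection
    return {
        "raw_tb": round(raw, 2),
        "unused_tb": round(unused, 2),
        "reserve_needed_tb": round(reserve_needed, 2),
        "max_used_tb": round(max_used, 2),
        "reserve_met": unused >= reserve_needed,
    }


# 16 x 1TB disks, one failed, 13.3TB in use, single protection level
state = protection_after_failure(disk_count=16, disk_tb=1.0, failed_disks=1,
                                 used_tb=13.3, protection_disks=2)
print(state)
```

For the example above this reports 1.7TB unused against a 2TB reserve, so the reserve is not met, and usage would have to drop below 13TB to restore single protection.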

This still means that you can lose another disk and still have healthy storage. If you lose a second disk, it will be the Vraid5 redundancy that protects your data (though you may see degraded performance). And of course, if you are lucky, you may survive a third and a fourth disk failure, as long as they are not in the same Vraid stripe (EVA's Vraid5 is more like RAID5+0, with stripes spanning 5 disks).

Update: Unrelated to your question, but the latest FATA firmware update has a "Fix for self-initiated resets that may occur under rare circumstances". Believe me, it does not feel nice to see disks get thrown out of a volume group for no reason.

Update 2: Updated because single protection level means the space for two disks.
