Sparing level on HP EVA 4000

hp, raid, storage, storage-area-network

One of the disks of our EVA4000 died today. This diskgroup (all volumes vraid5 with sparing level 1 and almost no space left for more volumes, 1TiB drives) is being rebuilt with "spare space" right now, and it will take at least 15 hours to do the leveling/rebuilding.

We can't get a new disk until Friday. So, the question is, what would happen if another disk dies before the leveling completes? Would we lose data? And after that, how many additional disks could die before losing data? 1 or 2?

In "usual" RAID, we would be vulnerable to data loss while the rebuild takes place, but in this case the space reserved for sparing is twice the size of the largest disk, so at the very least the effect should be the same as having two spares.

Thanks in advance.

Update: I have found some interesting threads about this, but I still can't answer the question, so I'm starting a bounty.

http://blog.thestoragearchitect.com/2008/10/27/understanding-eva/

http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&url=http%3A%2F%2Fwww.experts-exchange.com%2FStorage%2FStorage_Technology%2FQ_25548177.html (Experts Exchange question, reached via Google).

Best Answer

Short version

Leveling is the process that runs after the rebuild has finished. If your array is leveling, you are just as safe as you were before the disk failed.

Long version

When you lose a disk, the EVA will automatically try to use any free space on the remaining healthy disks to recreate a redundant copy of the data that used to be on that disk. If you had one volume group with one big virtual disk with Vraid5 parity and you lost a single disk, the EVA will regenerate the data that used to be on the failed disk into the free space of one of the remaining disks. If that disk doesn't have enough space it will use 2, 3 or more disks, but you will get a redundant copy of your data in the shortest time possible. How long that takes, I cannot tell you. But you will be back to the "you can lose a disk and not lose your data" state in a very short time. That is, of course, if you have enough free space on your disks.

You mentioned sparing. I am not familiar with this term, but I assume you are talking about the "failure protection level", which is the space that the EVA reserves for an emergency like the one you are describing. Single protection level means that it reserves space equal to two of your largest disks; double reserves space equal to four. The EVA will not report this space as free. So if you have single protection level and are using 95% with 16 1TB disks, you will have 2TB reserved, and are using 95% of the remaining 14TB. That is 13.3TB used and 2.7TB of raw space unused (0.7TB free plus the 2TB reserve). And if you take Vraid5 into account, the 13.3TB breaks down into 10.64TB of usable space and 2.66TB consumed by parity.
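The arithmetic in that example can be checked with a short sketch. It only encodes the simplified model described in this answer (two disks' worth of reserve per protection level, one parity block for every four data blocks); it is not HP's actual accounting, and the function and key names are made up for illustration.

```python
def eva_capacity(disks, disk_tb, protection_level, used_fraction):
    """Simplified capacity model from this answer, not HP's real accounting."""
    raw_tb = disks * disk_tb
    # Assumption from the answer: single = 2 disks reserved, double = 4 disks.
    reserved_tb = 2 * protection_level * disk_tb
    allocatable_tb = raw_tb - reserved_tb
    used_tb = allocatable_tb * used_fraction
    # Assumption: Vraid5 stripes hold 4 data blocks plus 1 parity block.
    data_tb = used_tb * 4 / 5
    parity_tb = used_tb / 5
    return {
        "raw_tb": raw_tb,
        "reserved_tb": reserved_tb,
        "allocatable_tb": allocatable_tb,
        "used_tb": used_tb,
        "data_tb": data_tb,
        "parity_tb": parity_tb,
        # Raw space not holding data or parity; includes the reserve.
        "unused_raw_tb": raw_tb - used_tb,
    }


# The example from the answer: 16 x 1TB disks, single protection, 95% used.
cap = eva_capacity(disks=16, disk_tb=1.0, protection_level=1, used_fraction=0.95)
for name, value in cap.items():
    print(f"{name}: {value:.2f} TB")
```

Changing `protection_level` to 2 shows how much less space is allocatable with double protection under the same assumptions.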

Once the EVA has made a redundant copy on as few disks as possible, it will start leveling (I personally prefer to call it "balancing") the data. This process involves moving the data around so that all your disks end up with approximately the same amount of data. It takes an awfully long time, especially if your usage is quite high, but you are safe if you have another failure during this time.

Go into Command View and check the status of the volume group. If it says that it is leveling, you are just as safe as you were before the failure.

You are now down to 15TB of raw disk space and you are using 13.3TB. The EVA wants to maintain a single protection level but it cannot reserve 2TB (you only have 1.7TB unused), so it is probably reporting the requested protection level as single and the actual protection level as none. It may also be reporting your usage as going over 100%, since you are using 13.3TB and to satisfy the single protection requirement you should be under 13TB (15TB total - 2TB reserved for single protection).
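Here is a rough sketch of the same numbers after the failure, again using only the simplified model from this answer. The function name and the returned keys are hypothetical, and how Command View actually computes and displays usage is an assumption on my part.

```python
def protection_after_failure(disks, failed, disk_tb, used_tb, protection_level):
    """Can the requested protection reserve still be met after disks fail?"""
    raw_tb = (disks - failed) * disk_tb
    unused_tb = raw_tb - used_tb
    # Assumption from the answer: each protection level needs 2 disks' worth.
    reserve_needed_tb = 2 * protection_level * disk_tb
    # Space you may occupy while still leaving the full reserve untouched.
    max_used_tb = raw_tb - reserve_needed_tb
    return {
        "raw_tb": raw_tb,
        "unused_tb": unused_tb,
        "reserve_needed_tb": reserve_needed_tb,
        "reserve_met": unused_tb >= reserve_needed_tb,
        "max_used_tb": max_used_tb,
        "usage_pct": 100 * used_tb / max_used_tb,
    }


# 16 x 1TB disks, one failed, 13.3TB in use, single protection requested.
state = protection_after_failure(
    disks=16, failed=1, disk_tb=1.0, used_tb=13.3, protection_level=1
)
print(f"unused: {state['unused_tb']:.1f} TB, "
      f"reserve met: {state['reserve_met']}, "
      f"usage: {state['usage_pct']:.1f}%")
```

With these inputs the reserve cannot be met and the usage figure lands slightly above 100%, which matches the reasoning above.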

Even so, you can lose another disk and still have healthy storage. If you lose a second disk, it will be the Vraid5 redundancy that protects your data (though you may see a degradation in performance). And of course, if you are lucky you may survive a third and a fourth disk failure, as long as they are not in the same Vraid stripe (EVA's Vraid5 is more like RAID5+0, with stripes spanning 5 disks).

Update: Unrelated to your question, but the latest FATA firmware update has a "Fix for self-initiated resets that may occur under rare circumstances". Believe me, it does not feel nice to see disks get thrown out of a volume group for no reason.

Update 2: Updated because single protection level means the space for two disks.
