Linux – Rebuilding RAID 10 array with two failed drives

Tags: hard drive, hardware-raid, linux, raid10, storage

I have a dedicated server with 4 HDDs in a hardware RAID 10 configuration. It worked fine until yesterday, when it started crashing randomly every couple of minutes. I contacted my data center; they ran system diagnostics and found that one of the HDDs in the RAID 10 array was defective. They replaced the drive and the array started rebuilding automatically. They then booted the system in normal mode, and it worked normally for 15 minutes before it started crashing again. I ran a couple of diagnostics of my own, and when I checked the state of the physical drives with:

arcconf GETCONFIG 1 PD

I noticed that HDD 0,0 had S.M.A.R.T. errors. I reported this to my DC; they confirmed it and asked to swap that device for a new one, but suggested that I back up my data (~2 TB) first because data loss was very likely. I backed up my data and then they replaced the second HDD. After booting, they had to force-start the RAID controller, and the system booted in recovery mode. I think they swapped the wrong drive the first time, because it is highly unlikely for two drives in different mirror sets to fail at the same time, but that is another story…
My problem is that the second replaced HDD isn’t rebuilding itself. I tried to clear the metadata for that drive with:

arcconf TASK START 1 DEVICE 0 0 CLEAR

and then set the drive's state to hot spare with

arcconf SETSTATE 1 DEVICE 0 0 HSP LOGICALDRIVE 0

so that the rebuild would begin automatically, but without success.

My RAID 10 array consists of 4 HDDs: HDD 0,0 and HDD 0,1 form one mirror set, and HDD 0,2 and HDD 0,3 form the other.

The logical device state (output of arcconf getconfig 1 ld):

https://dl.dropbox.com/u/10839791/ld.txt

The physical drive state (output of arcconf GETCONFIG 1 PD):

https://dl.dropbox.com/u/10839791/pd.txt

Controller status:

https://dl.dropbox.com/u/10839791/controller.txt
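For anyone who wants to gather the same three reports in one step, here is a minimal sketch. The function name `collect_reports` and the `ARCCONF` variable are made up for illustration, and the `AD` option is only my assumption about how the controller report above was produced; the question does not say which command was used.

```shell
#!/bin/sh
# Hypothetical helper: save the controller, logical-drive and physical-drive
# reports to text files so they can be attached to a support request.
# ARCCONF can be overridden if the binary lives elsewhere (assumption).
ARCCONF="${ARCCONF:-arcconf}"

collect_reports() {
    # $1 = controller number, $2 = output directory (default: current dir)
    ctrl="$1"
    out="${2:-.}"
    mkdir -p "$out" || return 1
    # AD = adapter (controller) info; assumed, not confirmed by the question
    "$ARCCONF" GETCONFIG "$ctrl" AD > "$out/controller.txt" || return 1
    # LD = logical devices, PD = physical devices (as used in the question)
    "$ARCCONF" GETCONFIG "$ctrl" LD > "$out/ld.txt" || return 1
    "$ARCCONF" GETCONFIG "$ctrl" PD > "$out/pd.txt" || return 1
}
```

Running `collect_reports 1 ./raid-report` as root should leave `controller.txt`, `ld.txt` and `pd.txt` in `./raid-report`.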

My question is: is there any way to make that drive rebuild itself without losing any data?

Thanks.

Best Answer

I think the answer may be that the Adaptec controller will only rebuild one drive at a time.

I have an Adaptec 5805Z controller running a RAID 10 with 4 groups. We just replaced one drive in each group, and only one group is rebuilding at the moment. I know that all of the replacement drives are good because we ran badblocks on them; furthermore, they are definitely larger than the drives they are replacing.

@SkechBoy, do you know if your first group rebuild finished before the second one started?

Update: Just received confirmation from Adaptec that the "controller will usually rebuild a segment at a time". In other words, you have to wait for the first RAID group to be rebuilt before it will start rebuilding the second one.
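
If the controller really does rebuild one segment at a time, the practical step is to wait for the first rebuild to finish and then check whether the second one starts. Below is a rough sketch of a polling loop for that. It assumes that `arcconf GETSTATUS <controller>` mentions the word "Rebuild" while a rebuild task is running; I have not verified the exact output format on every firmware, so adjust the pattern to what your controller prints. The function names and the `ARCCONF` variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: poll the controller until no rebuild task is reported.
# Assumption: the GETSTATUS output contains "Rebuild" (any case) while a
# rebuild is in progress; adjust the grep pattern if yours differs.
ARCCONF="${ARCCONF:-arcconf}"

rebuild_running() {
    # $1 = controller number; succeeds while a rebuild appears in the status
    "$ARCCONF" GETSTATUS "$1" | grep -qi 'rebuild'
}

wait_for_rebuild() {
    # $1 = controller number, $2 = seconds between polls (default 60)
    while rebuild_running "$1"; do
        sleep "${2:-60}"
    done
    echo "no rebuild task reported on controller $1"
}
```

For example, `wait_for_rebuild 1 300` checks controller 1 every five minutes. Once it returns, run `arcconf GETCONFIG 1 LD` again to see whether the second mirror set has started rebuilding or is still degraded.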