Dell PowerEdge R720 – Corrupted RAID

Tags: corruption, fsck, hard-drive, raid5, ubuntu-12.04

Apologies in advance for the lengthy question.

We have a Dell PowerEdge R720 server with:

  • 2 x 136GB SAS drives in RAID 1 for the OS (Ubuntu Server 12.04)
  • 6 x 3TB SATA drives in RAID 5 for data

A few days ago we started getting errors when trying to access files on the large RAID 5 partition. We rebooted the server and got a message that the RAID controller had found a foreign configuration. We've had this before and just needed to use Dell's RAID configuration utility to import the foreign configuration (a rough command-line equivalent is sketched after the message below). Last time this worked, but this time it started doing a disk check and then we got this:

FSCK has returned the following:

"/dev/sdb1 inode 364738 has a bad extended attribute block 7

/dev/sdb1 unexpected inconsistency run fsck manually (i.e. without -a or -p options)

MOUNTALL fsck /ourdatapartition [1019] terminated with status 4

MOUNTALL filesystem has errors /ourdatapartition

errors were found while checking the disk drive for /ourdatapartition

Press F to fix errors, I to Ignore or M for Manual Recovery"
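
For reference, the foreign-config import mentioned above can also be done from the OS. A rough sketch using MegaCLI (the binary name and the adapter number -a0 are assumptions for the PERC controller in this box; Dell's OpenManage tools can do the equivalent):

    # Sketch only: list and import a foreign configuration from the OS
    sudo megacli -CfgForeign -Scan -a0
    sudo megacli -CfgForeign -Import -a0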

We pressed F to try to fix the errors as prompted, but it eventually failed with:

Inode 275841084, i_blocks is 167080, should be 0. Fix? yes

Inode 275841141 has an invalid extent node (blk 2206761006, lblk 0)
Clear? yes

Inode 275841141, i_blocks is 227872, should be 0. Fix? yes

Inode 275842303 has an invalid extent node (blk 2206760975, lblk 0)
Clear? yes

....


Error storing directory block information (inode=275906766, block=0, num=2699516178):         Memory allocation failed

/dev/sdb1: ***** FILE SYSTEM WAS MODIFIED *****
e2fsck: aborted

/dev/sdb1: ***** FILE SYSTEM WAS MODIFIED *****
mountall: fsck /ourdatapartition [1286] terminated with status 9
mountall: Unrecoverable fsck error: /ourdatapartition
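
For completeness, the equivalent manual check the boot message asks for (without -a or -p) is roughly the following, assuming the data filesystem is ext3/ext4 (the e2fsck output suggests it is) and is not mounted:

    sudo umount /ourdatapartition   # the filesystem must not be mounted while checking
    sudo e2fsck -f -y /dev/sdb1     # force a full check, answer yes to every repair prompt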

We noticed that one of the drive lights was not lit at all and thought that drive might have failed and be the cause of the problem. We replaced it with a spare and pressed F to repair again, but we just keep getting the same error as above.

In the RAID configuration utility, all drives show as "online" and "optimal".
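
For what it's worth, the array state can also be confirmed from the OS side, assuming Dell OpenManage Server Administrator is installed (the controller number 0 is an assumption):

    sudo omreport storage vdisk controller=0   # virtual disk (RAID 5 volume) state
    sudo omreport storage pdisk controller=0   # individual physical disk states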

We do have this data on another replicated server, so we're not worried about "recovering" anything; we just want to get the system back online ASAP.

The server has either 32 GB or 64 GB of memory (I can't remember off the top of my head), but either way, with a ~14 TB RAID volume, I suspect that may still not be enough for fsck.
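
If memory really is the bottleneck, e2fsck can apparently be pointed at an on-disk scratch directory via /etc/e2fsck.conf instead of keeping all of its tables in RAM; a minimal sketch (the cache directory path is our own choice):

    [scratch_files]
    directory = /var/cache/e2fsck

The directory has to exist first (e.g. sudo mkdir -p /var/cache/e2fsck), and the check will run slower, but it may avoid the "Memory allocation failed" abort.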

Thanks

EDIT – As suggested, I checked the memory usage while fsck was running. After 2 or 3 minutes it looked like this, using up nearly all of our server's memory:

[Screenshot: memory usage during fsck]

When it failed after 5 minutes or so with the error in my post, the memory immediately freed up again:

[Screenshot: memory usage after the fsck error]
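
For reference, we were just watching overall memory from a second SSH session while the check ran, roughly like this:

    watch -n 5 free -m   # sample system-wide memory use every 5 seconds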

EDIT 2 – I ran a bad-block check with sudo badblocks -nvs /dev/sdb1, but it came back with: Pass completed, 0 bad blocks found. (0/0/0 errors)

Best Answer

It really does look like that filesystem is hosed. As you have the data on another server and you don't need to recover data from the old filesystem, you should be able to newfs the partition to create a blank filesystem.

mkfs /dev/sdb1

and be done with it.
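
If you want to be explicit about the filesystem type (plain mkfs defaults to ext2), something along these lines should work; the mount point is taken from the question and the label is an assumption:

    sudo mkfs.ext4 -L data /dev/sdb1        # recreate an ext4 filesystem on the partition
    sudo blkid /dev/sdb1                    # note the new UUID if /etc/fstab mounts by UUID
    sudo mount /dev/sdb1 /ourdatapartition  # remount the data volume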
