Linux – Bad Blocks Exist in Virtual Device PERC H700 Integrated

bad-blocks, dell-perc, h700, linux, megacli

I have a Dell server with a PERC H700 Integrated controller. I've built a RAID 5 array out of 12 hard drives, and the virtual device is in Optimal state, but I see errors like these under Linux:

sd 0:2:0:0: [sda] Unhandled error code
sd 0:2:0:0: [sda]  Result: hostbyte=0x07 driverbyte=0x00
sd 0:2:0:0: [sda] CDB: cdb[0]=0x88: 88 00 00 00 00 07 22 50 bd 98 00 00 00 08 00 00
end_request: I/O error, dev sda, sector 30640487832
sd 0:2:0:0: [sda] Unhandled error code
sd 0:2:0:0: [sda]  Result: hostbyte=0x07 driverbyte=0x00
sd 0:2:0:0: [sda] CDB: cdb[0]=0x88: 88 00 00 00 00 07 22 50 bd 98 00 00 00 08 00 00
end_request: I/O error, dev sda, sector 30640487832
sd 0:2:0:0: [sda] Unhandled error code
sd 0:2:0:0: [sda]  Result: hostbyte=0x07 driverbyte=0x00
sd 0:2:0:0: [sda] CDB: cdb[0]=0x88: 88 00 00 00 00 07 22 50 bc e0 00 00 01 00 00 00
end_request: I/O error, dev sda, sector 30640487648

But all disks are in Firmware state: Online, Spun Up.
Also, there is not a single ATA read or write error on any disk in the array (I checked them all with smartctl -a -d sat+megaraid,N -H /dev/sda). The only strange thing is in the output of

megacli:
megacli -LDInfo -L0 -a0
...
Bad Blocks Exist: Yes
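
For completeness, the per-disk SMART check was done with a loop along these lines (a rough sketch; the megaraid device IDs 0-11 are an assumption and may not match your controller's numbering):

for N in $(seq 0 11); do
    echo "=== disk $N ==="
    smartctl -a -d sat+megaraid,$N /dev/sda | egrep -i 'reallocated|pending|uncorrect|error'
done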

How can there be bad blocks in a virtual drive that is in Optimal state, with no disk broken or reporting even a single error? I ran a Consistency Check, but it finished successfully and the errors still appear in dmesg. Could someone help me figure out what is wrong with my RAID?

Best Answer

The "Bad blocks exist" indicator of MegaCLI refers to the Soft Bad Block Management table which works as follows (quote from the MegaRaid docs):

If the CU detects a media error on the source drive during rebuild, it initiates a sector read for that block. If the sector read fails, the CU adds entries to the Soft Bad Block Management (SBBM) table, writes this table to the target drive, and displays an error message.

Additional error messages are displayed if the SBBM table is 80% full or 100% full. If the SBBM table is completely full, the rebuild operation is aborted, and the drive is marked as FAIL.

The SBBM table will not necessarily contain the same "bad" markings that SMART reports, because the criteria and detection methods are very different.
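
If your MegaCLI build supports it (the exact option name is an assumption here and varies between versions), you can also dump the controller's bad block table for the virtual drive to see which LBAs it has recorded:

megacli -GetBbtEntries -L0 -a0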

Take a look at which of your drives is reporting errors using megacli -LDPDInfo -aAll and give that drive a closer examination.
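
To narrow it down quickly, you can filter the per-drive counters out of that output; something along these lines (the field names shown are typical MegaCLI output and may vary by version):

megacli -LDPDInfo -aAll | egrep -i 'slot number|firmware state|media error|other error|predictive failure'

A non-zero Media Error Count or Predictive Failure Count on one of the members is usually the drive worth examining more closely with smartctl.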