Linux – LSI RAID controller errors on DB import – How to troubleshoot

hardware-raid linux oracle rhel5

We're running an import of a database dump on an Oracle system (RHEL 5.9, kernel 2.6.18-348.6.1.el5). The import does not complete, eventually erroring out with:

ORA-15080: synchronous I/O operation to a disk failed
WARNING: failed to write mirror side 1 of virtual extent 248 logical extent 0 of file 280 in group 1 on disk 1 allocation unit 986
Errors in file /u01/app/oracle/diag/rdbms/dbprod/DBPROD/trace/DBPROD_lgwr_24520.trc:
ORA-00345: redo log write error block 509314 count 2023
ORA-00312: online log 1 thread 1: '+DATA/dbprod/redo01.log'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk

There are corresponding errors in the ring buffer and /var/log/messages:

Jun 12 18:54:42 db1-test kernel: megasas: build_ld_io  error, sge_count = 51
Jun 12 18:54:42 db1-test kernel: megasas: Err returned from build_and_issue_cmd
Jun 12 18:54:42 db1-test kernel: megasas: build_ld_io  error, sge_count = 51
Jun 12 18:54:42 db1-test kernel: megasas: Err returned from build_and_issue_cmd
Jun 12 18:54:42 db1-test kernel: megasas: build_ld_io  error, sge_count = 51
Jun 12 18:54:42 db1-test kernel: megasas: Err returned from build_and_issue_cmd
Jun 12 18:54:42 db1-test kernel: sd 0:2:1:0: timing out command, waited 360s
Jun 12 18:54:42 db1-test kernel: sd 0:2:1:0: Unhandled error code
Jun 12 18:54:42 db1-test kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
Jun 12 18:54:42 db1-test kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
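For reading the "return code" line above, here is a minimal sketch that splits the 32-bit SCSI result word into its four bytes, assuming the classic mid-layer layout (status in bits 0-7, message in 8-15, host in 16-23, driver in 24-31). The function name decode_scsi_result is made up for illustration.

```shell
# Split a SCSI result word into its status/msg/host/driver bytes.
# Assumes the classic Linux SCSI mid-layer layout; the function name is hypothetical.
decode_scsi_result() {
    result=$(( $1 ))
    status=$(( result & 0xff ))           # SCSI status byte
    msg=$(( (result >> 8) & 0xff ))       # message byte
    host=$(( (result >> 16) & 0xff ))     # host byte (0x00 is DID_OK)
    driver=$(( (result >> 24) & 0xff ))   # driver byte (0x06 is DRIVER_TIMEOUT)
    printf 'status=0x%02x msg=0x%02x host=0x%02x driver=0x%02x\n' \
        "$status" "$msg" "$host" "$driver"
}
```

Calling decode_scsi_result 0x06000000 yields a host byte of 0x00 and a driver byte of 0x06, which lines up with the hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT line the kernel logged: the command timed out in the driver rather than being rejected by the disk.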

The drive array containing the import is a 10-disk SAS array in RAID 1+0 using 300GB 10k disks. The RAID controller is an LSI MegaRAID SAS 9260-8i. No disk or adapter errors are reported via MegaCLI.

  • Is this a hardware issue?
  • Is there any way to troubleshoot? The RAID controller status is fine. The disks and logical drives report healthy.
  • Is this a Linux OS or tuning issue? I'll try with different I/O schedulers to be sure. CFQ is default.

Edit:

Other schedulers have been tried with the same result. There is a third-party (Vormetric) filesystem encryption module running in this setup. Removing it allows the import to complete. So now I'm wondering if this is a deficiency in the module or if it is triggering a bad condition in the LSI driver.


During the import, we're hitting 14,000 write IOPS.

In recent attempts, the system stalls entirely with the following on the console.
[Image: console output at the time of the stall]

Last top output before freeze.
[Image: top output]

Best Answer

Ultimately Sergey is right - this is a driver problem. But let's check things out first:

First off, you'll want to use the deadline I/O scheduler rather than CFQ. deadline, as its name implies, puts an expiry time on each request so that no I/O is left waiting indefinitely behind others.
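A rough sketch of checking and switching the scheduler through sysfs follows. The device name sdb used in the example afterwards and the SYSFS_ROOT override are assumptions for illustration; substitute the block device that backs the affected logical drive. The change needs root and does not survive a reboot.

```shell
# Print the active scheduler for a block device (the bracketed entry).
# SYSFS_ROOT is a hypothetical override so this can be tried against a scratch directory.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "${SYSFS_ROOT:-/sys}/block/$1/queue/scheduler"
}

# Select a scheduler for a block device at runtime (root required on a real system).
set_scheduler() {
    echo "$2" > "${SYSFS_ROOT:-/sys}/block/$1/queue/scheduler"
}
```

For example, set_scheduler sdb deadline followed by active_scheduler sdb should confirm the switch. To make it the default for all devices at boot, elevator=deadline can be added to the kernel command line.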

Grab the events from the megaraid card:

megacli -adpeventlog -getevents -f /tmp/megaraid-$(date +%F_%T) -aALL

Check the SMART data on the disks (you will need to build a new smartmontools for this to work):

# megacli -pdlist -a0 |grep 'Device Id'
Device Id: 10
Device Id: 9

# smartctl -a /dev/sda -d megaraid,9
«…»
# smartctl -a /dev/sda -d megaraid,10
«…»
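With ten disks behind the controller, the two steps above can be combined into a loop. This is only a sketch: it assumes megacli and a megaraid-capable smartctl are on the PATH, that adapter 0 is the right one, and that /dev/sda is a device node on that controller, as in the commands above. The helper names device_ids and smart_all are made up.

```shell
# Extract the numeric IDs from "Device Id: N" lines read on stdin.
device_ids() {
    awk -F': *' '/^Device Id:/ { print $2 }'
}

# Run smartctl against every physical disk on adapter 0.
# $1 is a device node on the controller (defaults to /dev/sda, as used above).
smart_all() {
    for id in $(megacli -pdlist -a0 | device_ids); do
        echo "=== megaraid,$id ==="
        smartctl -a "${1:-/dev/sda}" -d "megaraid,$id"
    done
}
```

Running smart_all > /tmp/smart-all.txt keeps the output in one file, which makes it easier to compare disks side by side.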

If everything looks OK, go ahead and try out the latest driver from LSI.
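Before swapping drivers, it helps to record which megaraid_sas version is currently loaded so you can tell afterwards that the new one took effect. A minimal sketch, assuming the module exposes its version under /sys/module (modinfo -F version megaraid_sas is an alternative that reads it from the module file on disk); the function name and the SYSFS_ROOT override are hypothetical:

```shell
# Print the version string of the loaded megaraid_sas module.
# SYSFS_ROOT is a hypothetical override for trying this outside a real system.
megasas_version() {
    cat "${SYSFS_ROOT:-/sys}/module/megaraid_sas/version"
}
```

Note the value before and after the update, and keep in mind that the root device may load the driver from the initrd, so the initrd may need to be rebuilt for the new version to be used at boot.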


"There is a third-party (Vormetric) filesystem encryption module running in this setup. Removing it allows the import to complete. So now I'm wondering if this is a deficiency in the module or if it is triggering a bad condition in the LSI driver."

The Vormetric module is likely doing something incompatible, yes. I would start by talking with them about how their module is screwing up your system under high load.