Debian – HP DL360 G7 P410i controller troubleshooting

Tags: debian, hp, hp-proliant, hp-smart-array

The server is an HP DL360 G7 with a P410i disk controller, 2x E5620 CPUs and 16 GB RAM. It runs Debian 6.0.7: Linux mysql 2.6.32-5-amd64 #1 SMP Mon Feb 25 00:26:11 UTC 2013 x86_64 GNU/Linux
hpacucli "ctrl all show status"

Smart Array P410i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK

hpacucli "ctrl all show config"

Smart Array P410i in Slot 0 (Embedded)    (sn: 5001438014555B80)

   array A (SAS, Unused Space: 0 MB)


      logicaldrive 1 (136.7 GB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 72 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 72 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 72 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 72 GB, OK)

   SEP (Vendor ID PMCSIERA, Model  SRC 8x6G) 250 (WWID: 5001438014555B8F)

hpacucli "ctrl slot=0 ld all show"

Smart Array P410i in Slot 0 (Embedded)

   array A

      logicaldrive 1 (136.7 GB, RAID 1+0, OK)

I ran the following script overnight:

#!/bin/bash
mkdir -p /isotest
for i in {1..200}; do
    for j in {1..55}; do cp -v /root/ubuntu.iso /isotest/ubuntu.iso${j}; done
    rm /isotest/ubuntu.iso*;
done

/root/ubuntu.iso is about 2 GB in size.

Syslog contains some errors, which I think are related to the disk controller:

Mar 28 06:59:17 mysql kernel: [850337.524306] INFO: task mandb:25565 blocked for more than 120 seconds.
Mar 28 06:59:17 mysql kernel: [850337.524337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 06:59:17 mysql kernel: [850337.524381] mandb         D ffff88022740fa20     0 25565  25197 0x00000000
Mar 28 06:59:17 mysql kernel: [850337.524385]  ffff88041ec4b880 0000000000000082 0000000000000000 000000009d778d11
Mar 28 06:59:17 mysql kernel: [850337.524388]  ffffea000defe260 ffffea000defe260 000000000000f9e0 ffff88014d913fd8
Mar 28 06:59:17 mysql kernel: [850337.524390]  00000000000157c0 00000000000157c0 ffff88013228a350 ffff88013228a648
Mar 28 06:59:17 mysql kernel: [850337.524393] Call Trace:
Mar 28 06:59:17 mysql kernel: [850337.524404]  [<ffffffff810168ec>] ? read_tsc+0xa/0x20
Mar 28 06:59:17 mysql kernel: [850337.524408]  [<ffffffff8106bdca>] ? timekeeping_get_ns+0xe/0x2e
Mar 28 06:59:17 mysql kernel: [850337.524412]  [<ffffffff810b4761>] ? sync_page+0x0/0x46
Mar 28 06:59:17 mysql kernel: [850337.524416]  [<ffffffff812fc8f2>] ? io_schedule+0x73/0xb7
Mar 28 06:59:17 mysql kernel: [850337.524418]  [<ffffffff810b47a2>] ? sync_page+0x41/0x46
Mar 28 06:59:17 mysql kernel: [850337.524421]  [<ffffffff812fcd02>] ? __wait_on_bit_lock+0x3f/0x84
Mar 28 06:59:17 mysql kernel: [850337.524423]  [<ffffffff810b472e>] ? __lock_page+0x5d/0x63
Mar 28 06:59:17 mysql kernel: [850337.524426]  [<ffffffff810652e0>] ? wake_bit_function+0x0/0x23
Mar 28 06:59:17 mysql kernel: [850337.524428]  [<ffffffff810b473d>] ? lock_page+0x9/0x1f
Mar 28 06:59:17 mysql kernel: [850337.524431]  [<ffffffff810b4853>] ? find_lock_page+0x25/0x45
Mar 28 06:59:17 mysql kernel: [850337.524433]  [<ffffffff810b4e63>] ? filemap_fault+0x1a5/0x2f6
Mar 28 06:59:17 mysql kernel: [850337.524438]  [<ffffffff810cadf2>] ? __do_fault+0x54/0x3c3
Mar 28 06:59:17 mysql kernel: [850337.524455]  [<ffffffffa01702d2>] ? __ext3_journal_stop+0x1f/0x3d [ext3]
Mar 28 06:59:17 mysql kernel: [850337.524458]  [<ffffffff810cd146>] ? handle_mm_fault+0x3b8/0x80f
Mar 28 06:59:17 mysql kernel: [850337.524461]  [<ffffffff81101d8e>] ? notify_change+0x2b3/0x2c5
Mar 28 06:59:17 mysql kernel: [850337.524464]  [<ffffffff81103eb5>] ? mntput_no_expire+0x23/0xee
Mar 28 06:59:17 mysql kernel: [850337.524467]  [<ffffffff81300096>] ? do_page_fault+0x2e0/0x2fc
Mar 28 06:59:17 mysql kernel: [850337.524469]  [<ffffffff812fdf35>] ? page_fault+0x25/0x30

There are no other error messages.

Or could this error be related to memory? I have already run memtest86+ on this server for several days, and it reported no errors.

While the server was in the data center, I could not boot it. It kept showing this error:

Fatal PCI Express Device Error PCI ? B00/D00/F00

After I moved it to my workplace, it booted normally. The iLO event log contains the following errors:

Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 0, Function 0, Error status 0x00000000)
Uncorrectable Memory Error ((Processor 1, Memory Module 2))
Uncorrectable Memory Error ((Processor 1, Memory Module 3))
An Unrecoverable System Error (NMI) has occurred (System error code 0x00000000, 0x00000000)

I have already updated the BIOS, disk controller and drive firmware to the latest versions.

Best Answer

You have bad RAM or a system board issue. I suspect a system board failure, as the Smart Array P410i controller is onboard.
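
If you want a second opinion on the memory from the running OS, one option is to read the kernel's EDAC error counters in sysfs. This is only a sketch: it assumes an EDAC driver for this chipset is loaded, which may not be the case on a 2.6.32 kernel, and the function name edac_counts is made up for illustration. If the directory does not exist, it prints nothing.

```shell
#!/bin/bash
# Print corrected (ce) and uncorrected (ue) error counts for each
# memory controller exposed by the kernel's EDAC subsystem.
# The base directory is a parameter only so the function can be
# pointed at another tree; on a real system the default applies.
# ASSUMPTION: an EDAC driver is loaded; otherwise nothing is printed.
edac_counts() {
    local base="${1:-/sys/devices/system/edac/mc}"
    local mc
    for mc in "$base"/mc*; do
        # Skip entries without readable counters (or an unmatched glob).
        [ -r "$mc/ce_count" ] || continue
        echo "$(basename "$mc"): ce=$(cat "$mc/ce_count") ue=$(cat "$mc/ue_count")"
    done
}
```

Call edac_counts with no arguments on the server. Non-zero ue values would agree with the uncorrectable memory errors in the iLO log; all zeros prove little, since the counters reset at boot.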

The iLO messages are pretty specific. The server-side agents would probably report the same thing if you looked at the output of hplog -v, which prints the system's Integrated Management Log (IML).
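
To pull just the serious entries out of that log, something like the following could work. It is a minimal sketch: filter_critical_iml is a hypothetical helper name, the keywords are taken from the iLO entries quoted in the question, and it assumes hplog (from HP's health agents) is installed.

```shell
#!/bin/bash
# Hypothetical helper: keep only log lines that mention uncorrectable
# or unrecoverable errors, or an NMI. Reads log text on stdin.
# The "|| true" keeps the exit status 0 when nothing matches.
filter_critical_iml() {
    grep -i -E 'uncorrectable|unrecoverable|NMI' || true
}
```

Usage on the server would be: hplog -v | filter_critical_iml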

For now, I'd reseat all components and see if I could get the system to boot in a minimal configuration: one CPU, minimum installed DIMMs.

You can also download the bootable HP SmartStart ISO and load it via iLO to run a diagnostics loop.

This is a G7 ProLiant, and the server should still be under standard warranty. Call HP.