Linux – HP DL380 G5 – Smart Array P400 – Linux hang with high load randomly

hp-proliant hp-smart-array linux postfix

For the past 2-3 weeks, my main server has been hanging for no apparent reason. Before that, it ran without issues for more than 4 months in a row. Every time, a simple reboot fixes the problem.

Current setup :

  • HP DL380 G5, 2 x Xeon 4C 3GHz, 16GB memory, 6 x 146GB in RAID 0+1
  • Slackware 14.0

I leave the server with PuTTY open and top running. When it hangs (about 1 to 3 times a day), I see a high load, above 60, and all network services (HTTP, DNS, SMTP, IMAP, POP3, etc.) are unresponsive. When connecting with PuTTY, I'm able to log in, but the prompt never appears; the same happens on the local console (keyboard + screen). I've also noticed that the green LEDs on the drives flash simultaneously at about 0.5 Hz to 1 Hz (normally they flash much faster and in random order).

I first suspected DDoS attacks, so I added many fail2ban rules and TCP request limits on the external firewall. After that, I checked the firmware versions (including the P400) and upgraded everything to the latest versions; the problem still occurs. I also synced the root filesystem to another DL380 G5 (same hardware except 4 x 450GB drives) to replace the server, with the same result.

I checked with top, iostat, and iotop, but still have no clue. When the load is high, there is almost no CPU usage (top) and no disk activity (iostat).

Now I'm wondering whether the version of the CCISS driver I'm using could have an issue.

Here is some information that may be useful:

Controller details:

root@hyperion:~# hpacucli

=> ctrl all show status

Smart Array P400 in Slot 1
Controller Status: OK
Cache Status: OK
Battery/Capacitor Status: OK

=> ctrl all show detail

Smart Array P400 in Slot 1
Bus Interface: PCI
Slot: 1
Serial Number: P61620G9SVM38V
Cache Serial Number: PA2270H9SVI198
RAID 6 (ADG) Status: Enabled
Controller Status: OK
Hardware Revision: D
Firmware Version: 6.86
Rebuild Priority: Medium
Expand Priority: Medium
Surface Scan Delay: 15 secs
Surface Scan Mode: Idle
Wait for Cache Room: Disabled
Surface Analysis Inconsistency Notification: Disabled
Post Prompt Timeout: 0 secs
Cache Board Present: True
Cache Status: OK
Cache Ratio: 25% Read / 75% Write
Drive Write Cache: Disabled
Total Cache Size: 512 MB
Total Cache Memory Available: 464 MB
No-Battery Write Cache: Disabled
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
SATA NCQ Supported: True

=> ctrl all show config

Smart Array P400 in Slot 1 (sn: P61620G9SVM38V)

array A (SAS, Unused Space: 0 MB)


logicaldrive 1 (838.3 GB, RAID 1+0, OK)

physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 450 GB, OK)
physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 450 GB, OK)
physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 450 GB, OK)
physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 450 GB, OK)

Driver details:

root@hyperion:~# modinfo cciss
filename: /lib/modules/3.2.29/kernel/drivers/block/cciss.ko
license: GPL
version: 3.6.26
description: Driver for HP Smart Array Controllers
author: Hewlett-Packard Company
srcversion: D553A90CDE37829B37A9C27
alias: pci:v0000103Cd00003230sv0000103Csd0000323Dbc*sc*i*
alias: pci:v0000103Cd00003230sv0000103Csd00003237bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003215bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003214bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003213bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003212bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003211bc*sc*i*
alias: pci:v0000103Cd00003230sv0000103Csd00003235bc*sc*i*
alias: pci:v0000103Cd00003230sv0000103Csd00003234bc*sc*i*
alias: pci:v0000103Cd00003230sv0000103Csd00003223bc*sc*i*
alias: pci:v0000103Cd00003220sv0000103Csd00003225bc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd0000409Dbc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd0000409Cbc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd0000409Bbc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd0000409Abc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd00004091bc*sc*i*
alias: pci:v00000E11d0000B178sv00000E11sd00004083bc*sc*i*
alias: pci:v00000E11d0000B178sv00000E11sd00004082bc*sc*i*
alias: pci:v00000E11d0000B178sv00000E11sd00004080bc*sc*i*
alias: pci:v00000E11d0000B060sv00000E11sd00004070bc*sc*i*
depends:
intree: Y
vermagic: 3.2.29 SMP mod_unload
parm: cciss_tape_cmds:number of commands to allocate for tape devices (default: 6) (int)
parm: cciss_simple_mode:Use 'simple mode' rather than 'performant mode' (int)

top output when hanging

top - 10:39:45 up 43 min,  2 users,  load average: 24.58, 7.14, 2.88
Tasks: 282 total,   1 running, 281 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32894436k total, 17964512k used, 14929924k free,    97732k buffers
Swap:        0k total,        0k used,        0k free, 10694424k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3928 root      20   0 37164 2988 2444 S    0  0.0   0:00.41 sshd
 4478 root      20   0 17608 1540 1060 R    0  0.0   0:07.62 top
    1 root      20   0  4316  696  600 S    0  0.0   0:00.98 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S    0  0.0   0:00.01 ksoftirqd/0
    5 root      20   0     0    0    0 S    0  0.0   0:00.02 kworker/u:0
    6 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
    7 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/1
    9 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1
   11 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/2
   13 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/2
   14 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/3
   16 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/3
   17 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/4
   19 root      20   0     0    0    0 S    0  0.0   0:00.01 ksoftirqd/4
   20 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/5
   22 root      20   0     0    0    0 S    0  0.0   0:00.01 ksoftirqd/5
   23 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/6
   25 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/6
   26 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/7
   28 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/7
   29 root       0 -20     0    0    0 S    0  0.0   0:00.00 cpuset
   30 root       0 -20     0    0    0 S    0  0.0   0:00.00 khelper
   31 root      20   0     0    0    0 S    0  0.0   0:00.00 kdevtmpfs
   32 root       0 -20     0    0    0 S    0  0.0   0:00.00 netns
   33 root      20   0     0    0    0 S    0  0.0   0:00.00 kworker/u:1
  495 root      20   0     0    0    0 D    0  0.0   0:05.24 sync_supers
  497 root      20   0     0    0    0 S    0  0.0   0:00.00 bdi-default
  499 root       0 -20     0    0    0 S    0  0.0   0:00.00 kblockd
  654 root       0 -20     0    0    0 S    0  0.0   0:00.00 ata_sff
  661 root      20   0     0    0    0 S    0  0.0   0:00.00 khubd
  667 root       0 -20     0    0    0 S    0  0.0   0:00.00 md
  676 root      20   0     0    0    0 S    0  0.0   0:00.40 kworker/3:1
  677 root      20   0     0    0    0 S    0  0.0   0:00.12 kworker/4:1
  678 root      20   0     0    0    0 S    0  0.0   0:00.65 kworker/5:1
  679 root      20   0     0    0    0 S    0  0.0   0:00.16 kworker/6:1
  680 root      20   0     0    0    0 S    0  0.0   0:00.21 kworker/7:1
  774 root       0 -20     0    0    0 S    0  0.0   0:00.00 rpciod
  826 root      20   0     0    0    0 S    0  0.0   0:00.00 khungtaskd
  832 root      20   0     0    0    0 S    0  0.0   0:00.00 kswapd0
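On Linux, the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, which is one way a high load can coexist with an idle CPU. The snapshot above shows only the first screen of tasks, with sync_supers in state D. A minimal sketch for listing every process in that state by reading /proc directly is given here; the function names are illustrative, and it only looks at per-process entries, not individual threads.

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen].strip())
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state

def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError):
            continue  # process exited while being read, or unexpected format
        if state == "D":
            found.append((pid, comm))
    return found
```

Printing `blocked_tasks()` every few seconds from a session that is already open would show which processes pile up when the load climbs. It would need to be started before the hang, since new logins never reach a prompt at that point.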

DL380 G6 with P410i migration

I also tried another HP server by moving the hard drives over directly and replacing /dev/cciss/c0d0* with /dev/sda* in /etc/fstab and /etc/lilo.conf. Still the same issue.

Controller details:

Note: Yes, the cache is disabled; I simply have no battery for that server right now.

root@hyperion:~# modprobe sg
root@hyperion:~# hpacucli ctrl all show detail

Smart Array P410i in Slot 0 (Embedded)
   Bus Interface: PCI
   Slot: 0
   Serial Number: 50123456789ABCDE
   Cache Serial Number: PAAVP9VYBAU0
   RAID 6 (ADG) Status: Disabled
   Controller Status: OK
   Hardware Revision: C
   Firmware Version: 6.64
   Rebuild Priority: Medium
   Expand Priority: Medium
   Surface Scan Delay: 15 secs
   Surface Scan Mode: Idle
   Queue Depth: Automatic
   Monitor and Performance Delay: 60  min
   Elevator Sort: Enabled
   Degraded Performance Optimization: Disabled
   Inconsistency Repair Policy: Disabled
   Wait for Cache Room: Disabled
   Surface Analysis Inconsistency Notification: Disabled
   Post Prompt Timeout: 0 secs
   Cache Board Present: True
   Cache Status: OK
   Cache Ratio: 100% Read / 0% Write
   Drive Write Cache: Disabled
   Total Cache Size: 512 MB
   Total Cache Memory Available: 400 MB
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
   SATA NCQ Supported: True

Driver details:

root@hyperion:~# modinfo hpsa
filename:       /lib/modules/3.2.29/kernel/drivers/scsi/hpsa.ko
license:        GPL
version:        2.0.2-1
description:    Driver for HP Smart Array Controller version 2.0.2-1
author:         Hewlett-Packard Company
srcversion:     624DA19A5286F6BDA1645F3
alias:          pci:v0000103Cd*sv*sd*bc01sc04i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003356bc*sc*i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003355bc*sc*i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003354bc*sc*i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003353bc*sc*i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003352bc*sc*i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003351bc*sc*i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003350bc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd00003233bc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd0000324Bbc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd0000324Abc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd00003249bc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd00003247bc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd00003245bc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd00003243bc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd00003241bc*sc*i*
depends:
intree:         Y
vermagic:       3.2.29 SMP mod_unload
parm:           hpsa_allow_any:Allow hpsa driver to access unknown HP Smart Array hardware (int)
parm:           hpsa_simple_mode:Use 'simple mode' rather than 'performant mode' (int)

Possible cause

Yesterday, while testing different processes, I disabled postfix and the server stopped hanging. As soon as I started it again, the server hung. It looks like there is either a bad configuration or suspicious SMTP requests being made.
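One way to check the "suspicious SMTP requests" theory is to count how often each client connects to postfix. The sketch below assumes the usual smtpd log line shape, `postfix/smtpd[PID]: connect from host[ip]`; the function name is illustrative, and the log path in the usage note is an assumption that depends on the syslog configuration.

```python
import re
from collections import Counter

# Matches Postfix smtpd lines such as:
#   postfix/smtpd[4321]: connect from unknown[203.0.113.7]
# "disconnect from" lines do not match because of the leading ": ".
CONNECT_RE = re.compile(r"postfix/smtpd\[\d+\]: connect from \S*\[([^\]]+)\]")

def count_smtp_clients(lines):
    """Count smtpd connections per client address in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = CONNECT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, `count_smtp_clients(open("/var/log/maillog", errors="replace")).most_common(10)` would list the ten busiest client addresses, assuming mail is logged to that file. A handful of addresses with very high counts shortly before each hang would support the theory; an even spread would point back toward configuration.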

Best Answer

The HP ProLiant G5 server series is pretty old equipment and out of support from every reasonable perspective. This equipment went end-of-life in 2009.

However, if you don't mind being unsupported and the fact that the system is four generations old, the server can still function.

For your situation, you're working with a bad revision of firmware on the RAID controller. I recommend you update the firmware of your RAID controller to the most current release (2012).

Normally, you could do this from within the operating system, but Slackware is totally unsupported by HP as well. If you can find a way to update firmware, this will likely resolve the issue.
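To confirm which firmware version the controller reports before and after an update, the `hpacucli ctrl all show detail` output shown in the question can be turned into a dictionary. This is a rough sketch with an illustrative function name. It assumes a single controller in the output, since repeated keys overwrite each other.

```python
def parse_controller_detail(text):
    """Turn 'Key: Value' lines into a dict; lines without ': ' are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:  # header lines such as 'Smart Array P400 in Slot 1' have no ': '
            fields[key] = value.strip()
    return fields
```

With the output captured to a string, `parse_controller_detail(output)["Firmware Version"]` gives the version to compare against HP's latest release for that controller.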

