Resolving Sudden Peaks in Load and Disk Block Wait

Tags: linux, performance

Hello superior server gurus!

I'm running an Ubuntu server that hosts an Apache Tomcat service along with a MySQL database. The server load is always close to zero, even during the busiest hours of the week. Despite that, I am experiencing random hangups 1-2 times per week where the entire server stops responding.

An interesting effect of this lockup is that all cron jobs seem to be executed later than scheduled, at least that is what the timestamps in various system logs indicate. So it appears to me that it really is the entire server that freezes, not just the custom software running as part of the Tomcat service.
The hangup normally lasts for about 3-5 minutes, and afterwards everything jumps back to normal.
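
For what it's worth, the way I have been confirming the delayed cron runs is simply by grepping the cron entries out of syslog and comparing the logged times against the crontab schedule (this assumes the default Ubuntu log location, /var/log/syslog):

# Show the most recent cron executions; jobs logged several minutes past their scheduled slot line up with the hangs
grep CRON /var/log/syslog | tail -n 50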

Hardware:
Model: Dell PowerEdge R720, 16 cores, 16 GB RAM
HDD configuration: RAID-1 (mirror)

Main services: 
Apache Tomcat, MySQL, SSH/SFTP

#uname -a
Linux es2 2.6.24-24-server #1 SMP Tue Jul 7 19:39:36 UTC 2009 x86_64 GNU/Linux

Running sysstat I can see huge peaks in both average load and disk block waits that correspond exactly in time to when customers have reported problems with the backend system. Below is a plot of the disk usage from sar with a very obvious peak around 12:30 pm.

My sincere apologies for putting this on an external server, but my rep is too low to include files here directly. I also had to put them together since I can only post one link :S

Sar plots: http://213.115.101.5/abba/tmpdata/sardata_es.jpg

Graph 1: Block wait; notice how the %util goes up to 100% at approx 12:58.

Graph 2: Block transfer, nothing unusual here.

Graph 3: Average load, peaks together with Graph 1.

Graph 4: CPU usage, still close to 0%.

Graph 5: Memory, nothing unusual here.
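
For reference, the graphs above come from the collected sysstat data, plotted from roughly the following sar invocations (the sa file path is just an example; adjust it to wherever your sysstat data lives):

# Disk utilisation / block waits (Graphs 1-2)
sar -d -f /var/log/sysstat/sa15
# Run queue length and load averages (Graph 3)
sar -q -f /var/log/sysstat/sa15
# CPU usage (Graph 4) and memory usage (Graph 5)
sar -u -f /var/log/sysstat/sa15
sar -r -f /var/log/sysstat/sa15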

Does anyone have any clue as to what could cause this effect on a system? As I explained earlier, the only software running on the server is a Tomcat server with a SOAP interface that allows users to connect to the database. Remote applications also connect to the server via SSH to pull files from and upload files to it. At busy times I'm guessing that we have about 50 concurrent SSH/SFTP connections and no more than 100-200 connections over HTTP (SOAP/Tomcat).
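
In case it is relevant, this is roughly how I count the concurrent sessions (the port numbers assume the defaults, 22 for SSH/SFTP and 8080 for the Tomcat HTTP connector; adjust if yours differ):

# Established SSH/SFTP sessions (port 22 assumed)
netstat -tn | grep ':22 ' | grep -c ESTABLISHED
# Established HTTP connections to Tomcat (port 8080 assumed)
netstat -tn | grep ':8080 ' | grep -c ESTABLISHED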

Googling around I found discussions about file handles and inode handles, but I think the numbers below are normal for 2.6.x kernels. Does anyone disagree?

cat /proc/sys/fs/file-nr
1152    0       1588671
cat /proc/sys/fs/inode-state
11392   236     0       0       0       0       0
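
For what it's worth, the three columns of file-nr are allocated handles, allocated-but-unused handles, and the system-wide maximum (fs.file-max), so 1152 out of 1588671 is nowhere near the limit. The same values can also be read via sysctl:

sysctl fs.file-nr fs.file-max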

At the same time, "sar -v" shows the following values around the time of the hangup. Note that the inode-nr here is ALWAYS much higher than the inode count from /proc/sys/fs/inode-state above.

12:40:01    dentunusd   file-nr  inode-nr    pty-nr
12:40:01        40542      1024     15316         0
12:45:01        40568      1152     15349         0
12:50:01        40587       768     15365         0
12:55:01        40631      1024     15422         0
13:01:02        40648       896     15482         0
13:05:01        40595       768     15430         0
13:10:01        40637      1024     15465         0

I have seen this on two independent servers running the same setup of hardware, OS, software, RAID configuration etc. So I want to believe that it is more software/configuration dependent than hardware dependent.

Big thanks for your time
/Ebbe

Best Answer

The problems were related to an incompatibility issue between Ubuntu 8.04 LTS (Hardy) and the Dell PERC 6/i RAID controller, as reported in this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/607167. Upgrading to Ubuntu 10.04 LTS (Lucid, kernel 2.6.32) resolved the issue.

Posting this here in case anyone else runs into the same issue.
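
If you suspect you are hitting the same bug, a quick way to check the kernel/release and whether the box actually has a PERC controller (driven by the megaraid_sas module on these systems) is something along these lines:

# Kernel and distribution release
uname -r
lsb_release -ds
# Is a PERC/MegaRAID controller present, and which megaraid_sas driver version is loaded?
lspci | grep -i raid
modinfo megaraid_sas | grep -i '^version'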