Linux – CentOS server high load and server freezes

Tags: centos5, high-load, linux, linux-kernel

My server seems to hit sudden high loads within seconds, and when I run dmesg the following messages appear:

INFO: task auditd:2185 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
auditd        D 0000012D  2216  2185      1          2216  2184 (NOTLB)
       f7fcaed0 00000086 eb3e3159 0000012d 0000012c 0000000e 00000000 00000007
       c35e3550 eb3e396f 0000012d 00000816 00000002 c35e365c c3420788 f7897200
       c36d0468 00000000 00000000 f7fcaecc c041f0c8 00000000 00000000 00000003
Call Trace:
 [<c041f0c8>] __wake_up+0x2a/0x3d
 [<c043753f>] prepare_to_wait+0x24/0x46
 [<f885f2f1>] log_wait_commit+0x80/0xc7 [jbd]
 [<c04373f3>] autoremove_wake_function+0x0/0x2d
 [<f885a680>] journal_stop+0x196/0x1bb [jbd]
 [<c04968da>] __writeback_single_inode+0x199/0x2a5
 [<c045d7b0>] do_writepages+0x2b/0x32
 [<c0459283>] __filemap_fdatawrite_range+0x66/0x72
 [<c0496f78>] sync_inode+0x19/0x24
 [<f8892019>] ext3_sync_file+0xb1/0xdc [ext3]
 [<c0479211>] do_fsync+0x41/0x83
 [<c0479270>] __do_fsync+0x1d/0x2b
 [<c0404ee1>] sysenter_past_esp+0x56/0x79

I'm really struggling to understand what is wrong. Here are the contents of /proc/meminfo:

# cat /proc/meminfo 
MemTotal:      4148160 kB
MemFree:        119352 kB
Buffers:         14024 kB
Cached:        3362784 kB
SwapCached:         84 kB
Active:        2608268 kB
Inactive:      1217900 kB
HighTotal:     3273304 kB
HighFree:         9092 kB
LowTotal:       874856 kB
LowFree:        110260 kB
SwapTotal:     4096552 kB
SwapFree:      4096456 kB
Dirty:           50816 kB
Writeback:      270996 kB
AnonPages:      449592 kB
Mapped:         889840 kB
Slab:           154948 kB
PageTables:      32796 kB
NFS_Unstable:        0 kB
Bounce:            624 kB
CommitLimit:   6170632 kB
Committed_AS:  2463988 kB
VmallocTotal:   116728 kB
VmallocUsed:      6728 kB
VmallocChunk:   109892 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

This is what the load looks like; it sometimes goes even higher than this:

Tasks: 506 total,   1 running, 504 sleeping,   0 stopped,   1 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 11.7%us,  1.9%sy,  0.0%ni, 86.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  1.0%us,  1.9%sy,  0.0%ni,  0.0%id, 96.1%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu3  :  7.8%us,  1.0%sy,  0.0%ni, 91.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4148160k total,  4033552k used,   114608k free,    13944k buffers
Swap:  4096552k total,       96k used,  4096456k free,  3361112k cached

Nothing seems to be wrong with the disks either:

# /usr/sbin/smartctl -q errorsonly -H -l selftest -l error /dev/sda
# /usr/sbin/smartctl -q errorsonly -H -l selftest -l error /dev/sdb
<nooutput>

I don't know how to debug this issue. If a process is causing this, how do I find which process it is? And if some kernel parameter needs tuning, I don't know which parameter I should tune.

Best Answer

Cpu2  :  1.0%us,  1.9%sy,  0.0%ni,  0.0%id, 96.1%wa,  0.0%hi,  1.0%si,  0.0%st

This is the most prominent performance problem I can see. Look at the wa (iowait) field: it is huge. That CPU is spending almost all of its time waiting on I/O.
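
To see whether that iowait is sustained or just a momentary spike, watch it over time. A minimal sketch, assuming the sysstat package is installed for mpstat (vmstat ships with procps and works without it):

# mpstat -P ALL 1    # per-CPU utilisation every second; watch the %iowait column
# vmstat 1           # 'wa' = iowait, 'b' = number of tasks currently blocked on I/O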

The stack trace shows that auditd calls fsync(), ext3 goes to commit the metadata to the journal, and the task then gets stuck waiting for that commit to finish. Perhaps some other process has already grabbed the journal and is not letting the commit complete, so tasks pile up in uninterruptible sleep behind it, and those blocked tasks are what push the load average up.
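
If you want to see which tasks are sitting in uninterruptible sleep (state D) and what they are blocked on, a quick sketch (the ps field widths are my own choice; adjust as needed):

# ps axo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'    # list D-state tasks and the kernel function they wait in

Running echo t > /proc/sysrq-trigger as root asks the kernel to dump every task's stack trace to the kernel log, which gives you the same kind of trace as above for all blocked processes at once.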

What you need to do is to collect fairly exhaustive data.

iostat -xdk 1 100

And the usual sar data. Also, post the I/O scheduler and the LUN queue depth of the hard disks.
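
For those last two, a sketch of where to read them on this kind of system (sda/sdb are taken from the question, so adjust for your disks; the queue_depth attribute is usually present for SCSI/SATA devices, and sar needs the sysstat package):

# cat /sys/block/sda/queue/scheduler     # active elevator is shown in brackets, e.g. [cfq]
# cat /sys/block/sda/device/queue_depth  # LUN queue depth for that device
# sar -u -d 1 60                         # CPU usage plus per-device I/O statistics

If the kernel exposes per-process I/O counters (/proc/<pid>/io), iotop or pidstat -d can also point at the specific process generating the writes.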