Linux – Server freezes and load suddenly goes very high

centos kernel linux linux-kernel

I'm on CentOS 5.7 and the server freezes quite often. The logs show the following output:

Dec 26 18:33:51 server kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 18:34:00 server init: Switching to runlevel: 6
Dec 26 18:34:00 server kernel: pdflush       D 0000098C  2492   206     19           207   204 (L-TLB)
Dec 26 18:34:00 server kernel:        f7c02e04 00000046 6e39208c 0000098c f138e27c 0000003b cee0e5c0 0000000a 
Dec 26 18:34:00 server kernel:        c35ff550 6e3a00e8 0000098c 0000e05c 00000001 c35ff65c c3419944 f7aca740 
Dec 26 18:34:00 server kernel:        00000001 00000000 c041f0c8 00000000 0140f282 c042d86b f7f1e81c ffffffff 
Dec 26 18:34:00 server kernel: Call Trace:
Dec 26 18:34:00 server kernel:  [<c041f0c8>] __wake_up+0x2a/0x3d
Dec 26 18:34:00 server kernel:  [<c042d86b>] getnstimeofday+0x30/0xb6
Dec 26 18:34:00 server kernel:  [<c06242f8>] io_schedule+0x36/0x59
Dec 26 18:34:00 server kernel:  [<c04586d2>] sync_page+0x0/0x3b
Dec 26 18:34:00 server kernel:  [<c045870a>] sync_page+0x38/0x3b
Dec 26 18:34:00 server kernel:  [<c062440a>] __wait_on_bit_lock+0x2a/0x52
Dec 26 18:34:00 server kernel:  [<c045864d>] __lock_page+0x52/0x59
Dec 26 18:34:00 server kernel:  [<c0437420>] wake_bit_function+0x0/0x3c
Dec 26 18:34:00 server kernel:  [<c049811e>] mpage_writepages+0x135/0x308
Dec 26 18:34:00 server kernel:  [<f8895bb6>] ext3_ordered_writepage+0x0/0x166 [ext3]
Dec 26 18:34:00 server kernel:  [<c045d7b0>] do_writepages+0x2b/0x32
Dec 26 18:34:00 server kernel:  [<c04968a7>] __writeback_single_inode+0x166/0x2a5
Dec 26 18:34:00 server kernel:  [<c0496cca>] sync_sb_inodes+0x17e/0x221
Dec 26 18:34:00 server kernel:  [<c0496f19>] writeback_inodes+0x6a/0xb0
Dec 26 18:34:00 server kernel:  [<c045dbac>] background_writeout+0x71/0xc3
Dec 26 18:34:00 server kernel:  [<c045e10f>] pdflush+0x0/0x1a1
Dec 26 18:34:00 server kernel:  [<c045e21a>] pdflush+0x10b/0x1a1
Dec 26 18:34:00 server kernel:  [<c045db3b>] background_writeout+0x0/0xc3
Dec 26 18:34:00 server kernel:  [<c043732e>] kthread+0xc0/0xee
Dec 26 18:34:00 server kernel:  [<c043726e>] kthread+0x0/0xee
Dec 26 18:34:00 server kernel:  [<c0405c87>] kernel_thread_helper+0x7/0x10
Dec 26 18:34:00 server kernel:  =======================

What can cause these messages, and how can I fix this issue?

Best Answer

This could have many causes. The hung_task parameter shown in the log was introduced in RHEL 5.5.

You should not disable it, because you would lose important stack traces and debugging information. Here, the trace shows a problem with page writeback on an ext3 filesystem: the page that was being written was locked. The task responsible for writing the page was pdflush, and it went into D state (uninterruptible sleep), which means it is waiting for I/O to complete. Until that I/O completes, the task cannot be interrupted. Because pdflush is the kernel thread responsible for writing dirty pages out to disk, it is not surprising that the server freezes when pdflush is stuck in D state.
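To see which tasks are in D state while the server is struggling, something like the following sketch may help. It assumes a procps-style ps that accepts the pid, stat, wchan and comm output columns; list_d_state is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Keep the header line plus every row whose STAT column starts with "D"
# (uninterruptible sleep). list_d_state is a hypothetical helper name.
list_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# wchan shows the kernel function the task is sleeping in, when known.
ps -eo pid,stat,wchan:32,comm | list_d_state
```

If pdflush or kjournald show up here repeatedly, that is consistent with the writeback stall seen in the trace, though it does not identify the cause by itself.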

So, some possible clues: you may be writing too much dirty data, so check the memory state. Look at /proc/meminfo (the Dirty and Writeback fields) to see this.
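A minimal sketch for pulling just those counters out of /proc/meminfo is shown below. dirty_summary is a hypothetical helper name, and the optional file argument is there only so the function can be tried against a saved copy of meminfo.

```shell
#!/bin/sh
# Print the Dirty and Writeback counters (in kB) from a meminfo-style file.
# dirty_summary is a hypothetical helper; the file defaults to /proc/meminfo.
dirty_summary() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2, $3 }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    dirty_summary
fi
```

Running it every few seconds during heavy writes (for example under watch) gives a rough picture of whether dirty data keeps piling up faster than it is written out.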

If you are not writing too much dirty data, the cause could be something else. The stack trace doesn't indicate much beyond this. Do you have other traces?

If you have support for the server, you can run echo 1 > /proc/sys/kernel/hung_task_panic. This makes the kernel panic the next time the hung task timeout is reached, and with kdump configured the panic produces a vmcore. You need to set up kdump for this to work; follow the Red Hat articles or any reputable Linux blog to do so. The exact cause can then be found from the vmcore. Without it, all you can do is read the trace and guess.
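Before relying on this, it may be worth checking that both pieces are in place. The sketch below only reads state: it reports the current hung_task_panic value and whether a crashkernel= parameter (which kdump needs in order to reserve memory at boot) is on the kernel command line. check_hung_task_setup is a hypothetical helper, and its two path arguments are assumptions added so it can be tried on sample files.

```shell
#!/bin/sh
# Report whether a hung-task panic is enabled and whether the kernel was
# booted with crashkernel=, which kdump needs to capture a vmcore.
# check_hung_task_setup is a hypothetical helper; both arguments default
# to the real kernel paths.
check_hung_task_setup() {
    sysctl_dir="${1:-/proc/sys/kernel}"
    cmdline="${2:-/proc/cmdline}"

    if [ -r "$sysctl_dir/hung_task_panic" ]; then
        echo "hung_task_panic=$(cat "$sysctl_dir/hung_task_panic")"
    else
        echo "hung_task_panic=unavailable"
    fi

    if grep -q 'crashkernel=' "$cmdline" 2>/dev/null; then
        echo "crashkernel=present"
    else
        echo "crashkernel=absent"
    fi
}

check_hung_task_setup

# To enable the panic itself (as root), use the command from the answer:
#   echo 1 > /proc/sys/kernel/hung_task_panic
```

A crashkernel= entry only shows that memory was requested at boot; whether the kdump service is actually running still has to be checked separately.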