Linux kernel crash: mutex_lock_slowpath, “blocked for more than 120 seconds”. What to do?

kernel linux

I have an out-of-the-box Debian Lenny install with the stock (non-custom) kernel 2.6.26-2-amd64.
It is a brand-new server that is used to about 5% of its potential, CPU- and disk-wise,
meaning it is probably not crashing because of overload.

Every few days it freezes with hundreds of these messages in the console log:

: [284847.828428] INFO: task apache2:12473 blocked for more than 120 seconds.
: [284847.868468] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
: [284847.912759] apache2       D ffff8101bc6b7ab0     0 12473  14358
: [284847.912763]  ffff810160d5bc50 0000000000000082 ffff8101c0002e40 0000000000000000
: [284847.912766]  ffff8101a7c42950 ffff810327d92810 ffff8101a7c42bd8 0000000400000044
: [284847.912770]  ffff8101c0002e40 00000000000612d0 0000000000000000 00000040000612d0
: [284847.912773] Call Trace:
: [284847.912786]  [<ffffffff80429b0d>] __mutex_lock_slowpath+0x64/0x9b
: [284847.912790]  [<ffffffff80429972>] mutex_lock+0xa/0xb
: [284847.912794]  [<ffffffff802a20b9>] do_lookup+0x82/0x1c1
: [284847.912800]  [<ffffffff802a4271>] __link_path_walk+0x87a/0xd19
: [284847.912805]  [<ffffffff80295844>] kmem_getpages+0x96/0x15f
: [284847.912808]  [<ffffffff80295fb7>] ____cache_alloc_node+0x6d/0x106
: [284847.912814]  [<ffffffff802a4756>] path_walk+0x46/0x8b
: [284847.912819]  [<ffffffff802a4a82>] do_path_lookup+0x158/0x1cf
: [284847.912822]  [<ffffffff802a3879>] getname+0x140/0x1a7
: [284847.912827]  [<ffffffff802a53f1>] __user_walk_fd+0x37/0x4c
: [284847.912831]  [<ffffffff8029e381>] vfs_lstat_fd+0x18/0x47
: [284847.912840]  [<ffffffff8029e3c9>] sys_newlstat+0x19/0x31
: [284847.912848]  [<ffffffff8020beda>] system_call_after_swapgs+0x8a/0x8f

Almost all traces have __mutex_lock_slowpath at the top.

Only a few have a different trace:

: [284847.737386] INFO: task apache2:12472 blocked for more than 120 seconds.
: [284847.777551] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
: [284847.824881] apache2       D ffff8101bc6b7ab0     0 12472  14358
: [284847.824886]  ffff8101b9cc1c50 0000000000000086 ffffffffa0131e0a 0000000000000002
: [284847.824889]  ffff8102e7454300 ffff810324c6cad0 ffff8102e7454588 0000000000000000
: [284847.824893]  0000000000000001 0000000000000296 0000000000000003 ffff8101b9cc1c58
: [284847.824896] Call Trace:
: [284847.828403]  [<ffffffffa0131e0a>] :ext3:__ext3_journal_dirty_metadata+0x1e/0x46
: [284847.828412]  [<ffffffff80429b0d>] __mutex_lock_slowpath+0x64/0x9b
: [284847.828418]  [<ffffffff80429972>] mutex_lock+0xa/0xb
: [284847.828421]  [<ffffffff802a20b9>] do_lookup+0x82/0x1c1
: [284847.828427]  [<ffffffff802a4271>] __link_path_walk+0x87a/0xd19
: [284847.828428]  [<ffffffff80271296>] find_lock_page+0x1f/0x8a
: [284847.828428]  [<ffffffff80273182>] filemap_fault+0x1c2/0x33c
: [284847.828428]  [<ffffffff802a4756>] path_walk+0x46/0x8b
: [284847.828428]  [<ffffffff802a4a82>] do_path_lookup+0x158/0x1cf
: [284847.828428]  [<ffffffff802a3879>] getname+0x140/0x1a7
: [284847.828428]  [<ffffffff802a53f1>] __user_walk_fd+0x37/0x4c
: [284847.828428]  [<ffffffff8029e381>] vfs_lstat_fd+0x18/0x47
: [284847.828428]  [<ffffffff8029e3c9>] sys_newlstat+0x19/0x31
: [284847.828428]  [<ffffffff8020beda>] system_call_after_swapgs+0x8a/0x8f

: [1912668.466347] INFO: task apache2:17984 blocked for more than 120 seconds.
: [1912668.507035] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
: [1912668.555165] apache2       D ffff8101c5637ba0     0 17984  17282
: [1912668.596752]  ffff810166a7dd30 0000000000000086 0000000000000000 ffff810166a7dcd8
: [1912668.643341]  ffff8101c563c880 ffff81024505f000 0000000000000002 ffff810166a7dd68
: [1912668.699566]  0000000000000086 00000000000cb1a0 0000000000000000 ffff81017f344d60
: [1912668.744773] Call Trace:
: [1912668.761754]  [<ffffffff8022a3ed>] pick_next_task_fair+0x6e/0x7a
: [1912668.829311]  [<ffffffff802be0e2>] bio_alloc_bioset+0x89/0xd9
: [1912668.861930]  [<ffffffff8024ac3a>] getnstimeofday+0x39/0x98
: [1912668.897005]  [<ffffffff802710f6>] sync_page+0x0/0x41
: [1912668.927868]  [<ffffffff80429487>] io_schedule+0x5c/0x9e
: [1912668.960286]  [<ffffffff80271132>] sync_page+0x3c/0x41
: [1912668.991756]  [<ffffffff804295fa>] __wait_on_bit_lock+0x36/0x66
: [1912669.031757]  [<ffffffff802710e3>] __lock_page+0x5e/0x64
: [1912669.064191]  [<ffffffff802461d3>] wake_bit_function+0x0/0x23
: [1912669.100100]  [<ffffffff80281bc5>] handle_mm_fault+0x5e4/0x8de
: [1912669.134531]  [<ffffffff802461a5>] autoremove_wake_function+0x0/0x2e
: [1912669.174623]  [<ffffffff802aa108>] fcntl_setlk+0x1cf/0x291
: [1912669.210623]  [<ffffffff802461a5>] autoremove_wake_function+0x0/0x2e
: [1912669.246923]  [<ffffffff802a677f>] sys_fcntl+0x280/0x2f7

After googling for "mutex_lock_slowpath" I can only find kernel mailing list discussions saying that this issue was introduced by some commit, without reference to a version.
The discussions are as recent as Jan 25, 2011.
The kernel I am using is from Debian Lenny, from a year ago.

What should I do? Is this bug even fixed in the kernel? If it's such an obvious bug, why does it happen so rarely?

Should I download the latest kernel from kernel.org and upgrade?
Should I use Debian backports to install a newer "approved" kernel?

Am I missing something? What to do?

Best Answer

Are you using an SSD drive by any chance?

I've seen these errors on my Ubuntu 10.10 system. What happened was that a SATA glitch would completely mess up the disk subsystem. Subsequent attempts at disk I/O would then result in 120-second timeouts similar to yours (the stack trace varied).

I've documented the original problem here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/707583

The (somewhat lame) question I asked about the timeouts is here: Mystery stack traces in /var/log/messages

I was using a P55-based motherboard and Crucial SSD drives. However, I've seen this reported with other chipsets, other makes of SSD drives, and other Linux kernels.

As far as I can tell, the only thing in common was the use of SSD drives.

Related Solutions

Httpd – Log – Server kernel: INFO: task httpd:000000 blocked for more than 120 seconds

The task is blocked waiting on a mutex.

Check the printed stack trace carefully. It reads bottom-up: the most recent call is at the top. You will find this line:

mutex_lock_slowpath
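
When the log contains hundreds of these messages, it can help to tally which function is listed first in each trace. The following is a minimal sketch, assuming the frames look like the ones quoted in the question (`[<address>] function+0xoffset/0xsize`). The function name `top_frames` is made up for this example.

```shell
# Tally the first frame listed after each "Call Trace:" line.
# Reads kernel log text on stdin and prints "<count> <function>",
# most frequent first. Assumes frames look like:
#   [<ffffffff80429b0d>] __mutex_lock_slowpath+0x64/0x9b
top_frames() {
    awk '
        /Call Trace:/ { want = 1; next }
        want && match($0, /\] [^ ]+\+0x/) {
            # The match is "] name+0x": skip the leading "] " (2 chars)
            # and drop the trailing "+0x" (3 chars).
            name = substr($0, RSTART + 2, RLENGTH - 5)
            count[name]++
            want = 0
        }
        END { for (n in count) print count[n], n }
    ' | sort -rn
}
```

For example, `dmesg | top_frames` or `top_frames < /var/log/kern.log` (the log path is an assumption and varies by distribution). The first listed frame may not always be the function that is actually blocking, so treat the tally as a hint rather than a diagnosis.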

There seems to be a resource crunch.

Sysstat, as suggested, is a good profiling tool in most cases. If you need to get to the root of the issue, you will need a vmcore (a kernel memory dump). There are two relevant /proc files:

/proc/sys/kernel/hung_task_timeout_secs
/proc/sys/kernel/hung_task_panic

The value of the first file is 120. That is why you are seeing messages that the task is blocked for 120 seconds. A trivial test is to increase it and see what happens. Make it 240 or 360.

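The second file has a default value of 0. It needs to be 1 if you want to collect a vmcore: with 1, the kernel panics when it detects a hung task, and the panic is what gives kdump the chance to write the dump.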
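
As a rough sketch, the current values of both files can be inspected like this. The `PROC_DIR` variable and the function name are assumptions added for the example (they make it easy to point the check somewhere else); on kernels built without hung-task detection the files may not exist at all.

```shell
# Print the current hung-task settings.
# PROC_DIR is a hypothetical override; it defaults to the real location.
PROC_DIR="${PROC_DIR:-/proc/sys/kernel}"

show_hung_task_settings() {
    for name in hung_task_timeout_secs hung_task_panic; do
        if [ -r "$PROC_DIR/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$PROC_DIR/$name")"
        else
            printf '%s = (not available)\n' "$name"
        fi
    done
}

show_hung_task_settings

# To change the values (as root), for example:
#   echo 240 > /proc/sys/kernel/hung_task_timeout_secs
#   echo 1   > /proc/sys/kernel/hung_task_panic
```

Values written this way do not survive a reboot, which is convenient for a quick test.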

Obviously, you need to set up kdump and configure the dump target. The dump target should be larger than the physical memory size. But even if you collect the vmcore, you will need some C, assembly and general debugging knowledge to make sense of it. Professional support or an experienced sysadmin can help more.

But IMO, changing the I/O elevator (scheduler) won't affect anything here.

Ubuntu – task blocked for more than 120 seconds

You can install a new kernel upgrade on an Ubuntu instance on EC2 using:

sudo apt-get update &&
sudo apt-get dist-upgrade -y

Then, if all went well:

sudo reboot

The most recent released kernel for Ubuntu 10.10 on EC2 is

2.6.35-31-virtual
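
After the reboot, it may be worth confirming that the new kernel is actually the one running. A rough sketch, assuming GNU `sort` with version sorting (`-V`) is available; `kernel_at_least` is a made-up helper name, and the comparison is only as good as `sort -V` ordering of release strings:

```shell
# Succeed if the kernel release is at least the wanted version.
# $1 = wanted version, $2 = release to check (defaults to `uname -r`).
kernel_at_least() {
    want="$1"
    have="${2:-$(uname -r)}"
    # With version sorting, the smaller of the two comes first.
    lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

if kernel_at_least "2.6.35-31-virtual"; then
    echo "kernel $(uname -r) is new enough"
else
    echo "kernel $(uname -r) is older than expected"
fi
```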

You might also consider upgrading to a more recent version of Ubuntu which comes with an even newer kernel. For example, Ubuntu 11.10 Oneiric currently runs this kernel after dist-upgrade:

3.0.0-14-virtual

Ubuntu 10.10 Maverick, which you are running, reaches its end of life in a few months (April 2012), so you'll need to upgrade to a newer release anyways.

Here's a bug report that may relate to the problem you're seeing:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/666211
