Linux – Sporadic unspecific kernel panic

I'm experiencing seldom (so far about once a month) hard crashes on our ubuntu server 10.04 LTS box. The box itself is quite old (Dell PowerEdge 750 from 2004, Pentium4 2.8 GHz). I set up netconsole after it crashed twice last thursday and was able to extract the following output:

[ 9354.062473] invalid opcode: 0000 [#1] SMP
[ 9354.062516] last sysfs file: /sys/devices/pci0000:00/0000:00:1d.0/usb2/2-2/2-2:1.0/uevent
[ 9354.062555] Modules linked in: ppdev adm1026 hwmon_vid i2c_i801 bridge stp dcdbas psmouse serio_raw netconsole configfs shpchp lp parport usbhid hid e1000
[ 9354.062685]
[ 9354.062704] Pid: 3988, comm: rsync Not tainted 2.6.38-12-generic-pae #51~lucid1-Ubuntu Dell Computer Corporation PowerEdge 750              /0R1479
[ 9354.062773] EIP: 0060:[<c104fef1>] EFLAGS: 00010046 CPU: 1
[ 9354.062802] EIP is at check_preempt_wakeup+0x181/0x250
[ 9354.062826] EAX: 00000002 EBX: f2a10ccc ECX: 00000000 EDX: 00000002
[ 9354.062850] ESI: f1db71cc EDI: f1db71a0 EBP: f1dbdea8 ESP: f1dbde8c
[ 9354.062875]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 9354.062900] Process rsync (pid: 3988, ti=f1dbc000 task=f1db71a0 task.ti=f1dbc000)
[ 9354.062933] Stack:
[ 9354.062951]  0053ea60 f7907680 f28da840 f2a10ca0 c153ea60 f7907680 c153ea60 f1dbdebc
[ 9354.063019]  c103f98a f2a10ca0 f7907680 00000001 f1dbdef8 c104f97f 00000000 f2f0bacc
[ 9354.063088]  f7904338 00000001 00000003 00000000 f2f0bacc 00000001 00000001 00000086
[ 9354.063157] Call Trace:
[ 9354.063183]  [<c103f98a>] check_preempt_curr+0x6a/0x80
[ 9354.063210]  [<c104f97f>] try_to_wake_up+0x5f/0x3f0
[ 9354.063236]  [<c1077a00>] ? hrtimer_wakeup+0x0/0x30
[ 9354.063261]  [<c104fd64>] wake_up_process+0x14/0x20
[ 9354.063286]  [<c1077a1d>] hrtimer_wakeup+0x1d/0x30
[ 9354.063310]  [<c1077f4a>] __run_hrtimer+0x7a/0x1c0
[ 9354.063336]  [<c107dbad>] ? ktime_get+0x6d/0x110
[ 9354.063360]  [<c1078310>] hrtimer_interrupt+0x120/0x2b0
[ 9354.063390]  [<c1535c36>] smp_apic_timer_interrupt+0x56/0x8a
[ 9354.063418]  [<c152f459>] apic_timer_interrupt+0x31/0x38
[ 9354.063446]  [<c1520000>] ? mca_attach_bus+0x5/0xc0
[ 9354.063469] Code: 8b 9b 20 01 00 00 8b 86 24 01 00 00 3b 83 24 01 00 00 75 e6 85 db 0f 84 a3 00 00 00 89 da 89 f0 e8 75 f6 fe ff 83 f8 01 0f 85 00 <fe> ff ff 89 f8 e8 95 f9 fe ff 8b 5e 1c 85 db 0f 84 e4 fe ff ff
[ 9354.063804] EIP: [<c104fef1>] check_preempt_wakeup+0x181/0x250 SS:ESP 0068:f1dbde8c
[ 9354.064231] ---[ end trace 290689cea65aea7f ]---
[ 9354.064290] Kernel panic - not syncing: Fatal exception in interrupt
[ 9354.064352] Pid: 3988, comm: rsync Tainted: G      D     2.6.38-12-generic-pae #51~lucid1-Ubuntu
[ 9354.064424] Call Trace:
[ 9354.064481]  [<c152c057>] ? panic+0x5c/0x15b
[ 9354.064539]  [<c15302bd>] ? oops_end+0xcd/0xd0
[ 9354.064539]  [<c100d9e4>] ? die+0x54/0x80
[ 9354.064539]  [<c152f926>] ? do_trap+0x96/0xc0
[ 9354.064539]  [<c100ba00>] ? do_invalid_op+0x0/0xa0
[ 9354.064539]  [<c100ba8b>] ? do_invalid_op+0x8b/0xa0
[ 9354.064539]  [<c104fef1>] ? check_preempt_wakeup+0x181/0x250
[ 9354.064539]  [<c144884d>] ? __kfree_skb+0x3d/0x90
[ 9354.064539]  [<c1042ae7>] ? update_curr+0x247/0x2a0
[ 9354.064539]  [<c10447bb>] ? update_cfs_load+0x11b/0x2d0
[ 9354.064539]  [<c1042a25>] ? update_curr+0x185/0x2a0
[ 9354.064539]  [<c152f6bf>] ? error_code+0x67/0x6c
[ 9354.064539]  [<c104fef1>] ? check_preempt_wakeup+0x181/0x250
[ 9354.064539]  [<c103f98a>] ? check_preempt_curr+0x6a/0x80
[ 9354.064539]  [<c104f97f>] ? try_to_wake_up+0x5f/0x3f0
[ 9354.064539]  [<c1077a00>] ? hrtimer_wakeup+0x0/0x30
[ 9354.064539]  [<c104fd64>] ? wake_up_process+0x14/0x20
[ 9354.064539]  [<c1077a1d>] ? hrtimer_wakeup+0x1d/0x30
[ 9354.064539]  [<c1077f4a>] ? __run_hrtimer+0x7a/0x1c0
[ 9354.064539]  [<c107dbad>] ? ktime_get+0x6d/0x110
[ 9354.064539]  [<c1078310>] ? hrtimer_interrupt+0x120/0x2b0
[ 9354.064539]  [<c1535c36>] ? smp_apic_timer_interrupt+0x56/0x8a
[ 9354.064539]  [<c152f459>] ? apic_timer_interrupt+0x31/0x38
[ 9354.064539]  [<c1520000>] ? mca_attach_bus+0x5/0xc0

Googling for this issue didn't really turn up anything useful (most stuff I found was related to btrfs, but I don't use that, although the module exists and is sometimes loaded). From experience it might have to do with relatively heavy I/O, as two of the panics happened during a backup procedure.

Kernel is 2.6.38-12-generic-pae, but I'm pretty sure I also saw panics on 2.6.32. I meanwhile upgraded to 3.0.0-17-generic-pae and am waiting for the next crash 😉

I'm at a loss here, so any pointers where to look for the cause or what it could be would be great 🙂 Thanks !

Update:
With the bonnie++ recommendation by @smarthall , I was able to generate another crash, this time a kernel BUG. I ran 7 instances of bonnie++. Here is the output:

[276814.620140] BUG: unable to handle kernel NULL pointer dereference at 00000259
[276814.620223] IP: [<c1553642>] __schedule+0x312/0x7f0
[276814.620274] *pdpt = 00000000333c4001 *pde = 0000000000000000 
[276814.620308] Oops: 0002 [#1] SMP 
[276814.620342] Modules linked in: ppdev adm1026 hwmon_vid i2c_i801 bridge netconsole stp dcdbas configfs psmouse lp shpchp serio_raw parport usbhid hid e1000
[276814.620490] 
[276814.620510] Pid: 8920, comm: bonnie++ Not tainted 3.0.0-17-generic-pae #30~lucid1-Ubuntu Dell Computer Corporation PowerEdge 750              /0R1479
[276814.620573] EIP: 0060:[<c1553642>] EFLAGS: 00010003 CPU: 1
[276814.620595] EIP is at __schedule+0x312/0x7f0
[276814.620614] EAX: f32da640 EBX: 000000f1 ECX: f44a8000 EDX: f44a8000
[276814.620633] ESI: f3a38400 EDI: 00000001 EBP: f39ddc04 ESP: f39ddb70
[276814.620653]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[276814.620673] Process bonnie++ (pid: 8920, ti=f39dc000 task=f32da640 task.ti=f39dc000)
[276814.620701] Stack:
[276814.620716]  00000002 f39ddbf4 00000001 f39ddbb8 f3d82640 00000000 c18aad00 c18aad00
[276814.620774]  c18aad00 c18aad00 f70420e2 e72ae300 c18aad00 f7906d00 f32da640 f44a8000
[276814.620833]  02000021 f39ddbbc c102ecb8 f39ddbd0 c15559ef c18e4640 f39ddc2c f39ddbf0
[276814.620893] Call Trace:
[276814.620917]  [<c102ecb8>] ? default_spin_lock_flags+0x8/0x10
[276814.620939]  [<c15559ef>] ? _raw_spin_lock_irqsave+0x2f/0x50
[276814.620962]  [<c105e64c>] ? lock_timer_base+0x2c/0x60
[276814.620983]  [<c102ecb8>] ? default_spin_lock_flags+0x8/0x10
[276814.621004]  [<c15559ef>] ? _raw_spin_lock_irqsave+0x2f/0x50
[276814.621026]  [<c1553e15>] schedule+0x35/0x50
[276814.621046]  [<c1554194>] schedule_timeout+0x134/0x260
[276814.621067]  [<c102ecb8>] ? default_spin_lock_flags+0x8/0x10
[276814.621090]  [<c105e680>] ? lock_timer_base+0x60/0x60
[276814.621111]  [<c15532e4>] io_schedule_timeout+0x84/0xd0
[276814.621134]  [<c10f2801>] balance_dirty_pages+0xe1/0x410
[276814.621163]  [<c115a2af>] ? __mark_inode_dirty+0x2f/0x220
[276814.621191]  [<c10f2b8e>] balance_dirty_pages_ratelimited_nr+0x5e/0x60
[276814.621222]  [<c10e9c88>] generic_perform_write+0x148/0x1b0
[276814.621251]  [<c10e9d41>] generic_file_buffered_write+0x51/0x80
[276814.621279]  [<c10ec5d4>] __generic_file_aio_write+0x1f4/0x540
[276814.621308]  [<c1075d51>] ? sched_clock_cpu+0x131/0x190
[276814.621332]  [<c1075b42>] ? sched_clock_local+0xb2/0x190
[276814.621360]  [<c10ec98a>] generic_file_aio_write+0x6a/0xd0
[276814.621387]  [<c11bbfa4>] ext4_file_write+0x54/0x290
[276814.621415]  [<c10736d2>] ? __run_hrtimer+0xa2/0x1a0
[276814.621441]  [<c1073af9>] ? hrtimer_interrupt+0x1a9/0x2b0
[276814.621469]  [<c1137ba4>] do_sync_write+0xa4/0xe0
[276814.621500]  [<c1137d28>] ? rw_verify_area+0x68/0x120
[276814.621526]  [<c1138152>] vfs_write+0xa2/0x170
[276814.621552]  [<c1137b00>] ? do_sync_readv_writev+0xe0/0xe0
[276814.621580]  [<c129ba39>] ? copy_to_user+0x39/0x130
[276814.621606]  [<c11382f2>] sys_write+0x42/0x70
[276814.621628]  [<c1555de4>] syscall_call+0x7/0xb
[276814.621654]  [<c1550000>] ? console_cpu_notify+0x24/0x29
[276814.621679] Code: 00 00 85 db 0f 84 ef 01 00 00 64 8b 3d 20 6e 8a c1 39 f3 0f 84 fa 03 00 00 64 c7 05 c4 ac 8a c1 01 00 00 00 64 89 1d c0 ac 8a c1 <f0> 0f ab bb 68 01 00 00 8b 43 28 05 00 00 00 40 0f 22 d8 8d 76 
[276814.622005] EIP: [<c1553642>] __schedule+0x312/0x7f0 SS:ESP 0068:f39ddb70
[276814.622039] CR2: 0000000000000259
[276814.622522] ---[ end trace c41db21a61fb5af0 ]---
[276814.622787] BUG: unable to handle kernel NULL pointer dereference at 00000259
[276814.622966] IP: [<c1034ecb>] leave_mm+0x1b/0x40
[276814.623084] *pdpt = 0000000000000000 *pde = f000adfef000adfe 
[276814.623248] Oops: 0002 [#2] SMP 
[276814.623404] Modules linked in: ppdev adm1026 hwmon_vid i2c_i801 bridge netconsole stp dcdbas configfs psmouse lp shpchp serio_raw parport usbhid hid e1000
[276814.624015] 
[276814.624015] Pid: 8920, comm: bonnie++ Tainted: G      D     3.0.0-17-generic-pae #30~lucid1-Ubuntu Dell Computer Corporation PowerEdge 750              /0R1479
[276814.624015] EIP: 0060:[<c1034ecb>] EFLAGS: 00010202 CPU: 1
[276814.624015] EIP is at leave_mm+0x1b/0x40
[276814.624015] EAX: 00000001 EBX: f3a38400 ECX: bfe84000 EDX: 000000f1
[276814.624015] ESI: f39dd964 EDI: f39dd964 EBP: f39dd918 ESP: f39dd918
[276814.624015]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[276814.624015] Process bonnie++ (pid: 8920, ti=f39dc000 task=f32da640 task.ti=f39dc000)
[276814.624015] Stack:
[276814.624015]  f39dd928 c103503b f39737e8 f39dd964 f39dd93c c1108c8d f39737e8 f39dd964
[276814.624015]  f39dd964 f39dd94c c11090f1 f39737e8 f3a38400 f39dd9b0 c1110d1e 00000000
[276814.624015]  f39dd9a0 00000000 bfe84000 f3a38400 f39dd9ec 00000001 f14be000 f14be000
[276814.624015] Call Trace:
[276814.624015]  [<c103503b>] flush_tlb_mm+0x7b/0x80
[276814.624015]  [<c1108c8d>] tlb_flush_mmu+0x3d/0x70
[276814.624015]  [<c11090f1>] tlb_finish_mmu+0x11/0x40
[276814.624015]  [<c1110d1e>] exit_mmap+0xbe/0x120
[276814.624015]  [<c104e71c>] mmput+0x4c/0xf0
[276814.624015]  [<c1054490>] exit_mm+0x100/0x130
[276814.624015]  [<c1555a28>] ? _raw_spin_lock_irq+0x18/0x20
[276814.624015]  [<c10546d9>] do_exit+0x139/0x380
[276814.624015]  [<c1553139>] ? printk+0x2d/0x34
[276814.624015]  [<c1557095>] oops_end+0x95/0xd0
[276814.624015]  [<c102fb04>] no_context+0xc4/0xe0
[276814.624015]  [<c102fbb8>] __bad_area_nosemaphore+0x98/0x140
[276814.624015]  [<c102fcc0>] bad_area+0x40/0x50
[276814.624015]  [<c15592ee>] do_page_fault+0x3de/0x430
[276814.624015]  [<c10494a4>] ? load_balance+0x74/0x400
[276814.624015]  [<c1558f10>] ? spurious_fault+0x130/0x130
[276814.624015]  [<c15564cf>] error_code+0x67/0x6c
[276814.624015]  [<c1553642>] ? __schedule+0x312/0x7f0
[276814.624015]  [<c102ecb8>] ? default_spin_lock_flags+0x8/0x10
[276814.624015]  [<c15559ef>] ? _raw_spin_lock_irqsave+0x2f/0x50
[276814.624015]  [<c105e64c>] ? lock_timer_base+0x2c/0x60
[276814.624015]  [<c102ecb8>] ? default_spin_lock_flags+0x8/0x10
[276814.624015]  [<c15559ef>] ? _raw_spin_lock_irqsave+0x2f/0x50
[276814.624015]  [<c1553e15>] schedule+0x35/0x50
[276814.624015]  [<c1554194>] schedule_timeout+0x134/0x260
[276814.624015]  [<c102ecb8>] ? default_spin_lock_flags+0x8/0x10
[276814.624015]  [<c105e680>] ? lock_timer_base+0x60/0x60
[276814.624015]  [<c15532e4>] io_schedule_timeout+0x84/0xd0
[276814.624015]  [<c10f2801>] balance_dirty_pages+0xe1/0x410
[276814.624015]  [<c115a2af>] ? __mark_inode_dirty+0x2f/0x220
[276814.624015]  [<c10f2b8e>] balance_dirty_pages_ratelimited_nr+0x5e/0x60
[276814.624015]  [<c10e9c88>] generic_perform_write+0x148/0x1b0
[276814.624015]  [<c10e9d41>] generic_file_buffered_write+0x51/0x80
[276814.624015]  [<c10ec5d4>] __generic_file_aio_write+0x1f4/0x540
[276814.624015]  [<c1075d51>] ? sched_clock_cpu+0x131/0x190
[276814.624015]  [<c1075b42>] ? sched_clock_local+0xb2/0x190
[276814.624015]  [<c10ec98a>] generic_file_aio_write+0x6a/0xd0
[276814.624015]  [<c11bbfa4>] ext4_file_write+0x54/0x290
[276814.624015]  [<c10736d2>] ? __run_hrtimer+0xa2/0x1a0
[276814.624015]  [<c1073af9>] ? hrtimer_interrupt+0x1a9/0x2b0
[276814.624015]  [<c1137ba4>] do_sync_write+0xa4/0xe0
[276814.624015]  [<c1137d28>] ? rw_verify_area+0x68/0x120
[276814.624015]  [<c1138152>] vfs_write+0xa2/0x170
[276814.624015]  [<c1137b00>] ? do_sync_readv_writev+0xe0/0xe0
[276814.624015]  [<c129ba39>] ? copy_to_user+0x39/0x130
[276814.624015]  [<c11382f2>] sys_write+0x42/0x70
[276814.624015]  [<c1555de4>] syscall_call+0x7/0xb
[276814.624015]  [<c1550000>] ? console_cpu_notify+0x24/0x29
[276814.624015] Code: eb 90 8d b4 26 00 00 00 00 8d bc 27 00 00 00 00 55 89 e5 3e 8d 74 26 00 64 8b 15 c4 ac 8a c1 83 fa 01 74 1c 64 8b 15 c0 ac 8a c1 <f0> 0f b3 82 68 01 00 00 b8 00 50 8b 01 0f 22 d8 8d 76 00 5d c3 
[276814.624015] EIP: [<c1034ecb>] leave_mm+0x1b/0x40 SS:ESP 0068:f39dd918
[276814.624015] CR2: 0000000000000259
[276875.684006] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=15002 jiffies)

Again the D flag is in there, but I definitely had no oops coming through netconsole between bootup and this BUG.

Update 2:

Unfortunately the crash / oops / bug is not reproducible (as I hoped). For the last two days I ran bonnie++ extensively, but the machine didn't crash again. I just got a few (ca 10) messages like this one while bonnie++ ran:

[25560.303064] INFO: task postgres:980 blocked for more than 120 seconds.
[25560.303084] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[25560.303115] postgres        D f315fca4     0   980    888 0x00000000
[25560.303144]  f315fcb4 00000086 00000002 f315fca4 06300066 f4023f50 06300066 00000000
[25560.303203]  c18aad00 c18aad00 c18aad00 c18aad00 f315fc50 c10f47da c18aad00 f7906d00
[25560.303262]  f320d940 f44a8000 f4023e80 f4023f9c f5826cc0 dbfe2c40 f4023e00 00000000
[25560.303336] Call Trace:
[25560.303356]  [<c10f47da>] ? lru_cache_add_lru+0x2a/0x50
[25560.303378]  [<c1160484>] ? __find_get_block+0x84/0x1d0
[25560.303401]  [<c102ecb8>] ? default_spin_lock_flags+0x8/0x10
[25560.303424]  [<c15559ef>] ? _raw_spin_lock_irqsave+0x2f/0x50
[25560.303447]  [<c1553e15>] schedule+0x35/0x50
[25560.303468]  [<c1200805>] do_get_write_access+0x215/0x3e0
[25560.303491]  [<c106f4f0>] ? autoremove_wake_function+0x50/0x50
[25560.303514]  [<c1200ae8>] jbd2_journal_get_write_access+0x28/0x40
[25560.303547]  [<c11e847f>] __ext4_journal_get_write_access+0x2f/0x70
[25560.303570]  [<c11c23a7>] ext4_reserve_inode_write+0x67/0x90
[25560.304894]  [<c11c2416>] ext4_mark_inode_dirty+0x46/0x1e0
[25560.304923]  [<c11d9ae1>] ? ext4_journal_start_sb+0x51/0x120
[25560.304950]  [<c11c26d1>] ? ext4_dirty_inode+0x31/0x50
[25560.304977]  [<c11c26d1>] ext4_dirty_inode+0x31/0x50
[25560.305002]  [<c115a2af>] __mark_inode_dirty+0x2f/0x220
[25560.305029]  [<c1035e48>] ? __kunmap_atomic+0x88/0x90
[25560.305066]  [<c114e4c8>] touch_atime+0xf8/0x140
[25560.305094]  [<c10ebf1a>] T.1028+0x40a/0x4a0
[25560.305120]  [<c10ec07b>] generic_file_aio_read+0xcb/0x2c0
[25560.305149]  [<c1137c84>] do_sync_read+0xa4/0xe0
[25560.305175]  [<c1137d28>] ? rw_verify_area+0x68/0x120
[25560.305201]  [<c11383bf>] vfs_read+0x9f/0x170
[25560.305225]  [<c1137be0>] ? do_sync_write+0xe0/0xe0
[25560.305250]  [<c1138562>] sys_read+0x42/0x70
[25560.305277]  [<c1555de4>] syscall_call+0x7/0xb here

I'm unsure if those messages are indicating anything.

I also ran memtest86+ for a night (11 passes) without a single error.

The most interesting occurence was a "rm" process stuck in "D" mode (uninterruptable sleep) yesterday. I noticed bonnie++ had left a few files on the disk after completing (about 5 6GB files and 5 empty files) and tried to delete those. It deleted 8 of 10 files, but then rm went into uninterruptible sleep and wasn't killable at all (which is intended afaiu). I left it there for quite a while (ca 2h) before rebooting – the logs didn't reveal anything unusual.

Probably I should mention that the disk setup is two HDDs below a simple LVM setup (boot on a non-LVM primary partition on the first disk, root and swap on LVM). The first disk is 120GB, the second 1TB. The smart tests also finished without error so far (the extended test is still running on the first disk).

Best Answer

The error seems to be from an 'invalid opcode' of 0000 which means the kernel tried to jump to execute memory that was zero'd. You kernel is tainted with 'D' so this means you've had oopses recently, you should look for these in /var/log/messages and elsewhere as they may be indicators the actual root cause.

I would say that you are likely right about it happening on high IO as it happened while the rsync process was on the CPU. That doesn't make it certain to be IO related, but it does raise eyebrows.

Given the age of the machine you could be facing a hardware problem. Have you tried running memtest on the machine? Or running an IO benchmark (like bonnie++) to see if that causes the crashes to happen?

Best Answer

Related Solutions

Kernel Panic – How to Automatically Reboot After Kernel Panic

Linux – SATA drive problems with two SIL RAID cards

Related Topic