Linux – Kernel oops on Linux running in VirtualBox breaks some IO-related functionality on the server

kernellinuxserver-crashesvirtualbox

We are having problems with CentOS release 6.3 running in VirtualBox on Windows 7 machines. The symptoms are the following:

Everything works as normal for several hours, even days. Then something happens which breaks the system.

What we still can do after this something happens:

  • Access the web server
  • Use existing SSH sessions to run top and free

What does not work:

  • Starting new SSH sessions (hangs after username and password is entered)
  • Running ls in existing SSH sessions (hangs)
  • SSI includes from our web servers that fetch data from remote machines
  • probably more

What we see on the server when this something happens is the following:

  • Load average go from basically nothing to around 3
  • CPU usage is still low (5%)
  • Disk activity is low (running iostat)
  • Plenty of memory available
  • Plenty of disk space available

In /var/log/messages we get the following:

Jun 14 01:10:48 devvm kernel: e1000 0000:00:03.0: eth0: Detected Tx Unit Hang
Jun 14 01:10:48 devvm kernel:  Tx Queue             <0>
Jun 14 01:10:48 devvm kernel:  TDH                  <2e>
Jun 14 01:10:48 devvm kernel:  TDT                  <30>
Jun 14 01:10:48 devvm kernel:  next_to_use          <30>
Jun 14 01:10:48 devvm kernel:  next_to_clean        <2e>
Jun 14 01:10:48 devvm kernel: buffer_info[next_to_clean]
Jun 14 01:10:48 devvm kernel:  time_stamp           <1038284db>
Jun 14 01:10:48 devvm kernel:  next_to_watch        <2f>
Jun 14 01:10:48 devvm kernel:  jiffies              <103828b42>
Jun 14 01:10:48 devvm kernel:  next_to_watch.status <0>
Jun 14 01:10:50 devvm kernel: e1000 0000:00:03.0: eth0: Detected Tx Unit Hang
Jun 14 01:10:50 devvm kernel:  Tx Queue             <0>
Jun 14 01:10:50 devvm kernel:  TDH                  <2e>
Jun 14 01:10:50 devvm kernel:  TDT                  <30>
Jun 14 01:10:50 devvm kernel:  next_to_use          <30>
Jun 14 01:10:50 devvm kernel:  next_to_clean        <2e>
Jun 14 01:10:50 devvm kernel: buffer_info[next_to_clean]
Jun 14 01:10:50 devvm kernel:  time_stamp           <1038284db>
Jun 14 01:10:50 devvm kernel:  next_to_watch        <2f>
Jun 14 01:10:50 devvm kernel:  jiffies              <103829312>
Jun 14 01:10:50 devvm kernel:  next_to_watch.status <0>
Jun 14 01:10:52 devvm kernel: ------------[ cut here ]------------
Jun 14 01:10:52 devvm kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Jun 14 01:10:52 devvm kernel: Hardware name: VirtualBox
Jun 14 01:10:52 devvm kernel: NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
Jun 14 01:10:52 devvm kernel: Modules linked in: vboxsf(U) ipv6 ppdev parport_pc parport microcode sg vboxguest(U) i2c_piix4 i2c_core e1000 snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc pcnet32 mii ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jun 14 01:10:52 devvm kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1
Jun 14 01:10:52 devvm kernel: Call Trace:
Jun 14 01:10:52 devvm kernel: <IRQ>  [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
Jun 14 01:10:52 devvm kernel: [<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50
Jun 14 01:10:52 devvm kernel: [<ffffffff814595fd>] ? dev_watchdog+0x26d/0x280
Jun 14 01:10:52 devvm kernel: [<ffffffff81099138>] ? sched_clock_cpu+0xb8/0x110
Jun 14 01:10:52 devvm kernel: [<ffffffff81459390>] ? dev_watchdog+0x0/0x280
Jun 14 01:10:52 devvm kernel: [<ffffffff8107e897>] ? run_timer_softirq+0x197/0x340
Jun 14 01:10:52 devvm kernel: [<ffffffff810a21c0>] ? tick_sched_timer+0x0/0xc0
Jun 14 01:10:52 devvm kernel: [<ffffffff8102b40d>] ? lapic_next_event+0x1d/0x30
Jun 14 01:10:52 devvm kernel: [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
Jun 14 01:10:52 devvm kernel: [<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250
Jun 14 01:10:52 devvm kernel: [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
Jun 14 01:10:52 devvm kernel: [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
Jun 14 01:10:52 devvm kernel: [<ffffffff81073ca5>] ? irq_exit+0x85/0x90
Jun 14 01:10:52 devvm kernel: [<ffffffff81505be0>] ? smp_apic_timer_interrupt+0x70/0x9b
Jun 14 01:10:52 devvm kernel: [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
Jun 14 01:10:52 devvm kernel: <EOI>  [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10
Jun 14 01:10:52 devvm kernel: [<ffffffff810149cd>] ? default_idle+0x4d/0xb0
Jun 14 01:10:52 devvm kernel: [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
Jun 14 01:10:52 devvm kernel: [<ffffffff814e433a>] ? rest_init+0x7a/0x80
Jun 14 01:10:52 devvm kernel: [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430
Jun 14 01:10:52 devvm kernel: [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
Jun 14 01:10:52 devvm kernel: [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109
Jun 14 01:10:52 devvm kernel: ---[ end trace 2c7bb984812cf120 ]---
Jun 14 01:10:52 devvm kernel: e1000 0000:00:03.0: eth0: Reset adapter
Jun 14 01:10:53 devvm abrtd: Directory 'oops-2013-06-14-01:10:53-1537-0' creation detected
Jun 14 01:10:53 devvm abrt-dump-oops: Reported 1 kernel oopses to Abrt
Jun 14 01:10:53 devvm abrtd: Can't open file '/var/spool/abrt/oops-2013-06-14-01:10:53-1537-0/uid': No such file or directory
Jun 14 01:10:55 devvm kernel: Bridge firewalling registered

After this we see for a while, every two minutes:

Jun 14 01:14:22 devvm kernel: INFO: task events/0:19 blocked for more than 120 seconds.
Jun 14 01:14:22 devvm kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 14 01:14:22 devvm kernel: events/0      D 0000000000000000     0    19      2 0x00000000
Jun 14 01:14:22 devvm kernel: ffff880116c4fb90 0000000000000046 00000000ffffffff 0000000000000008
Jun 14 01:14:22 devvm kernel: 0000000000016680 0000000000016680 ffff880028210400 0000000000016680
Jun 14 01:14:22 devvm kernel: ffff880116c4daf8 ffff880116c4ffd8 000000000000fb88 ffff880116c4daf8
Jun 14 01:14:22 devvm kernel: Call Trace:
Jun 14 01:14:22 devvm kernel: [<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe6a5>] schedule_timeout+0x215/0x2e0
Jun 14 01:14:22 devvm kernel: [<ffffffff8100975d>] ? __switch_to+0x13d/0x320
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe323>] wait_for_common+0x123/0x180
Jun 14 01:14:22 devvm kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe43d>] wait_for_completion+0x1d/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8108d093>] __cancel_work_timer+0x1b3/0x1e0
Jun 14 01:14:22 devvm kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8108d0f0>] cancel_work_sync+0x10/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffffa01c5ca5>] e1000_down_and_stop+0x25/0x50 [e1000]
Jun 14 01:14:22 devvm kernel: [<ffffffffa01cb695>] e1000_down+0x155/0x200 [e1000]
Jun 14 01:14:22 devvm kernel: [<ffffffffa01cbcb0>] ? e1000_reset_task+0x0/0xe0 [e1000]
Jun 14 01:14:22 devvm kernel: [<ffffffffa01cbd1e>] e1000_reset_task+0x6e/0xe0 [e1000]
Jun 14 01:14:22 devvm kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0
Jun 14 01:14:22 devvm kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Jun 14 01:14:22 devvm kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
Jun 14 01:14:22 devvm kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
Jun 14 01:14:22 devvm kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Jun 14 01:14:22 devvm kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Jun 14 01:14:22 devvm kernel: INFO: task parted:8069 blocked for more than 120 seconds.
Jun 14 01:14:22 devvm kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 14 01:14:22 devvm kernel: parted        D 0000000000000003     0  8069   7994 0x00000080
Jun 14 01:14:22 devvm kernel: ffff8800908b3bb8 0000000000000082 0000000000000000 ffff88010ab50080
Jun 14 01:14:22 devvm kernel: ffff880116c7d500 0000000000000001 0000000000000000 0000000000000000
Jun 14 01:14:22 devvm kernel: ffff88010ab50638 ffff8800908b3fd8 000000000000fb88 ffff88010ab50638
Jun 14 01:14:22 devvm kernel: Call Trace:
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe6a5>] schedule_timeout+0x215/0x2e0
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe323>] wait_for_common+0x123/0x180
Jun 14 01:14:22 devvm kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8112b6d0>] ? lru_add_drain_per_cpu+0x0/0x10
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe43d>] wait_for_completion+0x1d/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8108d177>] flush_work+0x77/0xc0
Jun 14 01:14:22 devvm kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8108d2f3>] schedule_on_each_cpu+0x133/0x180
Jun 14 01:14:22 devvm kernel: [<ffffffff811ad440>] ? invalidate_bh_lru+0x0/0x50
Jun 14 01:14:22 devvm kernel: [<ffffffff8112ae35>] lru_add_drain_all+0x15/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff811adf6a>] invalidate_bdev+0x2a/0x50
Jun 14 01:14:22 devvm kernel: [<ffffffff8125e9a4>] blkdev_ioctl+0x3b4/0x6e0
Jun 14 01:14:22 devvm kernel: [<ffffffff811b381c>] block_ioctl+0x3c/0x40
Jun 14 01:14:22 devvm kernel: [<ffffffff8118dec2>] vfs_ioctl+0x22/0xa0
Jun 14 01:14:22 devvm kernel: [<ffffffff8118e064>] do_vfs_ioctl+0x84/0x580
Jun 14 01:14:22 devvm kernel: [<ffffffff8118e5e1>] sys_ioctl+0x81/0xa0
Jun 14 01:14:22 devvm kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

In /var/spool/abrt/oops-2013-06-14-01:10:53-1537-0 we can see the following information:

In backtrace:

WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: VirtualBox
NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
Modules linked in: vboxsf(U) ipv6 ppdev parport_pc parport microcode sg vboxguest(U) i2c_piix4 i2c_core e1000 snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc pcnet32 mii ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1
Call Trace:
<IRQ>  [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff814595fd>] ? dev_watchdog+0x26d/0x280
[<ffffffff81099138>] ? sched_clock_cpu+0xb8/0x110
[<ffffffff81459390>] ? dev_watchdog+0x0/0x280
[<ffffffff8107e897>] ? run_timer_softirq+0x197/0x340
[<ffffffff810a21c0>] ? tick_sched_timer+0x0/0xc0
[<ffffffff8102b40d>] ? lapic_next_event+0x1d/0x30
[<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
[<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81073ca5>] ? irq_exit+0x85/0x90
[<ffffffff81505be0>] ? smp_apic_timer_interrupt+0x70/0x9b
[<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
<EOI>  [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10
[<ffffffff810149cd>] ? default_idle+0x4d/0xb0
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e433a>] ? rest_init+0x7a/0x80
[<ffffffff81c21f7b>] ? start_kernel+0x424/0x430
[<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
[<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109

In cmdline:

ro root=/dev/mapper/vg_01-lv_root rd_NO_LUKS LANG=en_US.UTF-8  KEYBOARDTYPE=pc KEYTABLE=sv-latin1 rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=vg_01/lv_root crashkernel=129M@0M rhgb quiet rd_LVM_LV=vg_01/lv_swap rd_NO_DM rhgb quie

Additional information:

# uname -a
Linux devvm 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release
CentOS release 6.3 (Final)

VirtualBox version 4.2.6.

Any insight in how we can proceed with troubleshooting this is appreciated. If you need more information, just let me know.

Best Answer

Trying switching the NIC type. Could be a bug with one of the drivers.