Linux – debian 6 losing a large amount of packets

I have a rather strange problem. We have ruled out the obvious hardware-related causes (different NIC, Ethernet cable and switch), but I cannot stop the interfaces from dropping packets.

I have four servers, all exactly the same.

driver: e1000e
version: 1.2.20-k2
firmware-version: 1.8-0
bus-info: 0000:06:00.0

They are all running the latest kernel (2.6.32-5-amd64).

However, they all show this:

RX packets:17073870634 errors:0 dropped:14147208 overruns:0 frame:0

Another server:

eth0      Link encap:Ethernet  HWaddr e0:69:95:05:2f:cb
          inet addr:10.10.10.86  Bcast:10.10.10.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5455209277 errors:0 dropped:375445 overruns:0 frame:0
          TX packets:3666134366 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:6688414486673 (6.0 TiB)  TX bytes:1611812171539 (1.4 TiB)
          Interrupt:20 Memory:d0600000-d0620000

eth1      Link encap:Ethernet  HWaddr 00:1b:21:b7:7a:ce
          inet addr:10.10.0.86  Bcast:10.10.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:15473695728 errors:0 dropped:5808325 overruns:0 frame:0
          TX packets:20112364421 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:9192378766434 (8.3 TiB)  TX bytes:20216368266761 (18.3 TiB)
          Interrupt:17 Memory:d0280000-d02a0000

That is a massive number of dropped packets.
I have tried loading the latest driver, 1.9.5, but it made no difference.
I'm not sure what else to do.
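
To put the counters above in proportion, here is a minimal Python sketch that turns an ifconfig RX line into a drop percentage. The function name rx_drop_percent is made up for illustration, and the sketch assumes the classic ifconfig "RX packets:... dropped:..." layout shown in this post.

```python
import re

# Matches the RX counter line printed by ifconfig, e.g.
# "RX packets:5455209277 errors:0 dropped:375445 overruns:0 frame:0"
RX_LINE = re.compile(r"RX packets:(\d+)\s+errors:(\d+)\s+dropped:(\d+)")


def rx_drop_percent(ifconfig_line):
    """Return dropped RX packets as a percentage of the RX packet counter."""
    match = RX_LINE.search(ifconfig_line)
    if match is None:
        raise ValueError("not an ifconfig RX counter line")
    packets, _errors, dropped = (int(group) for group in match.groups())
    if packets == 0:
        return 0.0
    return 100.0 * dropped / packets


if __name__ == "__main__":
    # Counter lines copied from the ifconfig output in this post.
    samples = {
        "first server": "RX packets:17073870634 errors:0 dropped:14147208 overruns:0 frame:0",
        "eth0": "RX packets:5455209277 errors:0 dropped:375445 overruns:0 frame:0",
        "eth1": "RX packets:15473695728 errors:0 dropped:5808325 overruns:0 frame:0",
    }
    for name, line in samples.items():
        print(f"{name}: {rx_drop_percent(line):.4f}% of RX packets dropped")
```

Comparing the percentage across the four servers, or sampling it twice a few minutes apart, may show whether the drops track traffic volume or arrive in bursts.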

This is the error I get when I try to compile it:

make -C /lib/modules/2.6.32-5-amd64/build SUBDIRS=/root/e1000e-2.0.0/src modules
make[1]: Entering directory `/usr/src/linux-headers-2.6.32-5-amd64'
  CC [M]  /root/e1000e-2.0.0/src/netdev.o
/root/e1000e-2.0.0/src/netdev.c: In function ‘e1000_runtime_resume’:
/root/e1000e-2.0.0/src/netdev.c:6681: error: ‘struct dev_pm_info’ has no member named ‘runtime_auto’
/root/e1000e-2.0.0/src/netdev.c: At top level:
/root/e1000e-2.0.0/src/netdev.c:7605: error: implicit declaration of function ‘SET_RUNTIME_PM_OPS’
/root/e1000e-2.0.0/src/netdev.c:7607: error: initializer element is not constant
/root/e1000e-2.0.0/src/netdev.c:7607: error: (near initialization for ‘e1000_pm_ops.suspend_noirq’)
make[4]: *** [/root/e1000e-2.0.0/src/netdev.o] Error 1
make[3]: *** [_module_/root/e1000e-2.0.0/src] Error 2
make[2]: *** [sub-make] Error 2
make[1]: *** [all] Error 2
make[1]: Leaving directory `/usr/src/linux-headers-2.6.32-5-amd64'
make: *** [default] Error 2

I had this problem before. I think a long time ago I removed something from netdev.c to get it to work, but that is not right at all. Why won't it compile?
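
Both compiler errors name identifiers (runtime_auto and SET_RUNTIME_PM_OPS) that the installed kernel headers apparently do not provide. The following Python sketch checks a header file for them. It assumes both names would live in include/linux/pm.h, as they do in newer kernels; the HEADER path and the helper name missing_symbols are illustrative, so adjust the path to wherever pm.h sits on your system.

```python
import re
from pathlib import Path

# Identifiers the compiler reported as missing while building netdev.c.
REQUIRED_SYMBOLS = ("runtime_auto", "SET_RUNTIME_PM_OPS")

# Hypothetical location; the real pm.h may live in a different headers package.
HEADER = Path("/lib/modules/2.6.32-5-amd64/build/include/linux/pm.h")


def missing_symbols(header_text, symbols=REQUIRED_SYMBOLS):
    """Return the symbols that never appear as whole words in header_text."""
    return [
        name
        for name in symbols
        if re.search(r"\b" + re.escape(name) + r"\b", header_text) is None
    ]


if __name__ == "__main__":
    if HEADER.is_file():
        missing = missing_symbols(HEADER.read_text(errors="replace"))
        print("missing:", ", ".join(missing) if missing else "none")
    else:
        print(f"{HEADER} not found; point HEADER at your kernel's pm.h")
```

If both names are reported missing, the driver source is probably expecting a newer power-management API than this kernel offers, which would explain why deleting code from netdev.c made it build.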

Edit (14/06/12):

It is still causing serious issues. Lately we have had very high throughput on the box, which causes the whole server to lock up. The only thing to do is pull the power and restart it. This is what I found in the logs:

Jun 13 02:25:53 fs1 kernel: [4805019.591934] WARNING: at /build/buildd-linux-2.6_2.6.32-41-amd64-ReqhZF/linux-2.6-2.6.32/debian/build/source_amd64_none/net/sched/sch_generic.c:261 dev_watchdog+0xe2/0x194()
Jun 13 02:25:53 fs1 kernel: [4805019.591938] Hardware name:
Jun 13 02:25:53 fs1 kernel: [4805019.591940] NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out
Jun 13 02:25:53 fs1 kernel: [4805019.591942] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs ext3 jbd ext2 joydev usbhid hid fuse hwmon_vid coretemp loop firewire_sbp2 snd_hda_codec_atihdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec radeon snd_hwdep ttm snd_pcm drm_kms_helper snd_timer snd drm i2c_i801 soundcore snd_page_alloc i2c_algo_bit psmouse i2c_core wmi evdev pcspkr serio_raw button processor ext4 mbcache jbd2 crc16 raid456 md_mod async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx sg sd_mod crc_t10dif xhci firewire_ohci ata_generic r8169 firewire_core mii uhci_hcd crc_itu_t ehci_hcd usbcore ahci e1000e libata nls_base scsi_mod thermal thermal_sys [last unloaded: scsi_wait_scan]
Jun 13 02:25:53 fs1 kernel: [4805019.592006] Pid: 0, comm: swapper Tainted: G   M       2.6.32-5-amd64 #1
Jun 13 02:25:53 fs1 kernel: [4805019.592008] Call Trace:
Jun 13 02:25:53 fs1 kernel: [4805019.592010]  <IRQ>  [<ffffffff81263046>] ? dev_watchdog+0xe2/0x194
Jun 13 02:25:53 fs1 kernel: [4805019.592017]  [<ffffffff81263046>] ? dev_watchdog+0xe2/0x194
Jun 13 02:25:53 fs1 kernel: [4805019.592022]  [<ffffffff8104df9c>] ? warn_slowpath_common+0x77/0xa3
Jun 13 02:25:53 fs1 kernel: [4805019.592027]  [<ffffffff81262f64>] ? dev_watchdog+0x0/0x194
Jun 13 02:25:53 fs1 kernel: [4805019.592030]  [<ffffffff8104e024>] ? warn_slowpath_fmt+0x51/0x59
Jun 13 02:25:53 fs1 kernel: [4805019.592038]  [<ffffffff8105a96c>] ? lock_timer_base+0x26/0x4b
Jun 13 02:25:53 fs1 kernel: [4805019.592041]  [<ffffffff81016872>] ? native_sched_clock+0x2e/0x78
Jun 13 02:25:53 fs1 kernel: [4805019.592043]  [<ffffffff8105af0e>] ? __mod_timer+0x141/0x153
Jun 13 02:25:53 fs1 kernel: [4805019.592045]  [<ffffffff81262f38>] ? netif_tx_lock+0x3d/0x69
Jun 13 02:25:53 fs1 kernel: [4805019.592048]  [<ffffffff8124dd63>] ? netdev_drivername+0x3b/0x40
Jun 13 02:25:53 fs1 kernel: [4805019.592051]  [<ffffffff81263046>] ? dev_watchdog+0xe2/0x194
Jun 13 02:25:53 fs1 kernel: [4805019.592053]  [<ffffffff8103fa2a>] ? __wake_up+0x30/0x44
Jun 13 02:25:53 fs1 kernel: [4805019.592055]  [<ffffffff8105a71b>] ? run_timer_softirq+0x1c9/0x268
Jun 13 02:25:53 fs1 kernel: [4805019.592058]  [<ffffffff8106c641>] ? ktime_get+0x5c/0xb7
Jun 13 02:25:53 fs1 kernel: [4805019.592060]  [<ffffffff81053dc7>] ? __do_softirq+0xdd/0x1a6
Jun 13 02:25:53 fs1 kernel: [4805019.592063]  [<ffffffff81011cac>] ? call_softirq+0x1c/0x30
Jun 13 02:25:53 fs1 kernel: [4805019.592065]  [<ffffffff8101322b>] ? do_softirq+0x3f/0x7c
Jun 13 02:25:53 fs1 kernel: [4805019.592067]  [<ffffffff81053c37>] ? irq_exit+0x36/0x76
Jun 13 02:25:53 fs1 kernel: [4805019.592068]  [<ffffffff81012922>] ? do_IRQ+0xa0/0xb6
Jun 13 02:25:53 fs1 kernel: [4805019.592070]  [<ffffffff810114d3>] ? ret_from_intr+0x0/0x11
Jun 13 02:25:53 fs1 kernel: [4805019.592071]  <EOI>  [<ffffffffa021a509>] ? acpi_idle_enter_bm+0x27d/0x2af [processor]
Jun 13 02:25:53 fs1 kernel: [4805019.592079]  [<ffffffffa021a509>] ? acpi_idle_enter_bm+0x27d/0x2af [processor]
Jun 13 02:25:53 fs1 kernel: [4805019.592082]  [<ffffffffa021a502>] ? acpi_idle_enter_bm+0x276/0x2af [processor]
Jun 13 02:25:53 fs1 kernel: [4805019.592085]  [<ffffffff8123a07a>] ? cpuidle_idle_call+0x94/0xee
Jun 13 02:25:53 fs1 kernel: [4805019.592088]  [<ffffffff8100fe97>] ? cpu_idle+0xa2/0xda
Jun 13 02:25:53 fs1 kernel: [4805019.592090]  [<ffffffff8151c140>] ? early_idt_handler+0x0/0x71
Jun 13 02:25:53 fs1 kernel: [4805019.592093]  [<ffffffff8151ccdd>] ? start_kernel+0x3dc/0x3e8
Jun 13 02:25:53 fs1 kernel: [4805019.592095]  [<ffffffff8151c3b7>] ? x86_64_start_kernel+0xf9/0x106
Jun 13 02:25:53 fs1 kernel: [4805019.592096] ---[ end trace 151ce5426d947b45 ]---
Jun 13 02:25:56 fs1 kernel: [4805022.772678] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

It also crashed last night, but this time it seems to have logged more. From reading it, it looks like a transmit queue issue or some driver issue that causes a backlog, ending in a complete crash.

Jun 13 21:41:51 fs1 kernel: [48337.715473] active_anon:1212343 inactive_anon:270073 isolated_anon:0
Jun 13 21:41:51 fs1 kernel: [48337.715474]  active_file:578 inactive_file:418 isolated_file:32
Jun 13 21:41:51 fs1 kernel: [48337.715475]  unevictable:0 dirty:0 writeback:0 unstable:0
Jun 13 21:41:51 fs1 kernel: [48337.715475]  free:9290 slab_reclaimable:3780 slab_unreclaimable:9943
Jun 13 21:41:51 fs1 kernel: [48337.715476]  mapped:1153 shmem:234 pagetables:8657 bounce:0
Jun 13 21:41:51 fs1 kernel: [48337.715477] Node 0 DMA free:15884kB min:24kB low:28kB high:36kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15316kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 13 21:41:51 fs1 kernel: [48337.715490] lowmem_reserve[]: 0 3245 6023 6023
Jun 13 21:41:51 fs1 kernel: [48337.715492] Node 0 DMA32 free:16300kB min:5344kB low:6680kB high:8016kB active_anon:2639724kB inactive_anon:527740kB active_file:584kB inactive_file:132kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3323400kB mlocked:0kB dirty:0kB writeback:0kB mapped:636kB shmem:8kB slab_reclaimable:5824kB slab_unreclaimable:6340kB kernel_stack:1328kB pagetables:15440kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1698 all_unreclaimable? yes
Jun 13 21:41:51 fs1 kernel: [48337.715499] lowmem_reserve[]: 0 0 2777 2777
Jun 13 21:41:51 fs1 kernel: [48337.715500] Node 0 Normal free:4408kB min:4572kB low:5712kB high:6856kB active_anon:2209648kB inactive_anon:552552kB active_file:1728kB inactive_file:1540kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2844160kB mlocked:0kB dirty:0kB writeback:0kB mapped:3976kB shmem:928kB slab_reclaimable:9296kB slab_unreclaimable:33432kB kernel_stack:2504kB pagetables:19188kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:5448 all_unreclaimable? yes
Jun 13 21:41:51 fs1 kernel: [48337.715507] lowmem_reserve[]: 0 0 0 0
Jun 13 21:41:51 fs1 kernel: [48337.715509] Node 0 DMA: 3*4kB 4*8kB 2*16kB 4*32kB 1*64kB 2*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15884kB
Jun 13 21:41:51 fs1 kernel: [48337.715514] Node 0 DMA32: 3047*4kB 0*8kB 2*16kB 8*32kB 9*64kB 5*128kB 1*256kB 2*512kB 1*1024kB 0*2048kB 0*4096kB = 15996kB
Jun 13 21:41:51 fs1 kernel: [48337.715519] Node 0 Normal: 214*4kB 2*8kB 1*16kB 16*32kB 15*64kB 8*128kB 4*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4408kB
Jun 13 21:41:51 fs1 kernel: [48337.715524] 4432 total pagecache pages
Jun 13 21:41:51 fs1 kernel: [48337.715525] 2842 pages in swap cache
Jun 13 21:41:51 fs1 kernel: [48337.715526] Swap cache stats: add 614649, delete 611807, find 1264/2278
Jun 13 21:41:51 fs1 kernel: [48337.715527] Free swap  = 16kB
Jun 13 21:41:51 fs1 kernel: [48337.715528] Total swap = 2421752kB
Jun 13 21:41:51 fs1 kernel: [48337.726519] 1572864 pages RAM
Jun 13 21:41:51 fs1 kernel: [48337.726520] 43546 pages reserved
Jun 13 21:41:51 fs1 kernel: [48337.726521] 4985 pages shared
Jun 13 21:41:51 fs1 kernel: [48337.726522] 1516369 pages non-shared
Jun 13 21:41:59 fs1 kernel: [48345.814202] glusterfs invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17
Jun 13 21:42:00 fs1 kernel: [48345.814205] glusterfs cpuset=/ mems_allowed=0
Jun 13 21:42:00 fs1 kernel: [48345.814207] Pid: 2210, comm: glusterfs Tainted: G   M    W  2.6.32-5-amd64 #1
Jun 13 21:42:00 fs1 kernel: [48345.814208] Call Trace:
Jun 13 21:42:00 fs1 kernel: [48345.814213]  [<ffffffff810b6460>] ? oom_kill_process+0x7f/0x23f
Jun 13 21:42:00 fs1 kernel: [48345.814215]  [<ffffffff810b6984>] ? __out_of_memory+0x12a/0x141
Jun 13 21:42:00 fs1 kernel: [48345.814217]  [<ffffffff810b6adb>] ? out_of_memory+0x140/0x172
Jun 13 21:42:00 fs1 kernel: [48345.814219]  [<ffffffff810ba840>] ? __alloc_pages_nodemask+0x4ec/0x5fc
Jun 13 21:42:00 fs1 kernel: [48345.814222]  [<ffffffff812fbb6a>] ? io_schedule+0x93/0xb7
Jun 13 21:42:00 fs1 kernel: [48345.814225]  [<ffffffff810bbda9>] ? __do_page_cache_readahead+0x9b/0x1b4
Jun 13 21:42:00 fs1 kernel: [48345.814227]  [<ffffffff81065070>] ? wake_bit_function+0x0/0x23
Jun 13 21:42:00 fs1 kernel: [48345.814229]  [<ffffffff810bbede>] ? ra_submit+0x1c/0x20
Jun 13 21:42:00 fs1 kernel: [48345.814231]  [<ffffffff810b4bab>] ? filemap_fault+0x17d/0x2f6
Jun 13 21:42:00 fs1 kernel: [48345.814234]  [<ffffffff810cab4a>] ? __do_fault+0x54/0x3c3
Jun 13 21:42:00 fs1 kernel: [48345.814236]  [<ffffffff810cce9e>] ? handle_mm_fault+0x3b8/0x80f
Jun 13 21:42:00 fs1 kernel: [48345.814239]  [<ffffffff812ff2e6>] ? do_page_fault+0x2e0/0x2fc
Jun 13 21:42:00 fs1 kernel: [48345.814241]  [<ffffffff812fd185>] ? page_fault+0x25/0x30
Jun 13 21:42:00 fs1 kernel: [48345.814242] Mem-Info:
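
The two excerpts appear to record different events: the first is a transmit queue timeout on eth1, while the second shows glusterfs invoking the oom-killer with free swap down to 16kB out of 2421752kB. A rough Python sketch for separating the two kinds of event in a kernel log follows. The function name summarize and the regular expressions are my own, written against the message formats quoted in this post.

```python
import re

WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (\S+) \((\S+)\): transmit queue (\d+) timed out"
)
OOM = re.compile(r"(\S+) invoked oom-killer")
SWAP = re.compile(r"(Free|Total) swap\s*=\s*(\d+)kB")


def summarize(log_lines):
    """Collect watchdog timeouts, oom-killer invocations and swap figures."""
    summary = {"watchdog": [], "oom": [], "swap_kb": {}}
    for line in log_lines:
        match = WATCHDOG.search(line)
        if match:
            # (interface, driver) pairs, e.g. ("eth1", "e1000e")
            summary["watchdog"].append((match.group(1), match.group(2)))
            continue
        match = OOM.search(line)
        if match:
            # Name of the process whose allocation triggered the oom-killer
            summary["oom"].append(match.group(1))
            continue
        match = SWAP.search(line)
        if match:
            summary["swap_kb"][match.group(1).lower()] = int(match.group(2))
    return summary


if __name__ == "__main__":
    # Lines copied from the log excerpts in this post.
    sample = [
        "Jun 13 02:25:53 fs1 kernel: [4805019.591940] NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out",
        "Jun 13 21:41:51 fs1 kernel: [48337.715527] Free swap  = 16kB",
        "Jun 13 21:41:51 fs1 kernel: [48337.715528] Total swap = 2421752kB",
        "Jun 13 21:41:59 fs1 kernel: [48345.814202] glusterfs invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17",
    ]
    print(summarize(sample))
```

Running something like this over the full log could show whether the lock-ups line up with watchdog timeouts, memory exhaustion, or both.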

The only thing I can think of doing is installing this. However, I am running Debian, not CentOS, so I cannot install the "mod-e1000e" driver. Any more insight into this would be great.

Best Answer

This is a known issue with certain e1000e chipset drivers. I experienced the same packet loss problem last year, and compiling the latest Intel drivers from source fixed it for me. I'm currently using version 1.5.1 without any problems.

Once you've compiled the module, copy it to the correct modules directory (although this should happen automatically when you run make install). I'm not sure what the exact directory is on Debian. On my Ubuntu system running kernel 2.6.38-8 the correct path was /lib/modules/2.6.38-8-generic/kernel/drivers/net/e1000e/e1000e.ko, so yours will likely be a slight variation of this.
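
If you are unsure where the module ended up, a small Python sketch like the one below lists every e1000e.ko under the running kernel's module tree. It assumes modules live under /lib/modules/<kernel release>, and find_module_files is just an illustrative helper name.

```python
import os


def find_module_files(modules_root, filename="e1000e.ko"):
    """Return every path below modules_root whose file name is filename."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(modules_root):
        if filename in filenames:
            found.append(os.path.join(dirpath, filename))
    return sorted(found)


if __name__ == "__main__":
    # Assumption: modules for the running kernel live under /lib/modules/<release>.
    root = os.path.join("/lib/modules", os.uname().release)
    for path in find_module_files(root):
        print(path)
```

More than one result would suggest that an old copy is still installed next to the new one.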

Now, before rebooting, unload the existing module as root with modprobe -rv e1000e, then load the new module manually with modprobe e1000e. A ping test should now show no more packet loss. If you see packet loss again after rebooting, an older version of the module may have been loaded instead of the newly compiled one, in which case you should take a look at this previously answered question to solve that problem.
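
To confirm which version is actually loaded, before and after a reboot, you can read the version the kernel exposes in sysfs. This is a minimal sketch that assumes the driver publishes /sys/module/e1000e/version; the helper names are my own.

```python
from pathlib import Path


def loaded_module_version(module="e1000e", sysfs_root="/sys/module"):
    """Return the version string of a loaded module, or None if unavailable."""
    version_file = Path(sysfs_root) / module / "version"
    try:
        return version_file.read_text().strip()
    except OSError:
        # Module not loaded, or it does not export a version attribute.
        return None


def base_version(version):
    """Strip a suffix such as "-k2" so "1.2.20-k2" becomes "1.2.20"."""
    return version.split("-", 1)[0]


if __name__ == "__main__":
    version = loaded_module_version()
    if version is None:
        print("e1000e is not loaded or exposes no version")
    else:
        print(f"loaded e1000e version: {version} (base {base_version(version)})")
```

If this still reports the old version (1.2.20-k2 in the question) after a reboot, the newly built module is not the one being loaded.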

Take a look through the README file in the Intel driver source directory; the installation instructions are fairly helpful.