Router – High ping latency over direct Gigabit Ethernet link on Enterprise-class hardware

gigabit-ethernet latency modem ping router

FINAL EDIT 7/7: Multiple cable, port, and device substitutions have narrowed this down to the Comcast modem; the problem appears on all 8 of its Ethernet ports. As the comments say, this is a closed device that is unlikely to yield much real information, so we may never find out the cause, but I will post the resolution as an answer nonetheless.

(Edit 1/6 motivation; underlying real issue)
Based on user complaints of teleconference issues, I wanted to eliminate all possible causes. After verifying picture-perfect LAN and WiFi connectivity from laptop to server, then seeing very uneven end-to-end ping latency — often well above the 100ms suggested limit — with the Google Meet server, as directed here, I backtracked to the source of the latency. (This is not necessarily the root cause of Google Meet delays, of course, but I need to eliminate this as a possible cause.)

It turns out the uneven and high latency is coming from the direct link between a Sophos UTM 9 SG125 (Firmware: 9.703-3) and a Comcast CGA4131COM Gigabit modem (Manufacturer: Technicolor; Hardware revision: 2.3; Chipset: Broadcom).

Both sides of the link report a Gigabit connection. A speed test against speedtest.xfinity.com from a hardwired on-LAN server gives results in the 400Mbps range. (Edit 2/6: Additional evidence of issue) This seems great, except that when the same server is wired directly to the modem, cutting out the router and the rest of the LAN entirely, the throughput is 930Mbps.

After a long ping test from the SSH command line on the Sophos to the directly connected modem, over a 10-foot Cat5e cable:

--- xx.xx.xx.134 ping statistics ---
756 packets transmitted, 756 received, 0% packet loss, time 755277ms
rtt min/avg/max/mdev = 0.162/21.789/199.543/34.605 ms

The long pings are quite densely interspersed:

64 bytes from xx.xx.xx.134: icmp_seq=1 ttl=64 time=58.2 ms
64 bytes from xx.xx.xx.134: icmp_seq=2 ttl=64 time=0.645 ms
64 bytes from xx.xx.xx.134: icmp_seq=3 ttl=64 time=72.4 ms
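
To put a number on "densely interspersed", the reply lines can be parsed and the share of slow replies counted. This is only a minimal sketch: it assumes the Linux iputils `time=... ms` reply format shown above, and the 10 ms cutoff is an arbitrary illustrative threshold, not something taken from the test.

```python
import re
import statistics

# Matches the "time=58.2 ms" field in each ping reply line.
RTT_PATTERN = re.compile(r"time=([\d.]+)\s*ms")


def parse_rtts(ping_output):
    """Return the round-trip times (in ms) found in ping reply lines."""
    return [float(m.group(1)) for m in RTT_PATTERN.finditer(ping_output)]


def summarize(rtts, threshold_ms=10.0):
    """Summarize RTTs; threshold_ms is a hypothetical cutoff for 'slow'."""
    slow = [r for r in rtts if r > threshold_ms]
    return {
        "count": len(rtts),
        "min": min(rtts),
        "median": statistics.median(rtts),
        "max": max(rtts),
        "slow_fraction": len(slow) / len(rtts),
    }


# The three reply lines quoted in the post.
sample = """\
64 bytes from xx.xx.xx.134: icmp_seq=1 ttl=64 time=58.2 ms
64 bytes from xx.xx.xx.134: icmp_seq=2 ttl=64 time=0.645 ms
64 bytes from xx.xx.xx.134: icmp_seq=3 ttl=64 time=72.4 ms
"""

print(summarize(parse_rtts(sample)))
```

Feeding it the full saved ping log instead of the three-line sample would give the fraction for the whole run.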

Trying this all day long changed nothing. All extraneous features of the modem are disabled: port forwarding, port triggering, firewall, MAC access control, DHCP, WiFi, etc.

(Edit 3/6 Re: Possible traffic load accounting for delay) This occurred in the middle of the night as well, so it is not traffic-dependent. With link utilization under 1%, and despite the possible inaccuracy of ping, prioritization should not be a factor.

(Edit 4/6 Re: Possible low prioritization of ICMP) traceroute, using UDP, shows identical delay patterns:

traceroute -q 10 -w 1 10.1.10.1
traceroute to 10.1.10.1 (10.1.10.1), 30 hops max, 40 byte packets using UDP
 1  10.1.10.1 (10.1.10.1)  71.784 ms   70.684 ms * * *   66.310 ms * * * *
traceroute -q 10 -w 1 10.1.10.1
traceroute to 10.1.10.1 (10.1.10.1), 30 hops max, 40 byte packets using UDP
 1  10.1.10.1 (10.1.10.1)  1.218 ms   1.151 ms * * * * * * * *
traceroute -q 10 -w 1 10.1.10.1
traceroute to 10.1.10.1 (10.1.10.1), 30 hops max, 40 byte packets using UDP
 1  10.1.10.1 (10.1.10.1)  61.156 ms * * * *   55.497 ms   54.370 ms * * *
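
The hop lines above can be split into answered probes and timeouts with a few lines of Python. This is a rough sketch that assumes the single-hop layout shown here (RTTs followed by `ms`, unanswered probes shown as `*`); `parse_hop` is a name made up for illustration. Note that a `*` only means no reply arrived within the `-w 1` wait, and that can be caused by the responder rate-limiting its ICMP replies rather than by actual loss.

```python
import re


def parse_hop(line):
    """Split one traceroute hop line into answered RTTs (ms) and a timeout count."""
    # A number followed by "ms" is an answered probe.
    rtts = [float(x) for x in re.findall(r"([\d.]+)\s*ms", line)]
    # A standalone "*" is a probe that got no reply within the wait time.
    timeouts = len(re.findall(r"(?<!\S)\*(?!\S)", line))
    return rtts, timeouts


# The three hop lines quoted in the post.
hops = [
    " 1  10.1.10.1 (10.1.10.1)  71.784 ms   70.684 ms * * *   66.310 ms * * * *",
    " 1  10.1.10.1 (10.1.10.1)  1.218 ms   1.151 ms * * * * * * * *",
    " 1  10.1.10.1 (10.1.10.1)  61.156 ms * * * *   55.497 ms   54.370 ms * * *",
]

for line in hops:
    rtts, timeouts = parse_hop(line)
    print(f"answered={len(rtts)} timeouts={timeouts} max_rtt={max(rtts)} ms")
```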

(Edit 5/6 Re: Normal behavior for this ISP and modem) At a different customer site, with identical modem hardware connected to a $65 EdgeRouter X, I see normal behavior from the router:

--- 10.1.10.1 ping statistics ---
60 packets transmitted, 60 received, 0% packet loss, time 59398ms
rtt min/avg/max/mdev = 0.278/1.201/2.175/0.554 ms

Similarly, after 100 traceroute UDP packets sent at this 2nd site, the slowest of all was 3.3ms.
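
For a side-by-side view of the two sites, the `rtt min/avg/max/mdev` summary lines can be parsed and compared. The sketch below uses only the two summary lines quoted in this post; the names `problem_site` and `healthy_site` are labels chosen for illustration.

```python
import re

# Matches the summary line printed at the end of a Linux ping run.
SUMMARY = re.compile(
    r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)


def parse_summary(line):
    """Return min/avg/max/mdev (ms) from a ping summary line."""
    m = SUMMARY.search(line)
    if m is None:
        raise ValueError("not a ping rtt summary line")
    lo, avg, hi, mdev = (float(x) for x in m.groups())
    return {"min": lo, "avg": avg, "max": hi, "mdev": mdev}


problem_site = parse_summary("rtt min/avg/max/mdev = 0.162/21.789/199.543/34.605 ms")
healthy_site = parse_summary("rtt min/avg/max/mdev = 0.278/1.201/2.175/0.554 ms")

for key in ("avg", "max", "mdev"):
    ratio = problem_site[key] / healthy_site[key]
    print(f"{key}: {ratio:.1f}x the healthy site")
```

The minimum RTT at the problem site is actually lower than at the healthy one, so the link's best case is fine; the difference is entirely in the average, the maximum, and the spread.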

(Edit 6/6 Re: Possible normalcy, generally) Between any modem and router I've never seen this delay pattern in years of working with broadband, both low and high end, with multiple vendors. I have not discounted the Sophos side; I will report when I can get onsite with a direct connect to a different device.

On the Sophos, no packet errors:

router:/var/log# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 7C:xx:xx:xx:xx:94  
          inet addr:96.xx.xx.129  Bcast:96.xx.xx.135  Mask:255.255.255.248
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:300119356 errors:0 dropped:0 overruns:0 frame:0
          TX packets:243077712 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:264200277517 (251961.0 Mb)  TX bytes:197347533783 (188205.2 Mb)
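
The "no packet errors" reading can be checked mechanically by pulling the counters out of the `ifconfig` text and flagging any that are non-zero. This is a small sketch that assumes the older net-tools output format shown above (`RX packets:N errors:N ...`); the function names are hypothetical.

```python
import re

# Matches the "RX packets:..." and "TX packets:..." counter lines only.
COUNTER_LINE = re.compile(r"^\s*(RX|TX) packets:(\d+)\s+(.*)$")


def parse_counters(ifconfig_output):
    """Return {'RX': {...}, 'TX': {...}} with the integer counters per direction."""
    counters = {}
    for line in ifconfig_output.splitlines():
        m = COUNTER_LINE.match(line)
        if not m:
            continue
        direction, packets, rest = m.groups()
        fields = {"packets": int(packets)}
        for name, value in re.findall(r"(\w+):(\d+)", rest):
            fields[name] = int(value)
        counters[direction] = fields
    return counters


def problem_counters(counters):
    """Return every non-zero counter other than the packet totals."""
    return {
        (direction, name): value
        for direction, fields in counters.items()
        for name, value in fields.items()
        if name != "packets" and value != 0
    }


# The two counter lines from the eth1 output in the post.
sample = """\
          RX packets:300119356 errors:0 dropped:0 overruns:0 frame:0
          TX packets:243077712 errors:0 dropped:0 overruns:0 carrier:0
"""

counters = parse_counters(sample)
print(problem_counters(counters) or "no error counters set")
```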

On eth0, the Sophos pings an on-LAN server with the typical, very steady 0.1-0.2ms latency.

router:/# ping 192.168.1.5
PING 192.168.1.5 (192.168.1.5) 56(84) bytes of data.
64 bytes from 192.168.1.5: icmp_seq=1 ttl=128 time=0.198 ms
64 bytes from 192.168.1.5: icmp_seq=2 ttl=128 time=0.128 ms

There's no loading (CPU or memory or disk) on the Sophos at all, nor anything remarkable in the logs, nor anything in dmesg.

lshw reports:

      *-network:1
            description: Ethernet interface
            product: Ethernet Connection X553 1GbE
            vendor: Intel Corporation
            physical id: 0.1
            bus info: pci@0000:0b:00.1
            logical name: eth1
            version: 11
            serial: 7c:xx:xx:xx:xx:94
            size: 1Gbit/s
            capacity: 1Gbit/s
            width: 64 bits
            clock: 33MHz
            capabilities: pm msi msix pciexpress bus_master cap_list rom ethernet physical tp 10bt-fd 100bt-fd 1000bt-fd autonegotiation
            configuration: autonegotiation=on broadcast=yes driver=ixgbe driverversion=5.2.4 duplex=full firmware=0x80000878 ip=96.86.73.129 latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
            resources: irq:17 memory:dfa00000-dfbfffff memory:dfe00000-dfe03fff memory:dc500000-dc57ffff

The modem has almost nothing to report when searching 90 days of logs. Today, only:

FW.WANATTACK DROP , 34 Attempts, 2020/6/16 15:58:01
Firewall Blocked

Detailed software stats on modem:

eMTA & DOCSIS Software Version: CM DOCSIS Application - Prod_18.1_d31 & MTA Application - Prod_18.1
Software Image Name: CGA4131COM_3.12p12s1_PROD_sey
Advanced Services: CGA4131COM
Packet Cable: 2.0

Best Answer

Defective Comcast modem, as @Appleoddity mentioned. The only solution was to have it swapped out. This post, linked to their community board, helped convince them that all the homework had been done, so there was no objection, cost, or delay in doing it. Closed system, grumble grumble. The tech had not seen this problem before.