No internet after dhcp lease renews

dhcp internet networking routing

Today a number of machines stopped getting internet access. After a lot of troubleshooting, the common thread is that they all had their DHCP lease renewed today (we're on 8-day leases here).

Everything you would expect looks good after the lease renewal: they have a valid IP address, DNS server, and gateway. They have access to internal resources (file shares, intranet, printers, etc.). A little more troubleshooting reveals they are unable to ping or tracert to our gateway, but they can reach our core layer 3 switch just in front of the gateway. Assigning a static IP to the machine works as a temporary solution.

One final wrinkle is that so far reports have only come in for clients on the same VLAN as the gateway. Our administrative staff and faculty are on the same VLAN as the servers and printers, but phones, key fob/cameras, students/wifi, and labs each have their own VLANs, and as far as I've seen nothing on any of the other VLANs has had a problem yet.

I have a separate ticket in with the gateway vendor, but I suspect they'll take the easy out and tell me the problem is elsewhere on the network, so I'm asking here as well. I've cleared ARP caches on the gateway and core switch. Any ideas welcome.

Update:
I tried pinging from the gateway back to some affected hosts, and the odd thing is that I did get a response: from a completely different IP address. I tried a few more at random and eventually got this:

Fri Sep 02 2011 13:08:51 GMT-0500 (Central Daylight Time)
PING 10.1.1.97 (10.1.1.97) 56(84) bytes of data.
64 bytes from 10.1.1.105: icmp_seq=1 ttl=255 time=1.35 ms
64 bytes from 10.1.1.97: icmp_seq=1 ttl=255 time=39.9 ms (DUP!)

10.1.1.97 is the actual intended target of the ping. 10.1.1.105 is supposed to be a printer in another building. I have never seen a DUP in a ping response before.
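For anyone wanting to spot this pattern across many hosts, here is a minimal Python sketch that scans Linux-style ping output for replies from an address other than the target, and for replies marked `(DUP!)`. The function name `suspicious_replies` is hypothetical, and the regex assumes the reply-line format shown in the output above; other ping implementations format their lines differently.

```python
import re

# Matches reply lines such as:
#   64 bytes from 10.1.1.105: icmp_seq=1 ttl=255 time=1.35 ms
REPLY_RE = re.compile(r"^\d+ bytes from (\d+\.\d+\.\d+\.\d+): icmp_seq=(\d+)")

def suspicious_replies(target, ping_output):
    """Return (seq, source, reason) tuples for replies that look wrong."""
    findings = []
    for line in ping_output.splitlines():
        match = REPLY_RE.match(line.strip())
        if not match:
            continue  # header, timestamp, or summary line
        source, seq = match.group(1), int(match.group(2))
        if source != target:
            findings.append((seq, source, "reply from unexpected address"))
        if "(DUP!)" in line:
            findings.append((seq, source, "duplicate reply"))
    return findings

# The output pasted above, used as sample input.
SAMPLE = """PING 10.1.1.97 (10.1.1.97) 56(84) bytes of data.
64 bytes from 10.1.1.105: icmp_seq=1 ttl=255 time=1.35 ms
64 bytes from 10.1.1.97: icmp_seq=1 ttl=255 time=39.9 ms (DUP!)"""

for finding in suspicious_replies("10.1.1.97", SAMPLE):
    print(finding)
```

Either kind of finding suggests that more than one device is answering for the same address, which is worth chasing at the ARP level.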

My best guess at the moment is a rogue wifi router in one of our dorm rooms on the 10.1.1.0/24 subnet with a bad gateway.

…continued. I've now powered down the offending printer, and pings to an affected host from the gateway just fail completely.

Update 2:
I checked the ARP tables on an affected machine, the gateway, and every switch between them. At each point, the entries for those devices were all correct. I didn't verify every entry in each table, but every entry that could possibly impact traffic between the host and the gateway was okay. ARP does not appear to be the problem.

Update 3:
Things are working at the moment, but I can't see anything I did to fix them and so I have no idea whether this might be just a temporary lull. Anyway, there's not much I can do to diagnose or troubleshoot now, but I'll update more if it breaks again.

Best Answer

"My best guess at the moment is a rogue wifi router in one of our dorm rooms on the 10.1.1.0/24 subnet with a bad gateway."

This happened in my office. The offending device turned out to be a rogue Android device:

http://code.google.com/p/android/issues/detail?id=11236

If the Android device gets the gateway's IP from another network via DHCP, it may join your network and start responding to ARP requests for the gateway IP with its own MAC. Your use of the common 10.1.1.0/24 network increases the probability of this rogue scenario.

I was able to check the ARP cache on an affected workstation on the network. There, I observed an ARP flux problem where the workstation would flip-flop between the correct MAC and a MAC address from some rogue device. When I looked up the suspicious MAC the workstation had for the gateway, it came back with a Samsung prefix. The astute user with the troubled workstation replied that he knew who had a Samsung device on our network. Turned out to be the CEO.
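The flip-flop check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (IP, MAC) pairs are taken to have been sampled over time from the workstation's ARP cache (for example by running `arp -a` repeatedly and extracting the pairs yourself), the function names `find_arp_flux` and `oui_prefix` are hypothetical, and the gateway address 10.1.1.1 and all MAC addresses in the sample are made up.

```python
from collections import defaultdict

def normalize(mac):
    """Lower-case a MAC and use colons, so Windows and Unix styles compare equal."""
    return mac.lower().replace("-", ":")

def find_arp_flux(observations):
    """Return {ip: [macs...]} for every IP seen with more than one MAC.

    observations: iterable of (ip, mac) pairs sampled over time.
    """
    seen = defaultdict(list)
    for ip, mac in observations:
        mac = normalize(mac)
        if mac not in seen[ip]:
            seen[ip].append(mac)
    return {ip: macs for ip, macs in seen.items() if len(macs) > 1}

def oui_prefix(mac):
    """First three octets of the MAC: the vendor prefix to look up."""
    return normalize(mac)[:8]

# Hypothetical samples: the gateway IP alternates between two MACs.
samples = [
    ("10.1.1.1", "00-11-22-33-44-55"),
    ("10.1.1.1", "AA:BB:CC:00:00:01"),
    ("10.1.1.1", "00:11:22:33:44:55"),
    ("10.1.1.50", "00:11:22:33:44:66"),
]

for ip, macs in find_arp_flux(samples).items():
    print(ip, "answered by", [(mac, oui_prefix(mac)) for mac in macs])
```

Any IP reported here, especially the gateway's, has had more than one device answering for it. Looking up the printed vendor prefix in an OUI database is how the Samsung lead turned up in my case.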
