Windows 8 does not always select the correct network card for outgoing transmissions

arp routing

The company I work for builds and sells industrial machines. One of our products is a machine which is controlled by a PC running Windows. This particular machine uses a networked device with digital inputs and outputs connected to the machine. Our software sends commands over the Ethernet network to read and write the values of I/O points on this device. The device uses the UDP protocol for communication.

The PCs we use typically have two or more network cards (NICs). We call one of these NICs the Machine LAN and assign it a private address of 192.168.1.49/24. The I/O devices have IP addresses of 192.168.1.11/24, 192.168.1.12/24, etc.

A second NIC can be connected to the mill's (the customer's) general network and is called the Mill LAN. This is usually configured for DHCP addressing.

Our application is configured with the IP address of the I/O device and thus generates UDP traffic for that address. Under normal circumstances I can monitor this traffic with Wireshark and see UDP packets flowing back and forth to the device's IP address through the Machine LAN interface. I can also ping the I/O device and watch the ICMP packets bounce back and forth between the PC and the I/O device through the Machine LAN interface.

Because this is an industrial application we want to make sure everything is as robust as possible and that our application recovers from things like network failures. To this end I conduct tests here at our manufacturing facility where I disconnect the I/O device from the network, monitor our application's behavior, then reconnect the I/O device and ensure the application starts talking to the device again. Sometimes everything recovers and sometimes it doesn't. It appears to me that sometimes conducting this test causes Windows to start sending traffic for the 192.168.1.11 address through the Mill LAN interface and not the Machine LAN interface. When this happens there are obviously no responses from the I/O device and the application is unable to interact with the device. I have studied the PC's network configuration and routing tables and spent a lot of time searching the Internet for ideas, but I cannot figure out the reason for this behavior.

I have confirmed that Windows is sending IP traffic to the Mill LAN interface and not to the Machine LAN interface by observing the traffic with Wireshark. I can observe this with both the UDP packets generated by my application and the ICMP packets generated by ping.exe, and I therefore conclude that the problem lies outside our application.

One of the things I have tried is manipulating the routing metrics (both the interface and gateway metrics) in an attempt to coerce Windows into using the Machine LAN interface. That doesn't seem to help. You will see these adjusted/exaggerated metrics in the configuration listings below.

When the symptom occurs I can still successfully ping the I/O device if I explicitly tell ping.exe which interface to use:

C:\>ping -S 192.168.1.49 192.168.1.11

Pinging 192.168.1.11 from 192.168.1.49 with 32 bytes of data:
Reply from 192.168.1.11: bytes=32 time=6ms TTL=16
Reply from 192.168.1.11: bytes=32 time=7ms TTL=16
Reply from 192.168.1.11: bytes=32 time=7ms TTL=16
Reply from 192.168.1.11: bytes=32 time=7ms TTL=16

Ping statistics for 192.168.1.11:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 6ms, Maximum = 7ms, Average = 6ms
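
Since forcing the source address makes ping use the Machine LAN, one thing that could be tried at the application level is binding the UDP socket to 192.168.1.49 before sending. The following is a minimal Python sketch of that idea, not the application's actual code. The function name and the device port (5000) are placeholders, and whether binding changes the interface Windows picks while the symptom is active is an assumption that would need to be tested.

```python
import socket


def open_bound_udp_socket(local_ip: str, local_port: int = 0) -> socket.socket:
    """Create a UDP socket whose source address is pinned to local_ip.

    Port 0 lets the operating system choose a free local port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((local_ip, local_port))
    return sock


# Hypothetical usage on the machine PC (addresses from this post,
# port 5000 is a placeholder for the I/O device's real UDP port):
#
#   sock = open_bound_udp_socket("192.168.1.49")
#   sock.sendto(b"read-inputs", ("192.168.1.11", 5000))
```

This is the socket-level counterpart of `ping -S`: every datagram sent on the socket carries the bound address as its source.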

The symptom sometimes goes away by itself after a short time, but usually it persists for a long time (I assume indefinitely). I can also make the symptom go away by disabling the Mill LAN interface; this makes sense because Windows then has only one interface to route all traffic. I can also make the symptom go away by deleting the ARP entry for the I/O device (I have no idea why this works):

C:\>arp -d 192.168.1.11

When the symptom is occurring I can still ping other devices on the Machine LAN, so the routing of packets through appropriate interfaces seems to be working in general (just not for the one particular address). Whatever the phenomenon is it appears to be related to a single IP address. Since deleting the ARP record for that address makes the symptom go away I suspect something ARP related, but I don't really know for sure.

It appears that the ARP entry for 192.168.1.11 goes away when the symptom is happening. Before the symptom starts there is an entry (with the correct MAC address):

C:\>arp -a | findstr 192.168.1.11
  192.168.1.11          00-50-8e-00-26-e2     dynamic

After inducing the symptom the entry is gone:

C:\>arp -a | findstr 192.168.1.11

C:\>

For whatever reason it seems that deleting the non-existent ARP entry restores communications.

One additional observation: I monitored output from a continuous ping (ping -t 192.168.1.11). Here is a case where I was able to unplug the cable for a few seconds, plug it back in, and ping was able to resume talking:

Reply from 192.168.1.11: bytes=32 time=9ms TTL=16
Reply from 192.168.1.11: bytes=32 time=6ms TTL=16
Request timed out.
Request timed out.
Reply from 192.168.1.11: bytes=32 time=2005ms TTL=16
Reply from 192.168.1.11: bytes=32 time=6ms TTL=16
Reply from 192.168.1.11: bytes=32 time=6ms TTL=16

It seems that when the symptom starts (communication does not recover) I see the "Destination host unreachable" message:

Reply from 192.168.1.11: bytes=32 time=9ms TTL=16
Reply from 192.168.1.11: bytes=32 time=6ms TTL=16
Request timed out.
Request timed out.
Reply from 192.168.1.49: Destination host unreachable.
Request timed out.
Request timed out.

I'm not 100% sure that this is always the case.

Here are the interfaces (note the metrics which I manually assigned):

C:\>netsh interface ip show config

Configuration for interface "Machine LAN"
    DHCP enabled:                         No
    IP Address:                           192.168.1.49
    Subnet Prefix:                        192.168.1.0/24 (mask 255.255.255.0)
    Default Gateway:                      0.0.0.0
    Gateway Metric:                       1
    InterfaceMetric:                      1
    Statically Configured DNS Servers:    None
    Register with which suffix:           Primary only
    Statically Configured WINS Servers:   None

Configuration for interface "Mill LAN"
    DHCP enabled:                         Yes
    IP Address:                           ***.16.1.31
    Subnet Prefix:                        ***.16.0.0/20 (mask 255.255.240.0)
    Default Gateway:                      ***.16.0.58
    Gateway Metric:                       500
    InterfaceMetric:                      500
    DNS servers configured through DHCP:  ***.16.6.20
                                          ***.16.16.131
    Register with which suffix:           Primary only
    WINS servers configured through DHCP: ***.16.6.20
                                          ***.16.16.131

Configuration for interface "Loopback Pseudo-Interface 1"
    DHCP enabled:                         No
    IP Address:                           127.0.0.1
    Subnet Prefix:                        127.0.0.0/8 (mask 255.0.0.0)
    InterfaceMetric:                      50
    Statically Configured DNS Servers:    None
    Register with which suffix:           None
    Statically Configured WINS Servers:   None

Here is the routing table (as rendered by both netsh and route commands):

C:\>netsh int ip show route

Publish  Type      Met  Prefix                    Idx  Gateway/Interface Name
-------  --------  ---  ------------------------  ---  ------------------------
No       Manual    100  0.0.0.0/0                   3  ***.16.0.58
No       Manual    1    0.0.0.0/0                   4  Machine LAN
No       System    256  ***.16.0.0/20               3  Mill LAN
No       System    256  ***.16.1.31/32              3  Mill LAN
No       System    256  ***.16.15.255/32            3  Mill LAN
No       Manual    1    192.168.1.0/24              4  Machine LAN
No       System    256  192.168.1.49/32             4  Machine LAN
No       System    256  192.168.1.255/32            4  Machine LAN
No       System    256  224.0.0.0/4                 3  Mill LAN
No       System    256  224.0.0.0/4                 4  Machine LAN
No       System    256  255.255.255.255/32          3  Mill LAN
No       System    256  255.255.255.255/32          4  Machine LAN


C:\>route print
===========================================================================
Interface List
  4...00 40 05 10 4e 9c ......D-Link DFE-530TX+ PCI Adapter
  3...00 1a a0 e8 72 59 ......Intel(R) 82566DM-2 Gigabit Network Connection
  1...........................Software Loopback Interface 1
  5...00 00 00 00 00 00 00 e0 Microsoft ISATAP Adapter
  7...00 00 00 00 00 00 00 e0 Microsoft ISATAP Adapter #2
===========================================================================

IPv4 Route Table
===========================================================================
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0      ***.16.0.58      ***.16.1.31    600
          0.0.0.0          0.0.0.0         On-link      192.168.1.49      2
       ***.16.0.0    255.255.240.0         On-link       ***.16.1.31    756
      ***.16.1.31  255.255.255.255         On-link       ***.16.1.31    756
    ***.16.15.255  255.255.255.255         On-link       ***.16.1.31    756
      192.168.1.0    255.255.255.0         On-link      192.168.1.49      2
     192.168.1.49  255.255.255.255         On-link      192.168.1.49    257
    192.168.1.255  255.255.255.255         On-link      192.168.1.49    257
        224.0.0.0        240.0.0.0         On-link       ***.16.1.31    756
        224.0.0.0        240.0.0.0         On-link      192.168.1.49    257
  255.255.255.255  255.255.255.255         On-link       ***.16.1.31    756
  255.255.255.255  255.255.255.255         On-link      192.168.1.49    257
===========================================================================
Persistent Routes:
  Network Address          Netmask  Gateway Address  Metric
          0.0.0.0          0.0.0.0     192.168.1.49       1
===========================================================================
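
For reference, a route lookup normally picks the matching route with the longest prefix and breaks ties on the lowest metric. The Python sketch below applies that rule to the unmasked entries of the table above (the Mill LAN subnet routes are omitted because their addresses are masked). It is only an illustration of the expected selection, not a model of what Windows does internally; the `Route` type and `select_route` function are names made up for this example.

```python
import ipaddress
from typing import NamedTuple, Optional, Sequence


class Route(NamedTuple):
    prefix: ipaddress.IPv4Network
    interface: str
    metric: int


def select_route(dest: str, routes: Sequence[Route]) -> Optional[Route]:
    """Longest-prefix match; ties are broken by the lowest metric."""
    addr = ipaddress.ip_address(dest)
    candidates = [r for r in routes if addr in r.prefix]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.prefix.prefixlen, -r.metric))


# Unmasked entries copied from the "route print" output above.
ROUTES = [
    Route(ipaddress.ip_network("0.0.0.0/0"), "Mill LAN", 600),
    Route(ipaddress.ip_network("0.0.0.0/0"), "Machine LAN", 2),
    Route(ipaddress.ip_network("192.168.1.0/24"), "Machine LAN", 2),
    Route(ipaddress.ip_network("192.168.1.49/32"), "Machine LAN", 257),
]

best = select_route("192.168.1.11", ROUTES)
print(best.interface, best.prefix)
```

Going by the table alone, traffic for 192.168.1.11 matches the 192.168.1.0/24 route on the Machine LAN, which is why the observed behavior is surprising.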

I have seen the same symptoms on XP, Windows 7, and Windows 8 PCs although I have only used Wireshark to observe the traffic going through the wrong interface on Windows 8.

Confession time: We do not have any nodes on the Machine LAN that have an address of 192.168.1.1, but I get ping responses from that address through the Mill LAN interface. Something somewhere out on (or accessible from) the Mill LAN has that address. Here's a tracert which shows it is only one hop away and probably on my company's internal network:

C:\>tracert 192.168.1.1

Tracing route to 192.168.1.1 over a maximum of 30 hops

  1    <1 ms    <1 ms    <1 ms  ***.16.0.58
  2    12 ms    47 ms    24 ms  192.168.1.1

Trace complete.

I assume that the existence of this 192.168.1.1 device probably constitutes an incorrectly configured network and that I should investigate why it is visible to my PC (I didn't think these private addresses were supposed to be routable). In any case I would like to figure out how to make things work as they are because in my experience devices with 192.168.1.* addresses occasionally do appear at customer sites (on the Mill LAN) and I would like our system to continue to work even if they do. In other words, I would like to have my PC only use the Machine LAN interface for traffic with the 192 addresses. If anyone has any ideas how I can accomplish that I'd love to hear them!

Best Answer

I was first going to say that this question would be better answered on Superuser or Serverfault, but I want to address a strategic problem you will have:

You have chosen addresses from 192.168.0.0/16 (specifically 192.168.1.0/24) for your "private" LAN. Unfortunately, that is among the most commonly used private address ranges, so you will likely run into address conflicts often, as you seem to have done here.

It's not true that 192.168.0.0 addresses can't be routed. They can, and are routed all the time within a company network. They can't be routed over the Internet, however. You are probably thinking of the "link local" network, 169.254.0.0/16. That network is not (supposed to be) routed at all, so you won't have the address conflicts you are experiencing.

You should use addresses from the 169.254.0.0/16 address range. Pick a small subnet out of that range sized for the number of devices you have (e.g. 169.254.55.64/28, which has 14 usable host addresses, is enough for the PC plus about 10 I/O devices).
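
To sanity-check a candidate subnet, Python's standard `ipaddress` module can list its usable hosts and confirm that it falls in the link-local range. This is only a small sketch; the `describe_subnet` helper is a made-up name, and 169.254.55.64/28 is just the example subnet from above.

```python
import ipaddress


def describe_subnet(cidr: str) -> dict:
    """Summarise a candidate subnet for the machine network."""
    net = ipaddress.ip_network(cidr)
    hosts = list(net.hosts())  # excludes network and broadcast addresses
    return {
        "netmask": str(net.netmask),
        "first_host": str(hosts[0]),
        "last_host": str(hosts[-1]),
        "usable_hosts": len(hosts),
        "link_local": net.is_link_local,
    }


info = describe_subnet("169.254.55.64/28")
for key, value in info.items():
    print(f"{key}: {value}")

# The current I/O device address is private but not link-local.
print(ipaddress.ip_address("192.168.1.11").is_link_local)
```

For the example subnet, the hosts run from 169.254.55.65 to 169.254.55.78 with a netmask of 255.255.255.240.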