ESXi VMKernel Port Issue in Active/Active Configuration

cisco-catalyst vmware-esxi

I have the following simplified configuration:


Essentially, I have an ESXi host with two physical network adapters. Each adapter plugs into a different switch. The two switches are connected to each other via a trunk. A PC is connected to one of the switches. A vSwitch with a VMKernel port and VM ports is configured to use both physical NICs in an Active/Active configuration.


I have run esxtop and can see that the ESXi host has chosen the physical NIC connected to Switch 2 for the VMKernel port. If I ping the management IP address of the ESXi host from the PC, the replies are intermittent: they come and go.

If I run show mac address-table on each switch, I see that Switch 2 always has the VMKernel's MAC address assigned to the switch port connected to the ESXi host. Switch 1, however, continually adds and removes the VMKernel's MAC address on its own port connected to the ESXi host. Any time Switch 1 has the VMKernel's MAC assigned to that port, the pings fail.
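To make the flapping described above easier to spot, here is a minimal Python sketch that counts how often a MAC address changes port in a series of MAC table samples. The sample format, the MAC address, and the port names are all hypothetical; in practice the samples would come from repeated show mac address-table output collected by whatever means is available.

```python
from collections import defaultdict

def count_mac_transitions(samples):
    """Count how often each (switch, mac) entry changes port between samples.

    samples: chronological list of (switch, mac, port) tuples, where port is
    None if the MAC was absent from that switch's table at sample time.
    """
    last_port = {}
    transitions = defaultdict(int)
    for switch, mac, port in samples:
        key = (switch, mac.lower())
        # A change relative to the previous sample counts as one transition.
        if key in last_port and last_port[key] != port:
            transitions[key] += 1
        last_port[key] = port
    return dict(transitions)

def flapping_entries(samples, min_transitions=2):
    """Return the (switch, mac) pairs that moved at least min_transitions times."""
    counts = count_mac_transitions(samples)
    return sorted(key for key, n in counts.items() if n >= min_transitions)

# Hypothetical data: the MAC is stable on switch2 but moves between the
# port-channel and a host-facing port on switch1.
VMK_MAC = "00:11:22:33:44:55"
SAMPLES = [
    ("switch1", VMK_MAC, "Po1"),
    ("switch2", VMK_MAC, "Gi1/0/3"),
    ("switch1", VMK_MAC, "Gi1/0/7"),
    ("switch2", VMK_MAC, "Gi1/0/3"),
    ("switch1", VMK_MAC, "Po1"),
    ("switch2", VMK_MAC, "Gi1/0/3"),
]

if __name__ == "__main__":
    for switch, mac in flapping_entries(SAMPLES):
        print(f"{mac} is flapping on {switch}")
```

With the hypothetical samples above, only the switch1 entry is reported, which matches the pattern in the question: one switch holds a stable entry while the other keeps relearning the address on a different port.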

The reason for the failure is obvious. The reason why Switch 1 is routinely picking up the MAC address of the ESXi VMKernel port is the question. The ESXi host has chosen the interface connected to Switch 2 to be the active port. The interface connected to Switch 1 should be inactive. But, it would appear that it is possibly responding to ARP requests?

It's worth noting that none of the VMs on this host have this problem. They are all reachable and are present in only one MAC table at a time. This problem specifically affects the VMKernel port.

What about this configuration is wrong? In addition to a solution, I am looking for documentation or an explanation of the cause. I know that setting the VMKernel port to Active/Standby mode will probably solve the issue, but I can't find anything documenting why the current configuration is a problem.

UPDATES:

  • I disabled CDP on the vSwitch thinking that it might be causing communication over the inactive NIC.
  • I overrode the vSwitch settings for the VMKernel port and set it to use explicit failover order with Active/Standby. I also tried placing the standby NIC in the unused pool. None of it helped. What did solve the issue was swapping the adapter order: when the NIC connected to Switch 1 is the active one, I do not see the issue, and the MAC address never appears on Switch 2's host-facing port. The two NICs are significantly different models, and I'm wondering whether this is some kind of driver issue.

Something has to be causing the VMKernel MAC address to be seen on Switch 1's port, but it comes and goes every several seconds.

Switch configs for STP and ports:
Switch 1

!
spanning-tree mode rapid-pvst
spanning-tree portfast edge default
spanning-tree extend system-id
!
interface Port-channel1
 switchport access vlan 11
 switchport trunk encapsulation dot1q
 switchport mode trunk
!
interface GigabitEthernet1/0/7
 switchport access vlan 11
 switchport mode access
!
interface GigabitEthernet1/0/23
 switchport access vlan 11
 switchport trunk encapsulation dot1q
 switchport mode trunk
 channel-group 1 mode desirable
!
interface GigabitEthernet1/0/24
 switchport access vlan 11
 switchport trunk encapsulation dot1q
 switchport mode trunk
 channel-group 1 mode desirable

Switch 2

!
spanning-tree mode rapid-pvst
spanning-tree portfast edge default
spanning-tree extend system-id
!
interface Port-channel1
 switchport access vlan 11
 switchport trunk encapsulation dot1q
 switchport mode trunk
!
interface GigabitEthernet1/0/3
 switchport access vlan 11
 switchport mode access
!
interface GigabitEthernet1/0/23
 switchport access vlan 11
 switchport trunk encapsulation dot1q
 switchport mode trunk
 channel-group 1 mode desirable
!
interface GigabitEthernet1/0/24
 switchport access vlan 11
 switchport trunk encapsulation dot1q
 switchport mode trunk
 channel-group 1 mode desirable

Best Answer

During initial setup, the management vmk in ESXi takes on the MAC address of the NIC in the first PCI slot. It has always worked this way. It only breaks things when the physical device itself also starts sending frames from that same MAC. Normally that does not happen: physical NICs do not originate traffic, they just pass it along. You also need to keep this behavior in mind if you move physical NICs from one host to another, because the duplicate MAC can take down the connections of both hosts when the physical switch sees it in two places. My guess is that this NIC started sending CDP/LLDP frames of its own, and that is when your switch sees the MAC duplication. The easiest solution is to rebuild the vmk from the command line. This has to be done from direct console access (the DCUI via KVM, iLO, iDRAC, etc.), because removing vmk0 drops the management connection.
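As a quick way to check whether a host is in the situation described above, the following Python sketch compares a vmk MAC address against the physical NIC MAC addresses. It assumes the addresses have been copied by hand from esxcli network ip interface list and esxcli network nic list; the addresses in the example are made up. It also assumes that a regenerated VMkernel MAC falls under the VMware-registered 00:50:56 prefix.

```python
VMWARE_OUI = "00:50:56"  # VMware-registered prefix (assumed for rebuilt vmk ports)

def normalize_mac(mac):
    """Return a MAC as lowercase colon-separated octets, e.g. 00:11:22:33:44:55."""
    digits = "".join(c for c in mac.lower() if c in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))

def pnics_sharing_mac(vmk_mac, pnic_macs):
    """Return the names of physical NICs whose MAC equals the vmk MAC.

    pnic_macs: dict mapping NIC name (e.g. "vmnic0") to its MAC address.
    """
    target = normalize_mac(vmk_mac)
    return sorted(name for name, mac in pnic_macs.items()
                  if normalize_mac(mac) == target)

def has_vmware_oui(mac):
    """True if the MAC starts with the VMware 00:50:56 prefix."""
    return normalize_mac(mac).startswith(VMWARE_OUI)

if __name__ == "__main__":
    # Hypothetical values for illustration only.
    vmk0 = "00:11:22:33:44:55"
    pnics = {"vmnic0": "0011.2233.4455", "vmnic1": "00:11:22:33:44:66"}
    shared = pnics_sharing_mac(vmk0, pnics)
    if shared:
        print(f"vmk0 shares its MAC with: {', '.join(shared)}")
    elif has_vmware_oui(vmk0):
        print("vmk0 has a VMware-generated MAC")
```

If the check reports a shared MAC, the vmk is still using the address it inherited from the physical NIC at install time.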

Here are the commands (adjust the IP addresses, subnet mask, port group name, etc. to fit your environment):

esxcli network ip interface remove --interface-name=vmk0

esxcli network vswitch standard portgroup add -p Management_Network -v vSwitch0

esxcli network ip interface add --interface-name=vmk0 --portgroup-name=Management_Network

esxcli network vswitch standard portgroup set -p Management_Network --vlan-id 50

esxcli network ip interface ipv4 set --interface-name=vmk0 --ipv4=192.168.50.116 --netmask=255.255.255.0 --gateway=192.168.50.1 --type=static

esxcli network ip interface tag add -i vmk0 -t Management

This rebuilds the management vmk with a VMware-generated MAC address, which eliminates the duplication. However, I would recommend contacting the hardware vendor for the procedure to disable the CDP/LLDP frames coming from the physical card. Rebuilding the vmk resolves the issue on this one ESXi host, but it will happen on others if you allow the card(s) to keep doing this. If this were as big a problem as you originally thought, VMware would not be a giant company. It is not very common.
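Since the commands above need their addresses and names adjusted per host, here is a small Python sketch that fills them in from parameters and only prints the result. The function name and its parameters are hypothetical; the esxcli flags are copied from the commands above, and the gateway check is an added sanity test, not something esxcli requires.

```python
import ipaddress

def build_vmk_rebuild_commands(portgroup, vswitch, vlan_id, ip, netmask,
                               gateway, vmk="vmk0"):
    """Return the esxcli commands from the answer with values substituted."""
    # Sanity check: the gateway should sit in the same subnet as the vmk IP.
    network = ipaddress.ip_network(f"{ip}/{netmask}", strict=False)
    if ipaddress.ip_address(gateway) not in network:
        raise ValueError(f"gateway {gateway} is not inside {network}")
    return [
        f"esxcli network ip interface remove --interface-name={vmk}",
        f"esxcli network vswitch standard portgroup add -p {portgroup} -v {vswitch}",
        f"esxcli network ip interface add --interface-name={vmk} "
        f"--portgroup-name={portgroup}",
        f"esxcli network vswitch standard portgroup set -p {portgroup} "
        f"--vlan-id {vlan_id}",
        f"esxcli network ip interface ipv4 set --interface-name={vmk} "
        f"--ipv4={ip} --netmask={netmask} --gateway={gateway} --type=static",
        f"esxcli network ip interface tag add -i {vmk} -t Management",
    ]

if __name__ == "__main__":
    # Values taken from the example commands in the answer.
    for command in build_vmk_rebuild_commands(
            "Management_Network", "vSwitch0", 50,
            "192.168.50.116", "255.255.255.0", "192.168.50.1"):
        print(command)
```

The printed commands would still be typed at the host console, because the first one removes vmk0 and cuts off any remote session.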