How to prepare for swapping a NIC using network teaming on a Windows cluster host

Tags: hardware, network-teaming, nic, windows-cluster, windows-server-2012-r2

Update: I have now performed the upgrade. I used the half-ninja, half-hack solution of plugging in USB-to-Ethernet adapters that I could add to the teams to hold the fort. I plugged in one per team, removed the other affected adapters in the team, shut down Windows, swapped out the card, made sure the USB adapters were in the same USB port and would connect in the same way and booted up. The USB adapters were still there and I was able to restore the team configuration by manually adding the new NICs to the teams.

This solution was first proposed by @Drifter104 in a comment. @shouldbeq931 was the first answer to propose adding another card to sidestep the problem and received the bounty. Both answers were helpful so in fairness I am marking @llorrac's exhaustive top-voted answer as the answer, which pointed out the importance of removing the NICs on the broken card from the teams before swapping it out.

I still don't know exactly what happens when you don't do this, or what Microsoft's guidance for swapping out cards is, but that's Microsoft's fault and I appreciate the help I got here.


Original question:
I am administering a Windows Server 2012 R2 cluster running Hyper-V workloads. All cluster nodes have several networks, served by several physical network cards, where Windows Server's NIC teaming is used to team two ports together (the teams never span physical network cards). A port on a physical network card on one of the cluster nodes has recently experienced failure, and that port has been removed from its team and a new physical network card of identical make and model has been ordered.

  • If I replace the card as is and plug everything in the same way, will everything be picked back up by the NIC teaming? By cluster networks? The physical network card will be in the same slot and of the same model, but the MAC addresses will clearly be different, and I don't know if the tags that Dell put on the various ports to correlate them (there's an acronym for this but it eludes me) will be available.

  • If not, will I need to tear everything down and reconfigure the teams/cluster networks?

  • Is there any good official guidance or other advice about how to go about this? I haven't found anything, but I don't know quite what to search for. (The closest is this forum thread, which was written back when network teaming wasn't provided by Windows Server and you had to use a hardware solution from the vendor, so Microsoft's response to that situation was "you're on your own".)

Edit: hopefully this question will answer the general question "does stuff break and if so, how can I avoid that", but I realize that more details will be helpful so I am providing them.

The server has a total of six ports, split across two cards. One card has two 10 Gbit ports and a team spanning both ports. The other card has two 10 Gbit ports and a team spanning both ports, as well as two 1 Gbit ports and a team spanning both ports. The 1 Gbit team is hooked up to our general network switch. The two 10 Gbit teams are hooked up point-to-point directly to our storage server and to the other cluster node, and the networking all works out with hard-coded IP addresses and without a switch. (This works but I would not recommend it, nor would I repeat it in a new configuration. So yes, I know that it is horrible and prevents a bunch of useful things with VLAN and network hygiene. As far as I can tell it doesn't have an impact on what I'm asking, which is how Windows Server NIC teaming reacts to changed hardware.) The malfunctioning port is in one of the 10 Gbit teams. All teams use the Switch Independent teaming mode (since there's no switch).

Best Answer

This is an important question and I'd say a more common scenario than it appears from your searching.

As you may know, Windows Server NIC teaming offers three teaming modes:

  1. Switch Independent (an Active / Standby setup is configured within this mode)
  2. Static Teaming
  3. LACP

Based on your statement about whether you will have to "tear everything down", it sounds to me like you are using Static teaming, which requires more manual config than the other two.

Regarding replacing the NIC.

Regardless of which teaming mode you use, you have to make sure that your dead NIC is disabled in the team settings before unplugging anything!

Will it be picked up by teaming when you plug the new NIC in? Yes, but depending on which configuration you're using, you may need to manually add it to your team.

  1. Remove NIC from team
  2. Remove physical NIC
  3. Replace physical NIC
  4. Add new NIC to team
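
The steps above can be sketched as a small Python helper that only builds the PowerShell commands for a swap, so they can be reviewed before anything is run on the host. This is a minimal sketch, not an official procedure: the team and adapter names are hypothetical placeholders, and it assumes the built-in NetLbfo cmdlets (`Remove-NetLbfoTeamMember`, `Add-NetLbfoTeamMember`, `Get-NetLbfoTeam`, `Get-NetLbfoTeamMember`) are what you use to manage the team.

```python
from __future__ import annotations


def ps_quote(value: str) -> str:
    """Wrap a value in single quotes for PowerShell, doubling embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def swap_plan(team: str, old_nics: list[str], new_nics: list[str]) -> list[str]:
    """Return the ordered PowerShell lines for swapping a card used by `team`.

    All names passed in are placeholders chosen by the caller; nothing is executed.
    """
    steps = []
    # Step 1: take every port on the failing card out of the team first.
    for nic in old_nics:
        steps.append(
            f"Remove-NetLbfoTeamMember -Name {ps_quote(nic)} -Team {ps_quote(team)}"
        )
    # Steps 2 and 3 are physical work, so they appear as a PowerShell comment.
    steps.append("# Shut down, swap the physical card, boot up again")
    # Step 4: add the ports of the replacement card to the team.
    for nic in new_nics:
        steps.append(
            f"Add-NetLbfoTeamMember -Name {ps_quote(nic)} -Team {ps_quote(team)}"
        )
    # Afterwards, list the team and its members to check their status.
    steps.append(f"Get-NetLbfoTeam -Name {ps_quote(team)}")
    steps.append(f"Get-NetLbfoTeamMember -Team {ps_quote(team)}")
    return steps


if __name__ == "__main__":
    # Hypothetical names; substitute what Get-NetAdapter and Get-NetLbfoTeam report.
    for line in swap_plan("Storage-Team", ["SLOT 2 Port 1"], ["SLOT 2 Port 1"]):
        print(line)
```

Printing the plan rather than running it keeps the order of operations visible: the old ports leave the team before the card is touched, and the new ports are added only after the machine is back up.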

Check out this document from Microsoft TechNet for reference, in particular section 4.6, "Checking the status of a team". There are options for editing team settings visually or through PowerShell.

Regarding MAC address and cluster networks.

Again, per the documentation, the receivers of teamed traffic resolve the team's single IP address to one primary MAC address taken from the pool of team members. As such, if you follow the steps in the linked documentation, you should not run into errors with the MAC address config.

In summary.

I once had to conduct a post-incident review in a similar situation. The engineer planned to shut down a switch to replace it, but didn't remove it from the pool first. This meant that when he shut the switch down, all network traffic was lost and errors were shown on 250k+ end user devices. ¯_(ツ)_/¯

Check out the docs; there is some other stuff specific to Hyper-V that might make more sense to you.