Hyper-V Live Migration Only Goes One Way (Error 21502)

failovercluster live-migration windows-cluster windows-server-2008-r2

We've recently been running into an issue with one of our server stacks. Our two 2008 R2 servers run in a cluster set up to live migrate VMs between each other if a fault is ever detected.

The servers are the exact same hardware-wise; they were ordered specifically for this purpose. Live migration had been working fine up until a couple months ago when we noticed that VIR001 could not migrate to VIR002. I've looked into this issue and I know that generally it is caused by improperly-named resources, but that doesn't seem to be the case here.

VIR002 will live migrate any of its hosted VMs over to VIR001. VIR001 will not live migrate any VMs over to VIR002. I'm not sure where to start with this. I've noticed a couple of Time-Server errors on VIR001, but if the issue were due to a sync problem, wouldn't both servers experience the same issue?

Right now, looking for ideas on what to check. Thanks,

(Update: I've run the Failover Cluster Validation tool and it found no issues. I could not run the Disk validation because our storage is still online in the cluster. Both servers in question are also set as possible owners for the cluster resources.)

Best Answer

Well, finally found the issue:

I noticed that some of the created cluster networks were not legitimate (i.e., they contained only one NIC, or paired it with a NIC on a different subnet), so I disabled these. My colleagues told me that the network binding on the physical servers could make a difference, so I changed that as well. I verified the cluster, made sure all resources had both servers listed as possible owners, and, to top it off, found the "Network for Live Migration" tab under Properties for the Virtual Machine resource.

I had ordered the cluster networks in "Network for Live Migration" so that the Live Migration cluster network was first, followed by all active networks, with the disabled networks at the bottom. No love. Today, after changing the binding and seeing no change, I decided to disable all cluster networks in the Live Migration tab except three internal networks (LM, host, Cluster Domain). Now it's working.
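For anyone hitting the same thing, the "not legitimate" check on the cluster networks can be sketched roughly as below. This is only an illustration in Python, not something I ran against the cluster: the network names, addresses, and the helper functions are all made up, and the inventory is assumed to be copied by hand from Failover Cluster Manager. It flags any cluster network that does not have a NIC from every node, or whose NICs sit on different subnets.

```python
from ipaddress import ip_interface

# Hypothetical inventory, transcribed by hand; names and addresses are invented.
NODES = {"VIR001", "VIR002"}
CLUSTER_NETWORKS = {
    "LM": [("VIR001", "10.0.1.11/24"), ("VIR002", "10.0.1.12/24")],
    "Host": [("VIR001", "10.0.2.11/24"), ("VIR002", "10.0.2.12/24")],
    "Orphan": [("VIR001", "192.168.50.11/24")],  # only one NIC
    "Mixed": [("VIR001", "10.0.3.11/24"), ("VIR002", "10.0.4.12/24")],  # two subnets
}


def usable_for_live_migration(interfaces, nodes):
    """True if every node has a NIC on the network and all NICs share one subnet."""
    present = {node for node, _ in interfaces}
    subnets = {ip_interface(addr).network for _, addr in interfaces}
    return present == set(nodes) and len(subnets) == 1


def suspect_networks(networks, nodes):
    """Return the names of networks that fail the check, sorted alphabetically."""
    return sorted(
        name
        for name, interfaces in networks.items()
        if not usable_for_live_migration(interfaces, nodes)
    )


if __name__ == "__main__":
    for name in suspect_networks(CLUSTER_NETWORKS, NODES):
        print("Candidate to untick for live migration:", name)
```

Anything this flags is a candidate to untick in the "Network for Live Migration" tab rather than just move to the bottom of the list, since ordering alone did not help in my case.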

Not sure what caused this to begin with. We haven't made any physical changes to the hardware in the last year. This was working at least 4 months ago. Looks like the Cluster manager doesn't always listen to its own settings.

Thanks for the replies on this question.
