We have a setup here with around 30 Dom0s running Xen 4.0.3 with a Dom0 Kernel of 2.6.32.57 x86_64.
(We have seen the same behaviour with Xen 4.0.1 and kernel 2.6.32.2X earlier)
Sometimes, all of a sudden, Xen stops adding vifs for new (or migrated) DomUs correctly. The interfaces are there and are added to the correct bridge, but the bridge port never receives any traffic. Interfaces that were already connected at that point keep working without problems. This happens to all bridges on the Dom0 at the same time (we have 11 bridges for 11 VLANs and 4 physical interfaces per host; STP on the bridges is off).
When it happens, I see this in the log when adding an interface through Xen. What seems to be missing is the bridge entering the forwarding state for the newly added interfaces:
[809766.761058] device r624-eth0 entered promiscuous mode
[809766.773664] br-vlan2801: port 1(r624-eth0) entering learning state
[809766.857665] device r624-eth1 entered promiscuous mode
[809766.872226] br-vlan2802: port 2(r624-eth1) entering learning state
[809768.377613] blkback: ring-ref 8, event-channel 8, protocol 2 (x86_32-abi)
[809776.810481] r624-eth0: no IPv6 routers present
[809777.870549] r624-eth1: no IPv6 routers present
The IP of r624-eth0 is not pingable afterwards. tcpdump -i br-vlan2801 shows the ARP requests of the pinging host; tcpdump -i r624-eth0 shows nothing. So, to my understanding, the packets reach the bridge but are not forwarded to the vif. Taking the bridge down via ifconfig br-vlan2801 down does not help, but deleting and recreating the bridge solves the problem. This leads me to the conclusion that Xen is not part of the problem here.
If I just restart the bridge interface via ifconfig br-vlan2801 down / up, I see this:
Jul 5 16:43:52 kernel: [811367.029655] br-vlan2159: port 4(b434-eth1) entering disabled state
Jul 5 16:43:52 kernel: [811367.029893] br-vlan2159: port 3(d434-eth1) entering disabled state
Jul 5 16:43:52 kernel: [811367.030121] br-vlan2159: port 2(w434-eth1) entering disabled state
Jul 5 16:43:52 kernel: [811367.030350] br-vlan2159: port 1(eth0.2159) entering disabled state
Jul 5 16:44:15 kernel: [811389.818841] br-vlan2159: port 4(b434-eth1) entering learning state
Jul 5 16:44:15 kernel: [811389.819076] br-vlan2159: port 3(d434-eth1) entering learning state
Jul 5 16:44:15 kernel: [811389.819307] br-vlan2159: port 2(w434-eth1) entering learning state
Jul 5 16:44:15 kernel: [811389.819536] br-vlan2159: port 1(eth0.2159) entering learning state
Jul 5 16:44:25 kernel: [811399.959567] br-vlan2159: no IPv6 routers present
If I delete the bridge and re-configure it, I see this when the bridge comes up again:
Jul 5 16:47:23 kernel: [811578.178683] br-vlan2159: port 4(w434-eth1) entering learning state
Jul 5 16:47:23 kernel: [811578.178917] br-vlan2159: port 3(eth0.2159) entering learning state
Jul 5 16:47:23 kernel: [811578.179146] br-vlan2159: port 2(d434-eth1) entering learning state
Jul 5 16:47:23 kernel: [811578.179374] br-vlan2159: port 1(b434-eth1) entering learning state
Jul 5 16:47:34 kernel: [811588.789566] br-vlan2159: no IPv6 routers present
Jul 5 16:47:38 kernel: [811593.178568] br-vlan2159: port 4(w434-eth1) entering forwarding state
Jul 5 16:47:38 kernel: [811593.178801] br-vlan2159: port 3(eth0.2159) entering forwarding state
Jul 5 16:47:38 kernel: [811593.179029] br-vlan2159: port 2(d434-eth1) entering forwarding state
Jul 5 16:47:38 kernel: [811593.179255] br-vlan2159: port 1(b434-eth1) entering forwarding state
After this the bridge and all interfaces connected to it are working as expected.
As it happens to all bridges at the same time, I would not blame the brctl tools for this but something deeper inside the kernel. Since it happens at random and only every other month, I have no way to cross-check it with a newer kernel.
The main question (to my understanding) is: why is the bridge not entering the forwarding state on the newly added ports?
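To see at a glance which ports are stuck, the per-port STP state can be read from sysfs instead of being inferred from the kernel log. The following is a minimal sketch, assuming the bridge exposes its ports under /sys/class/net/<bridge>/brif/<port>/state (state 2 is learning, 3 is forwarding). The helper names state_name and stuck_ports and the SYSFS_NET override are assumptions made for illustration.

```shell
#!/bin/sh
# List the ports of a bridge that are NOT in the forwarding state.
# SYSFS_NET is a hypothetical override for testing; it defaults to the real sysfs path.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Map the numeric bridge port state to its name.
state_name() {
    case "$1" in
        0) echo disabled ;;
        1) echo listening ;;
        2) echo learning ;;
        3) echo forwarding ;;
        4) echo blocking ;;
        *) echo "unknown($1)" ;;
    esac
}

# Print "<port> <state>" for every port of the given bridge that is not forwarding.
stuck_ports() {
    bridge="$1"
    for f in "$SYSFS_NET/$bridge"/brif/*/state; do
        [ -r "$f" ] || continue
        port=$(basename "$(dirname "$f")")
        state=$(cat "$f")
        [ "$state" = 3 ] || echo "$port $(state_name "$state")"
    done
}
```

Calling stuck_ports br-vlan2801 on an affected Dom0 would be expected to list the newly added vif as learning, while the ports that were attached earlier produce no output.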
Best Answer
What works for us if ports hang in learning:
Remove and add the interface from the bridge with 0 forwarding delay:
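A minimal sketch of what this could look like with brctl follows. The bridge and port names are the ones from the question; the helper name rejoin_port and the BRCTL override are assumptions for illustration and not necessarily the exact commands used here.

```shell
#!/bin/sh
# Re-attach a port that hangs in the learning state, with forwarding delay 0.
# BRCTL is a hypothetical override (e.g. BRCTL="echo brctl") for a dry run.
BRCTL="${BRCTL:-brctl}"

rejoin_port() {
    bridge="$1"   # e.g. br-vlan2801
    port="$2"     # e.g. r624-eth0
    # Set the bridge forwarding delay to 0, then remove and re-add the port.
    $BRCTL setfd "$bridge" 0 &&
    $BRCTL delif "$bridge" "$port" &&
    $BRCTL addif "$bridge" "$port"
}
```

Usage would be rejoin_port br-vlan2801 r624-eth0, run as root on the Dom0. Note that setfd changes the delay for the whole bridge, which should be harmless with STP turned off as in the question.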