I am experiencing connection issues inside an LXC container that are driving me mad. They are intermittent: they appear for some time, and then they suddenly disappear.
Scenario
An LXC container inside a host. Both are running Debian GNU/Linux 8.3.
The container runs Piwik (open source PHP software for stats, with Apache and MySQL) and an SSH server. Apache in the container is reachable through an nginx proxy on the host.
The lxc config:
lxc.tty = 6
lxc.pts = 1024
lxc.rootfs = /var/lib/lxc/hammond/rootfs
lxc.cgroup.devices.deny = a
# /dev/null and zero
lxc.cgroup.devices.allow = c 1:3 rwm
lxc.cgroup.devices.allow = c 1:5 rwm
# consoles
lxc.cgroup.devices.allow = c 5:1 rwm
lxc.cgroup.devices.allow = c 5:0 rwm
lxc.cgroup.devices.allow = c 4:0 rwm
lxc.cgroup.devices.allow = c 4:1 rwm
# /dev/{,u}random
lxc.cgroup.devices.allow = c 1:9 rwm
lxc.cgroup.devices.allow = c 1:8 rwm
lxc.cgroup.devices.allow = c 136:* rwm
lxc.cgroup.devices.allow = c 5:2 rwm
# rtc
lxc.cgroup.devices.allow = c 254:0 rwm
# mounts point
lxc.mount.entry=proc /var/lib/lxc/hammond/rootfs/proc proc nodev,noexec,nosuid 0 0
lxc.mount.entry=devpts /var/lib/lxc/hammond/rootfs/dev/pts devpts defaults 0 0
lxc.mount.entry=sysfs /var/lib/lxc/hammond/rootfs/sys sysfs defaults 0 0
# networking
lxc.utsname = hammond
lxc.network.type = veth
#lxc.network.macvlan.mode = private
lxc.network.flags = up
lxc.network.link = br-hammond
lxc.network.ipv4 = 192.168.100.2/24
lxc.network.ipv4.gateway = 192.168.100.1
lxc.network.hwaddr = 00:1E:10:C1:6B:C9
lxc.start.auto = 1
# http://serverfault.com/questions/658052/systemd-journal-in-debian-jessie-lxc-container-eats-100-cpu
lxc.autodev = 1
lxc.kmsg = 0
Issues:
1. Cannot connect to local database
Suddenly, Piwik reports:
SQLSTATE[HY000] [2003] Can't connect to MySQL server on '127.0.0.1' (111)
The database is running, of course.
- If I telnet to the database from inside the container (127.0.0.1:3306), I can connect.
- If I telnet to Apache from inside the container (127.0.0.1:80), Piwik works fine. It connects to the database, renders the page as usual and doesn't report any error.
- If I telnet to Apache from the host (192.168.100.2:80), Piwik reports the database error.
2. SSH freezes
I am tunneling the SSH connection to the container using ProxyCommand:
ProxyCommand ssh -q host nc -q0 192.168.100.2 22
After the SSH negotiation phase, the connection freezes. If I type keys, they don't show up in the console. Finally, the connection times out with
packet_write_wait: Connection to UNKNOWN: Broken pipe
I have sniffed the packets with tcpdump and the SSH key exchange goes fine. Then the traffic stops after 0.5 seconds.
I think this is a bug in the latest Debian kernel updates. It used to work fine, but I have been experiencing these problems for a few weeks. As I mentioned, they are intermittent: suddenly, everything works fine again.
Suggestions on how to investigate further are welcome.
Best Answer
I've had a problem with the same symptoms. In my case, there was another host with the same IP on the VLAN I used in the bridge. Sometimes the other host would be faster to answer the ARP request (despite being another physical machine), at which point the LXC guest would save the wrong MAC address in its ARP table and continue sending Ethernet frames to the wrong address until another ARP request "resolved" the problem.
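One way to check for this kind of duplicate is to send a few ARP probes for the address and count how many distinct MACs answer. The following is only a sketch: it assumes the iputils version of arping (flags -c and -I), and the interface name eth0 in the usage comment is a placeholder, not something taken from my setup. The helper name distinct_macs is made up for illustration.

```shell
#!/bin/sh
# Print every distinct MAC address found on stdin, one per line, lowercased.
# (Hypothetical helper; it only parses text and does not send any packets.)
distinct_macs() {
    grep -oiE '([0-9a-f]{2}:){5}[0-9a-f]{2}' | tr 'A-F' 'a-f' | sort -u
}

# Usage, from inside the guest (needs root; eth0 is an assumed interface name):
#   arping -c 5 -I eth0 192.168.100.1 | distinct_macs
# More than one line of output means more than one machine answered for that IP.
```

If two MACs show up for a single IP, the ARP table can flip between them depending on which reply arrives first.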
I diagnosed this with a timestamped ping from host to guest:
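The exact command is not shown here. A minimal sketch of a timestamped ping, assuming a POSIX shell and using the guest address from the question (192.168.100.2) as an example; stamp_lines is a hypothetical helper name:

```shell
#!/bin/sh
# Prefix every line read from stdin with the current local date and time.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Usage, on the host (the log file name is arbitrary):
#   ping 192.168.100.2 | stamp_lines | tee ping-guest.log
```

Gaps in the log then give the start and end time of each outage, which can be matched against the other logs.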
as well as a tcpdump on both host and guest:
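Again, the original command is not preserved. A capture along these lines would show ARP and ICMP traffic with MAC addresses and full timestamps. It is a sketch: the function names are made up, and the interface names in the usage comment (br-hammond on the host, eth0 in the guest) are assumptions based on the question's config.

```shell
#!/bin/sh
# Build a pcap filter that matches all ARP traffic plus ICMP to or from one IP.
capture_filter() {
    printf 'arp or (icmp and host %s)\n' "$1"
}

# Run tcpdump on an interface with that filter (needs root).
#   -e     print link-level headers, so source and destination MACs are visible
#   -n     do not resolve addresses to names
#   -tttt  print a full date and time on every line
capture_arp_icmp() {
    tcpdump -e -n -tttt -i "$1" "$(capture_filter "$2")"
}

# Usage:
#   on the host:   capture_arp_icmp br-hammond 192.168.100.2
#   in the guest:  capture_arp_icmp eth0 192.168.100.1
```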
which allowed me to see that around the point when the network would drop out and when it would reactivate, ARP requests were being issued and answered. The ARP requests seemed to be in order (using the correct MACs), but I decided to check the facts as seen by the OS anyway, so I logged ARP tables on host and guest with timestamps:
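The logging command itself is missing here. One possible way to do it, assuming iproute2 is available (ip neigh show), is sketched below; lladdr_for and watch_neighbour are hypothetical names.

```shell
#!/bin/sh
# Extract the MAC recorded for one IP from `ip neigh show` output on stdin.
# Entries without a resolved MAC (e.g. FAILED) produce no output.
lladdr_for() {
    awk -v ip="$1" '$1 == ip { for (i = 1; i < NF; i++) if ($i == "lladdr") print $(i + 1) }'
}

# Log the MAC currently stored for an IP, once per interval (default 1 second),
# until interrupted with Ctrl-C.
watch_neighbour() {
    ip_addr=$1
    interval=${2:-1}
    while :; do
        mac=$(ip neigh show | lladdr_for "$ip_addr")
        printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$ip_addr" "${mac:-<no entry>}"
        sleep "$interval"
    done
}

# Usage, in the guest:  watch_neighbour 192.168.100.1 | tee arp-guest.log
# Usage, on the host:   watch_neighbour 192.168.100.2 | tee arp-host.log
```

A change of the logged MAC at the moment the ping log shows an outage points at a second machine answering for the same IP.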
which allowed me to understand that the host did not have a faulty MAC for the guest, but the guest had somehow arrived at a faulty MAC for the host. Irritatingly, that was not reflected in the tcpdump information. (NB: there may be a race condition somewhere in libpcap or the IP stack that would benefit from investigating.)
After finding the erroneous MAC, I looked up which vendor the erroneous MAC address belonged to, and thus was able to find the offending machine. If that information had been more ambiguous, I'm sure my switch would've had functionality to help me find the right switch port.
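The vendor lookup can be done offline. This sketch assumes a local copy of the IEEE OUI list in the usual oui.txt layout (lines such as "00-1E-10   (hex)   Vendor name"); the default path is where I would expect Debian's ieee-data package to put it, which is an assumption. Both function names are made up.

```shell
#!/bin/sh
# Turn a MAC address into the OUI prefix format used in oui.txt,
# e.g. 00:1e:10:c1:6b:c9 becomes 00-1E-10.
oui_prefix() {
    printf '%s\n' "$1" | tr 'a-f' 'A-F' | tr ':' '-' | cut -c1-8
}

# Print the oui.txt lines matching the vendor prefix of a MAC address.
# Second argument: path to oui.txt (the default path is an assumption).
lookup_vendor() {
    grep -F "$(oui_prefix "$1")" "${2:-/usr/share/ieee-data/oui.txt}"
}

# Usage:  lookup_vendor 00:1e:10:c1:6b:c9
```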
I suppose that up/downgrading kernels and certain userland tools would change and maybe even remove all or some of the symptoms through changed timings, slightly different behavior, other network services being active etc. For example, a ping from guest to host would reliably "fix" the problem in my case.
Also, do not forget that the IP addresses you can see with ifconfig are not all of the IP addresses used by the system. On Linux, ip addr ls will be more comprehensive, and maybe even some more advanced iptables configurations could play a role too. If you are unlucky, the host responding to the ARP requests may even have a broken IP stack. You may even get ARP replies from other customers of your ISP if your network isn't properly isolated.
I realize that this might not be the exact solution to your problem, but I thought I'd leave some pointers for debugging for the next person to look for and find this issue on serverfault.