Why does the BGP OPEN message get "Connect Socket: Connection reset by peer" when the node is on a different subnet/gateway?

bare-metal, bgp, calico, kubernetes

My network setup:

[diagram: Kubernetes network setup]

With this setup, only nodes on the same subnet can establish a BGP connection. The remaining nodes complete the full 3-way TCP handshake, but respond to the OPEN message with a [FIN, ACK] followed by a [RST], hence the "Connection reset by peer" message in my calicoctl node status output below, taken on controller 3 (10.0.3.100):

IPv4 BGP status
+--------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |              INFO              |
+--------------+-------------------+-------+----------+--------------------------------+
| 10.0.1.100   | node-to-node mesh | start | 07:12:01 | Connect Socket: Connection     |
|              |                   |       |          | closed                         |
| 10.0.2.100   | node-to-node mesh | start | 07:12:01 | Connect                        |
| 10.0.1.101   | node-to-node mesh | start | 07:12:01 | Connect Socket: Connection     |
|              |                   |       |          | reset by peer                  |
| 10.0.1.102   | node-to-node mesh | start | 07:12:01 | Connect Socket: Connection     |
|              |                   |       |          | reset by peer                  |
| 10.0.2.102   | node-to-node mesh | start | 07:12:01 | Connect Socket: Connection     |
|              |                   |       |          | reset by peer                  |
| 10.0.3.101   | node-to-node mesh | up    | 07:14:13 | Established                    |
| 10.0.3.102   | node-to-node mesh | up    | 07:12:02 | Established                    |
+--------------+-------------------+-------+----------+--------------------------------+

My Wireshark dump of the handshake + OPEN message from controller 3 (10.0.3.100) to node 4 (10.0.2.102):

[screenshot: Wireshark BGP trace between 10.0.3.100 and 10.0.2.102]
[screenshot: Wireshark BGP trace between 10.0.0.4 (10.0.3.100) and 10.0.2.102]
Maybe the issue is that node 4 sees the data coming from 10.0.0.4 and not 10.0.3.100?
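
One way to test that hypothesis without touching BIRD is the minimal sketch below (assuming Python 3 is available on both hosts; port 17900 is an arbitrary free port, nothing Calico-specific): run the serve side on node 4 and the connect side on controller 3, then check whether node 4 reports the peer as 10.0.3.100 or as the gateway address 10.0.0.4.

    # check_src.py - a rough check of which source address node 4 actually sees.
    # Port 17900 is an arbitrary unprivileged port (assumption, not used by Calico).
    import socket
    import sys

    def serve(port=17900):
        """Run on node 4: accept one connection and print the peer address."""
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("0.0.0.0", port))
            srv.listen(1)
            conn, peer = srv.accept()
            print(f"connection arrived from {peer[0]}:{peer[1]}")  # 10.0.3.100 if no NAT, 10.0.0.4 if NATed
            conn.close()

    def connect(host, port=17900):
        """Run on controller 3: open a plain TCP connection to node 4."""
        with socket.create_connection((host, port), timeout=5) as c:
            print("connected, local address:", c.getsockname())

    if __name__ == "__main__":
        if sys.argv[1] == "serve":
            serve()
        else:
            connect(sys.argv[2])

Usage: python3 check_src.py serve on node 4, then python3 check_src.py connect 10.0.2.102 on controller 3.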

What works

  1. Ping from every node to every other node succeeds
  2. nc to port 179 on all nodes succeeds (a scripted version of this check is sketched after this list)
  3. Wireshark shows the full TCP handshake from controller 3 to node 4
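
For completeness, a scripted version of check 2 (just a sketch; the peer list is copied from the status table above, and like nc it only proves the TCP handshake completes, not that the BGP session survives the OPEN):

    # Rough equivalent of the manual nc checks: try to open TCP port 179 on every peer.
    import socket

    # Peer addresses copied from the calicoctl node status output above.
    PEERS = ["10.0.1.100", "10.0.2.100", "10.0.1.101",
             "10.0.1.102", "10.0.2.102", "10.0.3.101", "10.0.3.102"]

    for ip in PEERS:
        try:
            with socket.create_connection((ip, 179), timeout=3):
                print(f"{ip}:179 reachable (TCP handshake completed)")
        except OSError as exc:
            print(f"{ip}:179 failed: {exc}")

All of these succeed, which is consistent with the failure above: the reset only happens after the handshake, once the OPEN message has been sent.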

Setup

  1. Kubernetes 1.21.1 (installed via kubespray)
  2. Calico 3.9 (default in kubespray)
  3. All gateways are pfSense 2.5.x; the "master" gateway has static routes for
     10.0.1.0/24 via 10.0.0.2, 10.0.2.0/24 via 10.0.0.3, and 10.0.3.0/24 via 10.0.0.4
     (restated as a lookup table after this list).
  4. Firewalls are disabled on the datacenter routers on both WAN and LAN. No NAT is
     enabled on any of the pfSense boxes (the only NAT is for the IPsec VPN on the
     master gateway's WAN port).
  5. As far as I can tell, I have full IP connectivity between all nodes in all subnets.
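
To make item 3 concrete, here is the master gateway's static route table restated as a small lookup sketch (only the routes listed above; no pfSense specifics assumed):

    # The master gateway's static routes from item 3, as a next-hop lookup.
    import ipaddress

    STATIC_ROUTES = {
        ipaddress.ip_network("10.0.1.0/24"): ipaddress.ip_address("10.0.0.2"),
        ipaddress.ip_network("10.0.2.0/24"): ipaddress.ip_address("10.0.0.3"),
        ipaddress.ip_network("10.0.3.0/24"): ipaddress.ip_address("10.0.0.4"),
    }

    def next_hop(dst: str):
        addr = ipaddress.ip_address(dst)
        for net, gw in STATIC_ROUTES.items():
            if addr in net:  # the prefixes do not overlap, so the first match is enough
                return gw
        return None  # anything else falls through to the default route

    # Controller 3 (10.0.3.100) -> node 4 (10.0.2.102) should be forwarded via 10.0.0.3,
    # and the return traffic comes back via 10.0.0.4.
    print(next_hop("10.0.2.102"))  # 10.0.0.3
    print(next_hop("10.0.3.100"))  # 10.0.0.4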

Best Answer

I wrongly assumed pfSense's automatic outbound NAT was only for IPsec passthrough. Once I disabled all outbound NAT rule generation, BGP peering started working as intended. My fault for not understanding the setting on my pfSense routers.
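
My understanding of why the outbound NAT broke the mesh (an illustration only, not BIRD's or Calico's actual code): a BGP speaker only accepts sessions from addresses it has configured as peers, so once the gateway rewrote the source of controller 3's packets to 10.0.0.4, node 4 no longer recognised the peer and tore the session down right after the OPEN.

    # Illustration only (not BIRD's actual logic). Node 4's peers in the
    # node-to-node mesh are all the other nodes' addresses.
    NODE4_PEERS = {"10.0.1.100", "10.0.1.101", "10.0.1.102",
                   "10.0.2.100",
                   "10.0.3.100", "10.0.3.101", "10.0.3.102"}

    def accept_session(source_ip: str) -> bool:
        """Would node 4 recognise this source address as a configured BGP peer?"""
        return source_ip in NODE4_PEERS

    print(accept_session("10.0.3.100"))  # True:  no outbound NAT, session establishes
    print(accept_session("10.0.0.4"))    # False: source rewritten to the gateway, session reset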