Monday 17 May 2021

Calico Node, more like Calico No

I spent a happy few hours over the weekend trying to work out why my Kubernetes 1.21 cluster wasn't behaving as expected.

I was seeing a bunch o' weirdness whereby certain pods weren't able to access certain services, which manifested itself when I tried, and failed, to create a DataVolume using the KubeVirt Containerised Data Importer (CDI) capability.

I was seeing exceptions such as: -

Error from server (InternalError): error when creating "create_volume.yaml": Internal error occurred: failed calling webhook "datavolume-mutate.cdi.kubevirt.io": Post "https://cdi-api.cdi.svc:443/datavolume-mutate?timeout=30s": dial tcp 10.102.58.243:443: i/o timeout

from: -

kubectl apply -f create_volume.yaml
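
For reference, create_volume.yaml is a registry-sourced DataVolume along these lines - the image URL and storage size below are illustrative placeholders rather than my actual values: -

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: registry-image-datavolume
spec:
  source:
    registry:
      url: "docker://quay.io/example/example-image:latest" # placeholder image
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 5Gi # placeholder size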

After much digging and DNS debugging, including using BusyBox to resolve various K8s services: -

kubectl run -it --rm --restart=Never busybox --image=gcr.io/google-containers/busybox -- nslookup cdi-api.cdi

Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      cdi-api.cdi
Address 1: 10.102.58.243 cdi-api.cdi.svc.cluster.local
pod "busybox" deleted

kubectl run -it --rm --restart=Never busybox --image=gcr.io/google-containers/busybox -- nslookup kubernetes.default

Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
pod "busybox" deleted

which all looked OK, so something inspired me to look at Calico Node, which provides the networking layer overlaying my cluster: -

kubectl get pods -A|grep calico

kube-system   calico-kube-controllers-bf965bfd8-hg82b          1/1     Running   0          58m
kube-system   calico-node-8zkvt                                0/1     Running   0          7m24s
kube-system   calico-node-srmj6                                0/1     Running   0          7m47s

Noticing that both calico-node pods were showing 0/1 rather than 1/1, meaning that the calico-node container in each pod was failing its readiness probe, I dug further: -

kubectl describe pod `kubectl get pods -A|grep calico-node|awk '{print $2}'` --namespace kube-system

which, in part, showed: -

  Warning  Unhealthy  9m44s  kubelet            Readiness probe failed: calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp 127.0.0.1:9099: connect: connection refused
  Warning  Unhealthy  9m42s  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy  9m32s  kubelet            Readiness probe failed: 2021-05-17 11:33:08.203 [INFO][197] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  9m22s  kubelet  Readiness probe failed: 2021-05-17 11:33:18.195 [INFO][231] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  9m12s  kubelet  Readiness probe failed: 2021-05-17 11:33:28.278 [INFO][268] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  9m2s  kubelet  Readiness probe failed: 2021-05-17 11:33:38.334 [INFO][301] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  8m52s  kubelet  Readiness probe failed: 2021-05-17 11:33:48.182 [INFO][319] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  8m42s  kubelet  Readiness probe failed: 2021-05-17 11:33:58.266 [INFO][356] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  8m32s  kubelet  Readiness probe failed: 2021-05-17 11:34:08.185 [INFO][377] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  4m42s (x23 over 8m22s)  kubelet  (combined from similar events): Readiness probe failed: 2021-05-17 11:37:58.234 [INFO][1014] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
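
As an aside, when calicoctl is installed on a node, running it there (as root) gives a more direct view of the BGP peering state - each peer should show as Established once things are healthy: -

sudo calicoctl node status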

Knowing that my firewall configuration - iptables - was clean n' green, in that I'd opened up the Border Gateway Protocol (BGP) port 179 on both the Control Plane and Compute nodes: -

iptables -A INPUT -p tcp -m tcp --dport 179 -j ACCEPT
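
On each node, listing the INPUT rules confirms that's in place: -

iptables -S INPUT | grep 179

-A INPUT -p tcp -m tcp --dport 179 -j ACCEPT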

I looked back through my notes, and remembered the issue with IP_AUTODETECTION_METHOD and the Calico Node daemonset.

I checked the daemonset: -

kubectl get daemonset -A

NAMESPACE     NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   calico-node    2         2         0       2            0           kubernetes.io/os=linux   64m
kube-system   kube-proxy     2         2         2       2            2           kubernetes.io/os=linux   112m
kubevirt      virt-handler   1         1         1       1            1           kubernetes.io/os=linux   30m

and noticed that the calico-node daemonset was, like the pods, showing as unready (0, rather than 2, in the READY column).

I inspected the offending daemonset: -

kubectl get daemonset/calico-node -n kube-system --output json | jq '.spec.template.spec.containers[].env[] | select(.name | startswith("IP"))'

{
  "name": "IP",
  "value": "autodetect"
}

noting that only IP was set (to autodetect), and that the IP_AUTODETECTION_METHOD environment variable wasn't explicitly set, leaving Calico to pick an interface on its own.
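
As a sanity check, it's easy to confirm which adapter carries which address by running ip on each node, e.g. over SSH: -

ip -4 addr show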

Given that the VMs that host my K8s nodes have TWO network adapters, eth0 and eth1, and that I want Calico Node to use eth0, which carries the private IP, I explicitly set that: -

kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eth0

daemonset.apps/calico-node env updated

and then validated the change: -

kubectl get daemonset/calico-node -n kube-system --output json | jq '.spec.template.spec.containers[].env[] | select(.name | startswith("IP"))'

{
  "name": "IP",
  "value": "autodetect"
}
{
  "name": "IP_AUTODETECTION_METHOD",
  "value": "interface=eth0"
}
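
Setting the environment variable triggers a rolling restart of the calico-node pods, and kubectl rollout status will wait for that to complete, eventually reporting something like: -

kubectl rollout status daemonset/calico-node -n kube-system

daemon set "calico-node" successfully rolled out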

More importantly, the Calico Node pods are happy: -

kubectl get pods -A

NAMESPACE     NAME                                             READY   STATUS    RESTARTS   AGE
cdi           cdi-apiserver-6b87945b8d-dww25                   1/1     Running   0          120m
cdi           cdi-deployment-86c6d76d98-7cxlv                  1/1     Running   0          120m
cdi           cdi-operator-5757c84894-xhw6r                    1/1     Running   0          120m
cdi           cdi-uploadproxy-79dd97b4d5-lvd72                 1/1     Running   0          120m
kube-system   calico-kube-controllers-bf965bfd8-hg82b          1/1     Running   0          154m
kube-system   calico-node-8llqm                                1/1     Running   0          75m
kube-system   calico-node-j9rdb                                1/1     Running   0          75m
kube-system   coredns-558bd4d5db-fml9w                         1/1     Running   0          3h21m
kube-system   coredns-558bd4d5db-gg8dm                         1/1     Running   0          3h21m
kube-system   etcd-grouched1.fyre.ibm.com                      1/1     Running   0          3h22m
kube-system   kube-apiserver-grouched1.fyre.ibm.com            1/1     Running   0          3h22m
kube-system   kube-controller-manager-grouched1.fyre.ibm.com   1/1     Running   1          3h22m
kube-system   kube-proxy-47txj                                 1/1     Running   0          3h19m
kube-system   kube-proxy-hg7f8                                 1/1     Running   0          3h21m
kube-system   kube-scheduler-grouched1.fyre.ibm.com            1/1     Running   0          3h22m
kubevirt      virt-api-58999dff54-c8mch                        1/1     Running   0          120m
kubevirt      virt-api-58999dff54-gs8pm                        1/1     Running   0          120m
kubevirt      virt-controller-5c68c56896-l2rp7                 1/1     Running   0          120m
kubevirt      virt-controller-5c68c56896-phrt9                 1/1     Running   0          120m
kubevirt      virt-handler-85dhc                               1/1     Running   0          120m
kubevirt      virt-operator-78f65c88d4-ldtgj                   1/1     Running   0          123m
kubevirt      virt-operator-78f65c88d4-tmxhs                   1/1     Running   0          123m

as is the daemonset: -

kubectl get daemonset -A

NAMESPACE     NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   calico-node    2         2         2       2            2           kubernetes.io/os=linux   154m
kube-system   kube-proxy     2         2         2       2            2           kubernetes.io/os=linux   3h23m
kubevirt      virt-handler   1         1         1       1            1           kubernetes.io/os=linux   120m

and I can now create my DataVolume: -

kubectl apply -f create_volume.yaml

datavolume.cdi.kubevirt.io/registry-image-datavolume created
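
From here, kubectl get datavolume lets me keep an eye on the import - the phase should work its way through to Succeeded once CDI has pulled the image down: -

kubectl get datavolume registry-image-datavolume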

2 comments:

Unknown said...

Hello Dave

Thank you very much for this blog post, I had been stuck on this problem for several weeks now, I was unable to span my k8s cluster across multiple networks.

Thanks again

Dave Hay said...

Awesome news, thanks for letting us know
