Life of a Packet in Kubernetes — Part 2

As we discussed in Part 1, CNI plugins play an essential role in Kubernetes networking. There are many third-party CNI plugins available today; Calico is one of them. Many engineers prefer Calico; one of the main reasons is its ease of use and how it shapes the network fabric.

Calico supports a broad range of platforms, including Kubernetes, OpenShift, Docker EE, OpenStack, and bare metal servers. The Calico node runs in a Docker container on the Kubernetes master node and on each Kubernetes worker node in the cluster. The calico-cni plugin integrates directly with the Kubernetes kubelet process on each node to discover which Pods are created and add them to Calico networking.

We will talk about installation, Calico modules (Felix, BIRD, and Confd), and routing modes.

What is not covered? Network policy — it deserves a separate article, so I am skipping it for now.

Topics — Part 2

  1. Requirements
  2. Modules and their functions
  3. Routing modes
  4. Installation (calico and calicoctl)

CNI Requirements

  1. Create a veth pair and move one end inside the container
  2. Identify the right POD CIDR
  3. Create a CNI configuration file
  4. Assign and manage IP addresses
  5. Add default routes inside the container
  6. Advertise the routes to all the peer nodes (not applicable for VXLAN)
  7. Add routes in the host server
  8. Enforce network policy

There are many other requirements too, but the above are the basic ones. Let’s take a look at the routing table in the master and worker nodes. Each node has a container with an IP address and a default container route.
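
The quickest way to see this picture is from the CLI; a minimal sketch (the Pod name is a placeholder, and the exact routes will differ per cluster):

```bash
# On the master and worker nodes: expect /32 routes via cali* interfaces for local Pods
# and a route for the remote node's Pod CIDR via the peer node
ip route

# Inside a container: a single default route via the link-local gateway
kubectl exec -it <pod-name> -- ip route
```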

Looking at the routing table, it is evident that the Pods can talk to each other over the L3 network, as the routes are in place. Which module is responsible for adding these routes, and how does it learn about the remote routes? Also, why is there a default route with the gateway 169.254.1.1? We will talk about that in a moment.

The core components of Calico are BIRD, Felix, ConfD, etcd, and the Kubernetes API server. The datastore is used to store the configuration information (IP pools, endpoint info, network policies, etc.). In our example, we will use Kubernetes as the Calico datastore.

BIRD (BGP)

BIRD is a per-node BGP daemon that exchanges route information with the BGP daemons running on other nodes. The common topology is a node-to-node mesh, where each BGP speaker peers with every other one.

For large-scale deployments, this can get messy. To reduce the number of BGP-to-BGP connections, certain BGP nodes can be configured as route reflectors, which take over the route propagation. Rather than each BGP speaker having to peer with every other BGP speaker within the AS, each one peers with a route reflector instead. Routing advertisements sent to the route reflector are then reflected out to all of the other BGP speakers. For more information, please refer to RFC 4456.

The BIRD instance is responsible for propagating the routes to the other BIRD instances. The default configuration is ‘BGP mesh,’ which is fine for small deployments. In large-scale deployments, it is recommended to use a route reflector to avoid scaling issues. There can be more than one RR for high availability. Also, external rack-level RRs can be used instead of BIRD.
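
As a rough sketch of what peering with an external route reflector can look like (the peer IP, AS number, and resource name are illustrative; the node-to-node mesh itself is toggled via the default BGPConfiguration resource):

```bash
# Create a global BGPPeer so every node peers with the external route reflector
cat > rr-peer.yaml <<EOF
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: rack1-rr
spec:
  peerIP: 192.168.1.250
  asNumber: 64512
EOF
calicoctl apply -f rr-peer.yaml
```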

ConfD

ConfD is a simple configuration management tool that runs in the Calico node container. It reads values (the BIRD configuration for Calico) from etcd and writes them to files on disk. It loops through the pools (networks and subnetworks) to apply the configuration data (CIDR keys) and assembles them in a way that BIRD can use. Whenever there is a change in the network, BIRD can then detect it and propagate the routes to the other nodes.

Felix

The Calico Felix daemon runs in the Calico node container and brings the solution together by taking several actions:

  • Reads information from the Kubernetes etcd
  • Builds the routing table
  • Configures the IPTables (kube-proxy mode IPTables)
  • Configures IPVS (kube-proxy mode IPVS)

Let’s look at the cluster with all Calico modules,

Does something look different? Yes, one end of the veth is dangling, not connected to anything; it sits in the kernel space.

How the packet gets routed to the peer node?

  1. A Pod on the master node tries to ping the IP address 10.0.2.11.
  2. The Pod sends an ARP request to the gateway.
  3. It gets the ARP response with the MAC address.
  4. Wait, who sent the ARP response?

What’s going on? How can a container route to an IP that doesn't exist? Let’s walk through what’s happening. Some of you reading this might have noticed that 169.254.1.1 is an IPv4 link-local address. The container has a default route pointing at this link-local address and expects it to be reachable on its directly connected interface, in this case the container's eth0. The container will therefore ARP for that IP address whenever it wants to send traffic out through the default route.

If we capture the ARP response, it will show the MAC address of the other end of the veth (cali123). So you might be wondering how on earth the host is replying to an ARP request for an IP address it doesn’t have on any interface. The answer is proxy ARP. If we check the host-side veth interface, we’ll see that proxy ARP is enabled.

“Proxy ARP is a technique by which a proxy device on a given network answers the ARP queries for an IP address that is not on that network. The proxy is aware of the location of the traffic’s destination, and offers its own MAC address as the (ostensibly final) destination.[1] The traffic directed to the proxy address is then typically routed by the proxy to the intended destination via another interface or via a tunnel. The process, which results in the node responding with its own MAC address to an ARP request for a different IP address for proxying purposes, is sometimes referred to as publishing”

Let’s take a closer look at the worker node,

Once the packet reaches the kernel, the kernel routes it based on the routing table entries.

Incoming traffic

  1. The packet reaches the worker node kernel.
  2. The kernel puts the packet onto cali123 based on the routing table.

Routing Modes

Calico supports 3 routing modes; in this section, we will see the pros and cons of each method and where we can use them.

  • IP-in-IP: default; encapsulated
  • Direct/NoEncapMode: unencapsulated (Preferred)
  • VXLAN: encapsulated (No BGP)

IP-in-IP (Default)

IP-in-IP is a simple form of encapsulation achieved by putting an IP packet inside another. A transmitted packet contains an outer header with host source and destination IPs and an inner header with pod source and destination IPs.

Azure doesn’t support IP-in-IP (as far as I know); therefore, we can’t use IP-in-IP in that environment. It’s better to disable IP-in-IP to get better performance.

NoEncapMode

In this mode, packets are sent as if they came directly from the pod. Since there is no encapsulation and de-encapsulation overhead, this mode is highly performant.

The source/destination check must be disabled in AWS to use this mode.

VXLAN

VXLAN routing is supported in Calico 3.7+.

VXLAN stands for Virtual Extensible LAN. VXLAN is an encapsulation technique in which Layer 2 Ethernet frames are encapsulated in UDP packets. VXLAN is a network virtualization technology. When devices communicate within a software-defined datacenter, a VXLAN tunnel is set up between those devices. Those tunnels can be set up on both physical and virtual switches. The switch ports are known as VXLAN Tunnel Endpoints (VTEPs) and are responsible for the encapsulation and de-encapsulation of VXLAN packets. Devices without VXLAN support are connected to a switch with VTEP functionality, and the switch provides the conversion from and to VXLAN.

VXLAN is great for networks that do not support IP-in-IP, such as Azure or any other DC that doesn’t support BGP.

Demo — IPIP and NoEncapMode

Check the cluster state before the Calico installation.
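
A minimal check (output omitted):

```bash
# Nodes stay NotReady until a CNI plugin is installed
kubectl get nodes -o wide

# Only control-plane Pods run at this point; CoreDNS stays Pending without a CNI
kubectl get pods --all-namespaces -o wide
```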

Check the CNI bin and conf directories. There won’t be any configuration files or Calico binaries yet, as the Calico installation will populate these via a volume mount.
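
For example:

```bash
# No Calico binaries or CNI configuration yet
ls /opt/cni/bin
ls /etc/cni/net.d
```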

Check the IP routes in the master/worker node.

Download and apply the calico.yaml based on your environment.
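
Something like the following; the manifest URL is illustrative, so pick the one matching your Calico version and datastore:

```bash
curl -O https://docs.projectcalico.org/manifests/calico.yaml
kubectl apply -f calico.yaml
```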

Let’s take a look at some useful configuration parameters,
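
The parameter names below come from the upstream calico.yaml; the exact layout varies by version:

```bash
# Pod CIDR, IPIP/VXLAN mode, and the Calico backend (bird vs. vxlan)
grep -nE 'CALICO_IPV4POOL_(CIDR|IPIP|VXLAN)|calico_backend' calico.yaml
```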

Check POD and Node status after the calico installation.
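
For example:

```bash
# calico-node (DaemonSet) and calico-kube-controllers should be Running,
# and the nodes should move to Ready
kubectl get pods -n kube-system -o wide
kubectl get nodes
```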

Explore the CNI configuration as that’s what Kubelet needs to set up the network.
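
The file name below is what the Calico manifest installs; it may differ slightly per version:

```bash
cat /etc/cni/net.d/10-calico.conflist
```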

Check the CNI binary files,
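
For example:

```bash
# The calico and calico-ipam binaries should now be present
ls /opt/cni/bin
```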

Let’s install calicoctl, which gives detailed information about Calico and lets us modify the Calico configuration.
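
A sketch of the install; the release URL and version are illustrative:

```bash
curl -L -o /usr/local/bin/calicoctl https://github.com/projectcalico/calicoctl/releases/download/v3.x.y/calicoctl
chmod +x /usr/local/bin/calicoctl

# Point calicoctl at the Kubernetes datastore
export DATASTORE_TYPE=kubernetes
export KUBECONFIG=~/.kube/config
```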

Check BGP peer status. This will show the ‘worker’ node as a peer.
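
For example:

```bash
# Talks to the local BIRD instance, so it usually needs root
sudo calicoctl node status
```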

Create a busybox POD with two replicas and master node toleration.
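
A minimal manifest for this; the names are illustrative, and on newer clusters the taint key may be node-role.kubernetes.io/control-plane instead of master:

```bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: busybox
        image: busybox
        command: ["sleep", "3600"]
EOF
```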

Get Pod and endpoint status,
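
For example:

```bash
kubectl get pods -o wide

# Calico's view of the same Pods: one WorkloadEndpoint per Pod interface
calicoctl get workloadendpoints
```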

Get the details of the host side veth peer of master node busybox POD.
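
The interface name below (cali9861acf9f07) is the one used in this example; yours will differ:

```bash
# List the host-side cali* interfaces
ip link show | grep cali

# For a veth, ethtool reports the ifindex of its peer inside the Pod
ethtool -S cali9861acf9f07 | grep peer_ifindex
```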

Get the details of the master Pod’s interface,
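
For example (the Pod name is a placeholder):

```bash
# eth0 is the Pod end of the veth; the default route points at the link-local gateway
kubectl exec -it <busybox-pod-on-master> -- ip addr show eth0
kubectl exec -it <busybox-pod-on-master> -- ip route
```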

Get the master node routes,
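
For example:

```bash
# Expect a blackhole route for the locally assigned Pod CIDR block, /32 routes via
# cali* interfaces for local Pods, and the remote node's Pod CIDR via tunl0 (IPIP mode)
ip route
```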

Let’s try to ping the worker node Pod to trigger ARP.
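
The Pod name is a placeholder; 10.0.2.11 is the worker-side Pod IP used in this walkthrough:

```bash
kubectl exec -it <busybox-pod-on-master> -- ping -c 1 10.0.2.11

# The resolved gateway MAC belongs to the host side of the veth
kubectl exec -it <busybox-pod-on-master> -- arp -n
```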

The MAC address of the gateway is nothing but that of cali9861acf9f07. From now on, whenever traffic goes out, it directly hits the kernel; and the kernel knows that it has to write the packet to tunl0, based on the IP route.

Proxy ARP configuration,
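
Using the example interface name from above:

```bash
# 1 means the host answers ARP requests on behalf of addresses it doesn't own (169.254.1.1 here)
cat /proc/sys/net/ipv4/conf/cali9861acf9f07/proxy_arp
```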

How does the destination node handle the packet?

Upon receiving the packet, the kernel sends it to the right veth based on the routing table.

We can see the IP-in-IP protocol on the wire if we capture the packets. As mentioned earlier, Azure doesn’t support IP-in-IP, and disabling it gives better performance anyway. Let’s try to disable it and see the effect.
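
IPIP is IP protocol 4, so capturing on the node's uplink (the interface name is a placeholder) while pinging across nodes shows the encapsulated traffic:

```bash
sudo tcpdump -n -i <node-interface> ip proto 4
```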

Disable IP-IP

Update the ipPool configuration to disable IPIP.

Open ippool.yaml, set ipipMode to ‘Never,’ and apply the yaml via calicoctl.
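
A sketch of the steps; the default pool is usually named default-ipv4-ippool:

```bash
calicoctl get ippool default-ipv4-ippool -o yaml > ippool.yaml

# Edit ippool.yaml: set spec.ipipMode to Never (leave vxlanMode as Never)
calicoctl apply -f ippool.yaml
```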

Recheck the IP route,
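
For example:

```bash
# The remote Pod CIDR should now be routed via the node's management interface, not tunl0
ip route
```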

The device is no longer tunl0; it is now the management interface of the master node.

Let’s ping the worker node POD and make sure everything works fine. From now on, there won’t be any IPIP protocol involved.

Note: The source/destination check should be disabled in an AWS environment to use this mode.

Demo — VXLAN

Re-initiate the cluster and download the calico.yaml file to apply the following changes,

  1. Remove bird from the livenessProbe and readinessProbe
  2. Change calico_backend to ‘vxlan,’ as we don’t need BGP anymore
  3. Disable IPIP
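
After editing, the relevant pieces of calico.yaml should read roughly as follows (the field names come from the upstream manifest; the layout varies slightly between Calico versions):

```bash
# calico-config ConfigMap:
#   calico_backend: "vxlan"
#
# calico-node environment:
#   - name: CALICO_IPV4POOL_IPIP
#     value: "Never"
#   - name: CALICO_IPV4POOL_VXLAN
#     value: "Always"
#
# and the calico-node liveness/readiness probes no longer pass -bird-live / -bird-ready.
grep -nE 'calico_backend|CALICO_IPV4POOL_(IPIP|VXLAN)|bird-(live|ready)' calico.yaml
```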

Let’s apply this new yaml.

Get the POD status,

Ping the worker node POD from the master node POD.

Trigger the ARP request,

The concept is the same as in the previous modes; the only difference is that the packet reaches the vxlan.calico device, which encapsulates the packet with the node IP and its MAC address in the outer header and sends it out. Also, the VXLAN protocol uses UDP port 4789. The datastore helps here to get the details of the available nodes and their supported IP ranges, so that vxlan.calico can build the packet.
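
A couple of quick checks (the uplink interface name is a placeholder):

```bash
# Calico creates a vxlan.calico device on each node for the VXLAN tunnel
ip -d link show vxlan.calico

# VXLAN rides on UDP port 4789
sudo tcpdump -n -i <node-interface> udp port 4789
```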

Note: VXLAN mode needs more processing power than the previous modes.

Disclaimer

This article does not provide any technical advice or recommendation; anything that sounds like one is my personal view, not that of the company I work for.

References