As part of our Kubernetes Subscription offering, the assigned CRE (Customer Reliability Engineer) will carry out Proofs of Concept to validate and develop projects that your team can implement against your Kubernetes cluster. One of our Subscription customers, Sky Betting and Gaming, tasked us with investigating whether it was possible to migrate the CNI solution for a Kubernetes cluster from Canal to Cilium, live.
In this post we’ll discuss why one might want to change Kubernetes CNIs, what I have learnt developing a solution for live migration, and how it all works.
What is Kubernetes CNI, and why change it?
Container Network Interface (CNI) is a big topic, but in short, CNI is a set of specifications that define an interface used by container orchestrators to set up networking between containers. In the Kubernetes space, the Kubelet is responsible for calling the CNI installed on the cluster so that Pods are attached to the Kubernetes cluster network during creation, and their resources are properly released during deletion. CNIs can also be responsible for more advanced features than just setting up routes in the cluster, such as network policy enforcement, encryption, load balancing etc.
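To make this concrete, a CNI plugin on a node is selected and configured through a small JSON file. A minimal sketch of a Flannel-style config follows; the network name and values here are illustrative, not taken from a real cluster:

```json
{
  "cniVersion": "0.3.1",
  "name": "k8s-pod-network",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}
```

The Kubelet reads a file like this and invokes the named plugin binary whenever a Pod is created or deleted.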
There are many implementations of CNI for various use cases, each with their own advantages and disadvantages. Flannel, one such implementation backed by iptables, is perhaps the simplest and most popular in Kubernetes. Flannel is solely concerned with setting up routing between Pods in the Kubernetes cluster, which it achieves by creating an overlay network using the Virtual Extensible LAN (VXLAN) protocol. Another project, Calico, can be run as a standalone CNI solution or on top of Flannel (a combination called Canal) to provide network policy enforcement.
Cilium is another CNI solution, based on eBPF, and designed to run at large scale. Whilst Cilium implements the standard Kubernetes NetworkPolicies, it is able to utilize the full packet introspection of eBPF, giving it first class support for Layer 7 policy across a number of protocols, custom extensions using Envoy, a wide range of options for endpoint selection, as well as rich network monitoring and introspection. For these reasons, Cilium is a very favorable choice for running Kubernetes at scale with complex network policy requirements.
Since CNI underpins the entire network running on Kubernetes, it would seem that the only solution for changing a CNI is to take the entire cluster down, replace it, and bring up all workloads again on the new CNI. This of course causes downtime, or at the least, requires a full cluster migration. For some companies this might be unacceptable. So, how about a live migration instead?
Designing a live Kubernetes CNI migration
When a Pod is created, the installed CNI is called to attach a network interface to the Pod. All going well, this network interface joins the Pod to the cluster network. The network interface shares the same life cycle as the Pod, meaning that if the CNI were swapped out, only newly created Pods would be picked up by the new CNI. This means that every Pod on the cluster that is part of the cluster network must be recycled in order to become a member of the new network.
A naive approach to migrating a Kubernetes CNI would be to gradually roll each node on the cluster, replacing the CNI as each node is brought back up. This is, however, not possible: the second CNI will be installed with a separate CIDR range to that of the currently installed CNI, and neither has knowledge of the other's network range. Not to mention the use of different encapsulation protocols, or broken network policy since identity is lost. This can be demonstrated below, where Pods on Node 1 will be unable to communicate with Pods on Node 2, and vice versa.
Step 0: Single CNI installed on the cluster.
Step 1: Roll out the second CNI alongside the current. All Pods communicate over the current CNI.
Step 2: Both CNIs installed on all nodes, and Pods can communicate on either CNI.
Step 3: Peel away the first CNI. Pods can communicate on the new CNI if the first is unavailable at the source or destination Pod.
Step 4: First CNI is completely removed. All communication is done over the new CNI.
This strategy maintains all network policy throughout the entire migration, and ensures no network downtime.
The caveat to this strategy is that all workloads on the cluster will need to be rolled several times. This should be acceptable: workloads running on Kubernetes are expected to be resilient to service disruption and rescheduling, and with proper disruption budgets and probes, a conservative roll of the cluster at each migration step should ensure that the entire cluster remains healthy at all times.
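For instance, a PodDisruptionBudget along these lines (the names are illustrative) keeps a minimum number of replicas available while nodes are drained during each roll:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 2        # never evict below two ready replicas
  selector:
    matchLabels:
      app: my-app
```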
Implementing the Kubernetes CNI migration
The end result of the project is a CLI tool that runs the migration from start to finish, allowing configuration of which steps to run, and ensuring cluster network health throughout the process. This is all configured using a config file and flags.
The CLI has support for running in a 'dry run' mode, and will ensure all previous steps have been completed before continuing. The full migration consists of six steps, which I will run through below.
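As a sketch of how such a tool is driven (the binary name, flags and step numbers here are hypothetical, not the tool's exact interface):

```
$ cni-migration --config config.yaml --dry-run   # validate, making no changes
$ cni-migration --config config.yaml --step 2    # run the migration up to step 2
```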
Step 0: Preflight
Before, during, and after the migration, we need to check that the cluster has full network connectivity in order to ensure there is no downtime. To do this, I decided to write a small service called knet-stress that regularly connects to the other knet-stress services on the cluster. It does this by looking up a Kubernetes Service, and sending an HTTP GET request to every endpoint IP listed, including its own.
By running this as two DaemonSets under the same Service, we can cover Pod to Pod communication both between nodes and between Pods on the same node. The API lookup is also a good sanity check that the API server is still reachable.
During the migration, we exec into each Pod and run a manual status check, with the end result being that we have checked the bidirectional health of the network across the entire cluster.
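As an illustration, the check can be driven from outside like this (the label selector and the subcommand are assumptions, not necessarily knet-stress's real interface):

```
# Exec into every knet-stress Pod and trigger a one-shot connectivity check.
for pod in $(kubectl get pods -l app=knet-stress -o name); do
  kubectl exec "$pod" -- knet-stress status
done
```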
We run knet-stress before and after every migration step, and it is our source of truth for network connectivity.
Step 1: Prepare
This step is for installing all our dependent resources, and labeling nodes.
The migration will swap out multiple DaemonSets at various steps, so we use Node Labels and Selectors to control the scheduling. Not only is this a good way for us to leave scheduling to Kubernetes, but it's also a convenient way for us to observe where we are in the migration through a simple $ kubectl get nodes. We also make sure that we patch the first CNI, Canal, to have a node selector that we can use to uninstall it at a later stage.
We define each label to use in the config, though the defaults are fine to leave as is.
When a node's label is changed, Kubernetes will de-schedule any of the DaemonSets which don't match the selector, and schedule those that do. We can then simply wait for the underlying Pods to become ready.
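Putting that together, progressing a node through the migration looks roughly like this (the label key and value are hypothetical; the real ones are defined in the config file):

```
$ kubectl label node node-1 --overwrite cni-migration/state=prepared
$ kubectl get nodes -L cni-migration/state    # watch migration progress per node
```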
Step 2: Roll all Nodes
The first challenge faced as part of the migration is how multiple CNIs can be installed at the same time whilst both servicing the same Pod. Luckily there is a project by Intel, multus-cni, which does exactly that. Multus is installed like most CNIs, as a DaemonSet, and acts like a middleman whereby calls made by the Kubelet are forwarded to each of the configured CNIs underneath, each setting up a separate network interface on the Pod. In practice, if multiple CNIs are installed and configured, each newly created Pod will have a 'master' network interface, as well as a secondary network interface for each of the extra CNIs (and of course the loopback device). The master network interface is what is advertised to Kubernetes and is what is reported as the Pod's IP, used in Services etc. Although the secondary Pod IP is not directly advertised to Kubernetes, it is still routable by other containers on that secondary network. This will become important later.
Multus is easy to set up; all we need to do is provide it with configuration for which CNIs we want installed. It is also possible to configure Multus using a CRD, though a blanket cluster-wide configuration is more useful in our case.
In this config, we have defined k8s-pod-network to be the master CNI network (the name of the CNI config for Canal), and it is the network that is advertised to Kubernetes. Cilium is our second CNI to be called, and this config applies to all namespaces.
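A sketch of what this Multus configuration can look like, using Multus's clusterNetwork/defaultNetworks options; the kubeconfig path and the Cilium network name are assumptions:

```json
{
  "cniVersion": "0.3.1",
  "name": "multus-cni-network",
  "type": "multus",
  "kubeconfig": "/etc/kubernetes/kubeconfig",
  "clusterNetwork": "k8s-pod-network",
  "defaultNetworks": ["cilium"]
}
```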
When CNIs are installed on a cluster, they naturally run as a DaemonSet to set up networking on every node. With this, they also typically write out their executable to disk, as well as their configuration, to a well known directory location (typically somewhere like /etc/kubernetes/cni/net.d/). The Kubelet looks at this well known directory to determine which executable to run, and what parameters to pass it.
Every time a new Pod is created, the Kubelet will list the files in the well known directory, and use the configuration which is alphabetically first in the list. For example, 00-multus.conflist is the CNI configuration that the Kubelet will use in the following listing:
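An illustrative listing (the file names other than 00-multus.conflist are examples):

```
$ ls /etc/kubernetes/cni/net.d/
00-multus.conflist  10-canal.conflist  20-cilium.conf
```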
This strategy means that we can configure which CNI takes precedence by controlling the file locations that the CNI configuration is written to. Both Multus, and us during the migration, take advantage of this.
Cilium manages the identity of endpoints in order to make routing and policy decisions. One issue that we ran into was Cilium losing the source identity of requests when the request was being sent from a non-default, secondary network interface of the Pod. Cilium would interpret the source of that request as having the identity reserved:world, which would then get dropped, never reaching the Pod. Although I didn't track down the source of the issue, I was able to mitigate it by using the SBR (Source Based Routing) CNI meta plugin.
The SBR CNI meta plugin causes the Pod's default route to be overridden to point at the Cilium network interface. This ended up fixing the issue, and Cilium was able to correctly determine the source endpoint identity.
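In CNI terms, sbr is chained after the Cilium plugin in the config's plugin list; a minimal sketch, with surrounding fields elided:

```json
{
  "cniVersion": "0.3.1",
  "name": "cilium",
  "plugins": [
    { "type": "cilium-cni" },
    { "type": "sbr" }
  ]
}
```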
Flannel and Cilium both support, and default to, VXLAN as the tunneling protocol to facilitate routing cluster traffic. VXLAN has support for running multiple networks on the same machines, separated by the VXLAN Network Identifier (VNI), however I was not able to find an option for configuring this on either Cilium or Flannel. This caused a conflict and prevented Cilium from starting at all.
To fix this, I decided to configure Cilium to use GENEVE (Generic Network Virtualisation Encapsulation) instead. GENEVE is a newer encapsulation protocol designed to supersede VXLAN. This seemed like a fine compromise and I didn’t encounter any problems going this route.
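With Cilium releases of the time, this was a single setting in the cilium-config ConfigMap; key names vary across Cilium versions, so treat this sketch as an assumption for your version:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  tunnel: "geneve"   # use GENEVE instead of the default VXLAN
```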
Now that we have Multus installed and given it precedence as the primary CNI, we require all Pods which are part of the cluster network to be rolled. As described earlier, this is a required step since all Pods need to be recreated in order to have both CNI network interfaces attached.
Step 3: Change CNI Priority
This stage involves changing the CNI priority for all nodes so that Cilium becomes the primary. To achieve this, we install a new Multus DaemonSet that uses a new config which has the Cilium CNI as the primary, and Canal as the secondary.
To do this, we taint and drain the node, relabel it so that a new Multus Pod is scheduled using the updated configuration, and then untaint the node. This causes all Pods on that node to be rescheduled; they are still members of both CNIs, but their advertised IP is now from Cilium's network. We repeat this process for all nodes in the cluster, ensuring that there is total network connectivity throughout.
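Per node, the sequence looks roughly like this (the taint and label keys are hypothetical):

```
$ kubectl taint node node-1 cni-migration=true:NoExecute
$ kubectl drain node-1 --ignore-daemonsets
$ kubectl label node node-1 --overwrite cni-migration/priority=cilium
$ kubectl taint node node-1 cni-migration=true:NoExecute-   # remove the taint
$ kubectl uncordon node-1
```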
The end result of this stage is that both CNIs are installed on all nodes, and each Pod has two network interfaces attached, however the primary and advertised Pod IP address now comes from the Cilium network.
Step 4: Migration
With all Pods now using the Cilium network as their primary, the Canal network is not being used at all and so can be safely uninstalled everywhere. We repeat a similar process as before: tainting and draining a node, relabelling it so that the old Canal and Cilium CNIs are uninstalled, installing a duplicated Cilium CNI that selects the new node label, and bringing the node back up.
Repeating this process for every node, we finally have all Pods using the single remaining CNI installation, Cilium.
Step 5: Clean Up
This step simply removes all the resources no longer needed for the migration: the old Canal installation and the unscheduled Multus and Cilium DaemonSets. It also patches the installed Cilium DaemonSet to no longer have the migrated node label selector.
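The final patch can be expressed as a JSON patch along these lines (assuming the migration selector was the only nodeSelector entry on the DaemonSet):

```
$ kubectl -n kube-system patch daemonset cilium --type json \
    -p '[{"op": "remove", "path": "/spec/template/spec/nodeSelector"}]'
```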
All going to plan, we have successfully migrated the Kubernetes CNI installation from Canal to Cilium, live.
This was a really interesting challenge to work on and complete as part of Subscription with Sky Betting and Gaming. If you would like to find out more about our Subscription offering and how it could help you, you can visit the page here and leave us a message.