Migrating applications between Kubernetes clusters

Luca Prete
Google Cloud - Community
8 min read · Jan 11, 2021


TL;DR: Gradually migrating applications between Kubernetes clusters without affecting intra-cluster traffic can be challenging. This article proposes an approach that makes use of standard but powerful tools.

Raise your hand if you've never had to migrate applications between two Kubernetes clusters!

Easy, right? Well, not really. Not always.

It can happen for a few reasons: a major platform upgrade, for example, or a broad resource reorganisation.

While in principle this seems like a straightforward task, the process gets more complex as dependencies between applications increase and requirements for lower downtime during the migration arise.

A first approach may consist of recreating the same environment in the new cluster and, once the new deployment is ready, switching any external DNS records. Sadly, this is not always possible.

A simple use-case will give us a better understanding of the issue and how it can be solved.

Let’s imagine a Kubernetes cluster running two applications; one depending on the other. They are owned by different teams, who can migrate them at different paces.

How would you migrate the applications gradually, avoiding downtime?

Before moving forward, let's point out a couple of fundamental, firm assumptions. They are not strictly related to the migration, but they will definitely lower the overall cluster maintenance effort and streamline the migration process:

  • Applications should always reference both internal and external services using DNS names, rather than IP addresses
  • It’s important to have a clear view of the relationships between our applications. It will help us to proactively prepare the migration and react in time to possible failures.
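
As a trivial illustration of the first point, a client manifest should reference its dependency through the service's DNS name rather than a hard-coded IP. A hypothetical container snippet (names and addresses are placeholders):

# Good: the dependency is resolved through its in-cluster DNS name
env:
- name: BACKEND_URL
  value: "http://app1.app1-ns.svc.cluster.local"
# Bad: a hard-coded ClusterIP breaks as soon as the service is recreated
# value: "http://10.72.1.15"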

Now that we've stated the basic rules of the game, let's get back to our example.

How can we move one application to the new cluster so that the other doesn't even notice its dependency has been migrated, and avoid malfunctions?

When I started scratching the surface of the problem, terms like multi-cluster, cluster federation and service mesh popped into my mind.

There are a bunch of tools out there to build multi-cluster environments, create cluster federations and efficiently manage “microservices networks”. Istio is just one of many.

Great, it seems we made it! …Well, not really.

Most of the multi-cluster and federation tools come with an important caveat: the two Kubernetes clusters need to run a very similar, if not the same, version of Kubernetes, and this is definitely not always the case. Moreover, many users may find these tools overkill and something risky to inject into their systems for the sole sake of a migration.

This article follows a more conservative approach, leveraging a set of basic but powerful tools, such as DNS, Kubernetes services and, optionally, an ingress controller.

My Environment

I’ll use Google Cloud Platform (GCP) for the demonstration. This will include a couple of Kubernetes clusters (together with some load balancers, provisioned as we create our services), and Cloud DNS.

I will also make use of Traefik as an ingress controller. Although it is not mandatory, it's strongly recommended, as it avoids creating a dedicated LoadBalancer service (and the related internal GCP load balancer) for each application to be migrated. Things will get clearer as we go through the steps. Anyway, keep in mind you can optionally substitute Traefik with your favorite ingress controller.

Demo Time!

I’ve prepared two empty clusters: cluster-old and cluster-new.

We'll start by deploying two applications in the old cluster: app1 and app2. To make the experiment somewhat more intriguing, the two apps will be deployed in different namespaces: app1-ns and app2-ns respectively.

Each application is composed of a pod and a namesake ClusterIP service that references it.

App1 is a dumb HTTP server that returns the message "Hello, I'm app1" at its root, on port 80.

Here is the Kubernetes manifest I’ve used to create the pod:

# app1-pod.yaml
kind: Pod
apiVersion: v1
metadata:
  name: app1
  labels:
    app: app1
spec:
  containers:
  - name: app1
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=Hello, I'm app1"
    - "-listen=:80"

And here is the one for the service:

# app1-svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: app1
spec:
  selector:
    app: app1
  ports:
  - port: 80

App2 is a very simple client: it queries app1 every second using curl. Since it writes the replies it receives to standard output, we'll be able to see them by querying the container logs.

# app2-pod.yaml
kind: Pod
apiVersion: v1
metadata:
  name: app2
  labels:
    app: app2
spec:
  containers:
  - name: app2
    image: curlimages/curl:7.74.0
    command: ["/bin/sh", "-c"]
    args:
    - >
      while true; do
        curl -s -X GET http://app1.app1-ns;
        sleep 1;
      done

Let’s create the namespaces and deploy the applications.

kubectl create namespace app1-ns
kubectl apply -f app1-pod.yaml --namespace app1-ns
kubectl apply -f app1-svc.yaml --namespace app1-ns
kubectl create namespace app2-ns
kubectl apply -f app2-pod.yaml --namespace app2-ns
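
Before touching anything else, it's worth checking that the baseline works: tailing the app2 logs should print a reply from app1 every second, roughly like this:

kubectl logs -f app2 -n app2-ns

Hello, I'm app1
Hello, I'm app1
...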

The goal is to migrate these applications from the old cluster to the new cluster, where only the two empty namespaces have been created.

Needless to say, the most critical part is migrating app1 so that app2 doesn't even notice it has been moved.

At a high level, the idea is to create a new service in the old cluster that, instead of pointing to the local app1 application, links (through an external DNS entry) to a clone of it living in the new cluster.

Along the way, we'll always be able to test whether the newly created components work before modifying any existing routing.

Let’s start implementing the machinery, making sure the mechanism works within the same cluster first: app2 will soon communicate with app1 through the external DNS.

Installing Traefik is the first step. I won't go into details, since this is not the goal of the article. I've simply followed the official Helm installation guide and added an annotation to create a GCP internal load balancer instead of the default external one, which isn't really needed for this experiment.

kubectl create namespace traefik

helm repo add traefik https://helm.traefik.io/traefik
helm repo update
helm install \
  --set service.annotations."cloud\.google\.com/load-balancer-type"=Internal \
  --namespace traefik \
  traefik \
  traefik/traefik

In a few seconds you should see the private IP address allocated for the Traefik LoadBalancer service.

kubectl get services -n traefik

NAME      TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)                      AGE
traefik   LoadBalancer   10.72.4.210   192.168.100.16   80:31679/TCP,443:31228/TCP   45s

Make a note of it. We’ll need it soon.
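
If you prefer not to copy it by hand, you can also grab it with a jsonpath query (the variable name is just a convenience I'm assuming here):

TRAEFIK_OLD_IP=$(kubectl get service traefik -n traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $TRAEFIK_OLD_IP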

Moving to Cloud DNS, create a private DNS zone, for example mycompany.internal.

Creating a DNS private zone in Cloud DNS

Create an A record, old.mycompany.internal, pointing to the LoadBalancer IP just allocated.
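
If you'd rather script this part, a rough gcloud equivalent (with a reasonably recent gcloud) could look like the following; the zone name mycompany-internal and the VPC name my-vpc are assumptions, so adapt them to your project:

gcloud dns managed-zones create mycompany-internal \
  --description="Private zone used for the migration" \
  --dns-name=mycompany.internal. \
  --visibility=private \
  --networks=my-vpc

gcloud dns record-sets create old.mycompany.internal. \
  --zone=mycompany-internal \
  --type=A --ttl=300 \
  --rrdatas=192.168.100.16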

Once finished, your DNS panel should look as follows.

In the old cluster:

Make a copy of your app1 service. Call it, for example, app1-internal.

# app1-internal-svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: app1-internal
spec:
  selector:
    app: app1
  ports:
  - port: 80

Create a Traefik IngressRoute.

# app1-ingress-route.yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: app1
spec:
  entryPoints:
  - web
  routes:
  - kind: Rule
    match: Host(`app1.app1-ns`) || Host(`app1.app1-ns.svc.cluster.local`)
    services:
    - kind: Service
      name: app1-internal
      namespace: app1-ns
      port: 80
Notice that we match on either of the destination hostnames app1.app1-ns or app1.app1-ns.svc.cluster.local. This is because, regardless of the routing machinery we put in place, requests will still reach app1 carrying the original destination Host header.

Let’s deploy the two components:

kubectl apply -f app1-internal-svc.yaml --namespace app1-ns
kubectl apply -f app1-ingress-route.yaml --namespace app1-ns

It's time to verify that our application is reachable through the new path. To do so, I exec'd into the app2 client and manually curled app1 through the new address.

kubectl exec -it -n app2-ns app2 -- /bin/sh

curl -H "Host: app1.app1-ns" http://old.mycompany.internal
Hello, I'm app1

Notice that I specify a Host header. Without it, you would receive a "404 page not found" reply from Traefik, since no router matches the request.
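
For comparison, the same request without the header should get Traefik's default reply, roughly:

curl http://old.mycompany.internal
404 page not found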

Create an ExternalName service, pointing to old.mycompany.internal.

# app1-ext-old-svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: app1
spec:
  externalName: old.mycompany.internal
  type: ExternalName

Finally, replace the original app1 service in app1-ns with the one just created.

kubectl replace -f app1-ext-old-svc.yaml -n app1-ns

During the experiment, I kept the app2 logs streaming, and I never noticed app1 stop answering.

App1 is still living in the old cluster, but from now on, client requests go to the ExternalName service, out of the cluster to Cloud DNS, back into Traefik, through the new app1-internal service, and finally to the pod.
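
A quick way to double-check the switch is to look at the service: it should now show up as an ExternalName pointing at the DNS record (output abridged; your values will differ):

kubectl get service app1 -n app1-ns

NAME   TYPE           CLUSTER-IP   EXTERNAL-IP              PORT(S)   AGE
app1   ExternalName   <none>       old.mycompany.internal   <none>    30s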

The Migration

Now that the mechanism we had in mind works, we are ready for the migration.

Let's set up the new cluster. As we did in the old one (see the command sketch after this list):

  • Deploy app1 and app2 in their namespaces (same commands as above)
  • Deploy Traefik (same commands as above). Get the new internal load balancer IP
  • Create another A record in mycompany.internal. Call it new.mycompany.internal and point it to the new cluster's ingress IP
  • Deploy the same Traefik IngressRoute deployed in cluster-old
  • Deploy the app1-internal service in the app1-ns namespace (same commands as above)
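
A condensed sketch of those steps, assuming a kubectl context called cluster-new, the zone name used earlier and the same manifest files (adapt names to your setup):

kubectl config use-context cluster-new

kubectl create namespace app1-ns
kubectl apply -f app1-pod.yaml --namespace app1-ns
kubectl apply -f app1-svc.yaml --namespace app1-ns
kubectl create namespace app2-ns
kubectl apply -f app2-pod.yaml --namespace app2-ns

kubectl create namespace traefik
helm install \
  --set service.annotations."cloud\.google\.com/load-balancer-type"=Internal \
  --namespace traefik \
  traefik \
  traefik/traefik

# note the new Traefik LoadBalancer IP, then create the DNS record
gcloud dns record-sets create new.mycompany.internal. \
  --zone=mycompany-internal --type=A --ttl=300 --rrdatas=<NEW_TRAEFIK_IP>

kubectl apply -f app1-internal-svc.yaml --namespace app1-ns
kubectl apply -f app1-ingress-route.yaml --namespace app1-ns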

Create a new ExternalName service pointing to the DNS name just created.

# app1-ext-new-svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: app1
spec:
  externalName: new.mycompany.internal
  type: ExternalName

Finally, let’s replace the old ExternalName service in the old cluster with the one just created:

kubectl replace -f app1-ext-new-svc.yaml -n app1-ns

With basically no downtime, your app2 client in the old cluster will start communicating with the app1 application living in the new cluster.

App2 in the old cluster communicates with app1 in the new cluster

Notice that the process was completely transparent for app2, as we didn't change any reference to app1 in it.

We can now repeat the same process for app2, thus completing the migration.

App2 has been moved to the new cluster. It's now able to reference app1 as it did before the migration

Once the migration is complete:

  • The app1 service can be converted back to a ClusterIP service
  • The old cluster can be removed
  • The Cloud DNS zone, Traefik and the app1-internal service in the new cluster can be deleted

The components used for the migration and the old cluster are removed. The migration is complete.
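
For reference, a rough sketch of those cleanup steps, reusing the release, file and zone names assumed throughout this article (zone/region flags are omitted for brevity):

# In the new cluster: make sure app1 is exposed through a plain ClusterIP service again
kubectl apply -f app1-svc.yaml --namespace app1-ns

# Remove the migration helpers that are no longer needed
kubectl delete ingressroute app1 --namespace app1-ns
kubectl delete service app1-internal --namespace app1-ns
helm uninstall traefik --namespace traefik

# Delete the private DNS records and zone
gcloud dns record-sets delete old.mycompany.internal. --zone=mycompany-internal --type=A
gcloud dns record-sets delete new.mycompany.internal. --zone=mycompany-internal --type=A
gcloud dns managed-zones delete mycompany-internal

# Finally, delete the old cluster
gcloud container clusters delete cluster-old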

Although this is a minimal setup, it should be fairly easy to extend the same process to larger deployments and possibly automate it to be applied at scale. But this is another topic!

Enjoy!

Thank you Ludovico Magnocavallo for sharing your ideas and helping me with this article!
