Breaking an RKE cluster in one easy step

With the release of Ubuntu’s latest LTS (22.04, or “Jammy Jellyfish”), I wanted to upgrade my Kubernetes nodes from 20.04 to 22.04. What I had hoped would be an easy endeavor turned out to be a weeks-long process with destroyed clusters and, ultimately, an etcd issue.

The Hypothesis

As I viewed it, I had two paths to this upgrade: an in-place upgrade on each node, or bringing up new nodes and decommissioning the old ones. As the latter represents the “Yellowstone” approach, I chose that one. My plan seemed simple:

  • Spin up new Ubuntu 22.04 nodes using Packer.
  • Add the new nodes to the existing clusters, assigning the new nodes all the necessary roles (I usually have one controlplane node, three etcd nodes, and every node is a worker); see the cluster.yml sketch after this list.
  • Remove the controlplane role from the old node and verify connectivity.
  • Remove the old nodes (cordon, drain, and remove).
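
In RKE terms, each of these steps is just an edit to cluster.yml followed by a run of the RKE CLI. As a reference point, here is a minimal sketch of what the nodes section looked like with a couple of the new 22.04 nodes added; the addresses, hostnames, and SSH user are placeholders, not my actual values:

    nodes:
      # ...the existing 20.04 nodes stay in the list for now...
      - address: 10.0.0.21                 # new 22.04 node (placeholder IP)
        hostname_override: k8s-new-01
        user: ubuntu                       # SSH user baked into the Packer image
        role: [controlplane, etcd, worker]
      - address: 10.0.0.22
        hostname_override: k8s-new-02
        user: ubuntu
        role: [etcd, worker]

A run of rke up --config cluster.yml then provisions the new nodes and joins them to the cluster.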

Internal Cluster

After updating my Packer scripts for 22.04, I spun up new nodes for my internal cluster, which runs an ELK stack for log collection. I added the new nodes without a problem, and thought that maybe I could combine the last two steps and just remove all the old nodes at the same time.

That ended up with the RKE CLI getting stuck checking etcd health. I may have gotten a little impatient and killed the RKE CLI process mid-job. This left me with, well, a dead internal cluster. So I recovered the cluster (see my note on cluster recovery below) and thought I’d try again with my non-production cluster.

Non-production Cluster

Some say that the definition of insanity is doing the same thing and expecting a different result. My logic, however, was that I made two mistakes the first time through:

  • Trying to remove the controlplane role alongside the etcd nodes in the same step
  • Killing the RKE CLI command mid-stream

So I spun up a few new non-production nodes, added them to the cluster, and simply removed the controlplane role from the old node.
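
For clarity, “removing controlplane from the old node” just means trimming that node’s role list in cluster.yml and running rke up again. Roughly, with placeholder names:

    - address: 10.0.0.11              # old 20.04 node (placeholder IP)
      hostname_override: k8s-old-01
      user: ubuntu
      role: [etcd, worker]            # was [controlplane, etcd, worker]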

Success! My new controlplane node took over, and cluster health looked good. And, in the interest of only changing one variable at a time, I decided to try to remove just one old node from the cluster.

Kaboom….

Same issue with the etcd volume. So I recovered the cluster and returned to the drawing board.

Testing

At this point, I only had my Rancher/Argo cluster and my production cluster, which houses, among other things, this site. I had no desire for wanton destruction of these clusters, so I set up a test cluster to see if I could replicate my results. I was able to, at which point I turned to the RKE project on GitHub for help.

After a few days, someone pointed me to a relatively new Rancher issue describing my predicament. If you read through the related issues, you’ll find that etcd 3.5 has a bug where node removal can corrupt its database, causing exactly the kind of failure I was seeing. The bug was fixed in etcd 3.5.3.
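
If you want to know whether your own cluster is on an affected build, RKE runs etcd as a plain Docker container named etcd on each etcd node (that name is the RKE default; adjust if yours differs), so checking on one of those nodes is enough:

    docker inspect etcd --format '{{ .Config.Image }}'
    docker exec etcd etcdctl version

Anything on the 3.5 line older than 3.5.3 is suspect.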

I upgraded my RKE CLI and ran another test with the latest Rancher Kubernetes version. This time, finally, success! I was able to remove etcd nodes without crashing the cluster.
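
My “success” check after each removal was nothing fancy. RKE’s etcd container normally has etcdctl wired up with the cluster certificates, so roughly:

    kubectl get nodes                          # removed node gone, the rest Ready
    docker exec etcd etcdctl endpoint health   # run on a surviving etcd node
    docker exec etcd etcdctl member list       # removed member no longer listed

If etcdctl complains about certificates on your setup, you would need to pass --cacert, --cert, and --key explicitly.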

Finishing up / Lessons Learned

Before doing anything else, I upgraded all of my clusters to the latest supported Kubernetes version, which in my case was v1.23.6-rancher1-1. Following the steps above, I was then able to work through upgrading both my Rancher/Argo cluster and my production cluster without bringing either of them down.
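
The version bump itself is also driven from cluster.yml. With a current RKE CLI you can list the Kubernetes versions it supports, pin the one you want, and apply it; a sketch:

    # list the Kubernetes versions this RKE CLI release supports
    rke config --list-version --all

    # in cluster.yml:
    #   kubernetes_version: v1.23.6-rancher1-1

    # apply
    rke up --config cluster.yml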

Lessons learned? Well, patience is key (don’t kill cluster management processes mid-run), but also, sometimes it is worth testing before you try things. Had any of these clusters NOT been my home lab clusters, this seemingly simple process would have meant downtime for more important systems.

A note on Cluster Recovery

For both the internal and non-production clusters, I could have scrambled to recover the etcd volume for each cluster and bring it back to life. But I realized that there was no truly valuable data in either one. The ELK logs are useful in real time, but I have not started down the path of analyzing history, so I didn’t mind losing them. And even those live on my SAN, and the PVCs get archived when no longer in use.

Instead of a long, drawn-out recovery process, I simply stood up brand new clusters, pointed my instance of Argo at them, and updated my Argo applications to deploy to the new clusters. Inside of an hour, my apps were back up and running. This is something of a testament to the benefits of storing a cluster’s state in a repository: recreation was nearly automatic.
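
For anyone curious what “pointed my instance of Argo at them” looks like in practice: with the Argo CD CLI you register the new cluster from a kubeconfig context and then update each Application’s destination. A rough sketch, with the context name and API endpoint as placeholders:

    # register the freshly built cluster with Argo CD
    argocd cluster add new-prod-admin

    # then update each Application's destination, e.g.:
    #   spec:
    #     destination:
    #       server: https://new-prod.example.internal:6443
    #       namespace: my-app

With the Application definitions already stored in the repository, that destination change was most of the work.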

