Nothing says “Friday Night Fun” like crashing an RKE Cluster!

Yes… I crashed my RKE clusters in a big way yesterday evening, and I spent a lot of time getting them back. I learned a few things in the process, and may have gotten the kickstart I need to investigate new Kubernetes flavors.

It all started with an upgrade…

All I wanted to do was go from Kubernetes 1.24.8 to 1.24.9. It seemed a simple ask. I downloaded the new RKE command line tool (version 1.4.2), updated my cluster.yaml file, and ran rke up. The cluster upgraded without errors… but all the pods were in an error state. I detailed my findings in a GitHub issue, so I will not repeat them here. Thankfully, I was able to downgrade, and things started working.

Sometimes, when I face these types of situations, I’ll stand up a new cluster to test the upgrade/downgrade process. I figured that would be a good idea, so I kicked off a new cluster provisioning script.

Now, in recent upgrades, sometimes an upgrade of the node is required to make the Kubernetes upgrade run smoothly. So, on my internal cluster, I attempted the upgrade to 1.24.9 again, and then upgraded all of my nodes with an apt update && apt upgrade -y. That seemed to work: the pods came back online. So I figured I would try with production… This is where things went sideways.

First, I “flipped” the order, and I upgraded the nodes first. Not only did this put all of the pods in an error state, but the upgrade took me to Docker version 23, which RKE doesn’t support. So there was no way to run rke up, even to downgrade to another version. I was, well, up a creek, as they say.
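In hindsight, a small preflight check on each node would have flagged the problem before I got anywhere near rke up. The following is only a sketch: the function names are made up for illustration, and the cutoff of major version 20 is an assumption based on what happened to me here (20.10 worked, 23 did not), not an official support matrix.

```shell
#!/usr/bin/env bash
# Preflight sketch: warn if Docker's major version is newer than the newest
# one we believe RKE accepts. MAX_DOCKER_MAJOR=20 is an assumption taken from
# this post; check the RKE release notes for the real limit.
MAX_DOCKER_MAJOR="${MAX_DOCKER_MAJOR:-20}"

# Print the major component of a version string such as "23.0.1".
docker_major() {
  printf '%s\n' "${1%%.*}"
}

# Return 0 if the given Docker version is acceptable, 1 otherwise.
docker_supported() {
  local major
  major="$(docker_major "$1")"
  [ "$major" -le "$MAX_DOCKER_MAJOR" ]
}

# Check the local Docker daemon; meant to be run on a node before rke up.
check_node() {
  local version
  version="$(docker version --format '{{.Server.Version}}')" || return 2
  if docker_supported "$version"; then
    echo "Docker $version looks OK"
  else
    echo "Docker $version is newer than major $MAX_DOCKER_MAJOR; do not run rke up" >&2
    return 1
  fi
}
```

Sourcing this file and calling check_node on a node right after any package upgrade would at least turn a silent Docker bump into a loud one.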

I lucked out

Luckily, earlier in the day I had provisioned three machines and created a small non-production cluster to test the issue I was seeing in RKE. So I had an empty Kubernetes 1.24.9 cluster running. With Argo, I was able to “transfer” the workloads from production to non-production simply by changing the ApplicationSet/Application target. The only caveat was that I had to copy files around on my NFS to get them in the correct place. I managed to get all this done and only register 1 hour and 54 minutes of downtime, which, well, is not bad.

Cleaning Up

Now, the nodes for my new “production” cluster were named nonprod, and my OCD would never let that stand. So I provisioned three new nodes, created a new production cluster, and transferred workloads to the new cluster. Since I don’t have auto-prune set, when I changed the ApplicationSet/Application cluster to the new one, the old applications stayed running. This allowed me to get things set up on the new cluster and then cut over on the reverse proxy with no downtime.

There was still the issue of the internal cluster. Sure, the pods were running, but on nodes with Docker 23, which is not supported. I had HOPED that I could provision a new set of nodes, add them to the cluster, and remove the old ones. I had no such luck.

The RKE command line tool will not work on nodes with Docker 23. So, using the nodes I provisioned, I created yet another new cluster, and went about the process of transferring my internal tools workloads to it.

This was marginally more difficult, because I had to manually install Nginx Ingress and Argo CD using Helm before I could cut over to the new Argo CD and let it manage the rest of the conversion. However, as all of my resources are declaratively defined in Git repositories, the move was much easier than reinstalling everything from scratch.

Lessons Learned

For me, RKE upgrades have been flaky the last few times. The best way to ensure success is to cycle new, fully upgraded nodes with Docker 20.10 into the cluster, remove the old ones, and then upgrade. With any other method, I have run into issues.

Also, I will NEVER EVER run apt upgrade on my nodes again. I clearly do not have my application packages pinned correctly, which means I run the risk of getting an invalid version of Docker.
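One way to close that pinning gap is to put the Docker packages on hold with apt-mark, so that a blanket upgrade skips them. This is a sketch, not what my nodes currently run: the package names assume Docker was installed from Docker's own apt repository, and the run, hold_docker, and unhold_docker helpers are hypothetical names I made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: hold the Docker packages so a blanket `apt upgrade` leaves them alone.
# Assumption: Docker came from Docker's apt repo under these package names;
# substitute docker.io or whatever `dpkg -l | grep -i docker` shows on the node.
DOCKER_PACKAGES="docker-ce docker-ce-cli containerd.io"

# Run a command, or just print it when DRY_RUN=1 (handy for checking first).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Mark the Docker packages as held; apt will not upgrade held packages.
hold_docker() {
  # Word splitting of DOCKER_PACKAGES is intended here.
  run sudo apt-mark hold $DOCKER_PACKAGES
}

# Release the hold when a Docker upgrade is actually wanted.
unhold_docker() {
  run sudo apt-mark unhold $DOCKER_PACKAGES
}
```

After sourcing the file, running hold_docker once per node and then apt-mark showhold should list the held packages. Setting DRY_RUN=1 prints the command instead of running it.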

I am going to start investigating other Kubernetes flavors. I like the simplicity that RKE1 provides, but the response from the community is slow, if it comes at all. I may stand up a few small clusters just to see which ones make the most sense for the lab. I need something that is easy to keep updated, and RKE1 is not fitting that bill anymore.