I’ll take ‘Compatibility Matrices’ for $1000, Alex…

I have posted a lot on Kubernetes over the past weeks. I have covered a lot around tools for observability, monitoring, networking, and general usefulness. But what I ran into over the past few weeks is both infuriating and completely avoidable, with the right knowledge and time to keep up to speed.

The problem

I do my best to try and keep things up to date in the lab. Not necessarily bleeding edge, but I make an effort to check for new versions of my tools and update them. This includes the Kubernetes versions of my clusters.

Because I use RKE, I am limited to the Kubernetes versions that RKE supports. So I had been running 1.23.6 (or, more specifically, v1.23.6-rancher1-1) for at least a month or so.

About two weeks ago, I updated my RKE command line and noticed that v1.23.8-rancher1-1 was available. So I changed the cluster.yml in my non-production environment and ran rke up. No errors, no problems, right? So I made the same change to the internal cluster and started rke up for that cluster. As that was processing, however, I noticed that my non-production cluster was down. Like, the Kube API was running, so I could get the pod list. But every pod was erroring out. I did not notice anything in the pod events that was even remotely helpful, other than that the pods couldn’t be scheduled. As I didn’t have time to diagnose it properly, I attempted to roll the cluster back to 1.23.6. That worked, so I left it alone on the older version.
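Since these version tags are easy to mistype, a small parser can act as a sanity check before editing cluster.yml. This is a minimal sketch, not anything RKE ships: the names parse_rke_version and RkeVersion are hypothetical, and the tag layout (v<major>.<minor>.<patch>-rancher<N>-<M>) is an assumption inferred from tags such as v1.23.6-rancher1-1.

```python
import re
from typing import NamedTuple


class RkeVersion(NamedTuple):
    """Hypothetical container for the pieces of an RKE Kubernetes version tag."""
    major: int
    minor: int
    patch: int
    rancher_build: int
    revision: int


# Assumed layout: v<major>.<minor>.<patch>-rancher<N>-<M>
_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)-rancher(\d+)-(\d+)$")


def parse_rke_version(tag: str) -> RkeVersion:
    """Parse a tag, raising ValueError if it does not match the assumed layout."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    return RkeVersion(*(int(part) for part in match.groups()))


current = parse_rke_version("v1.23.6-rancher1-1")
target = parse_rke_version("v1.23.8-rancher1-1")

# Named tuples compare field by field, so this tells an upgrade from a downgrade.
print("upgrade" if target > current else "downgrade or same version")
```

A mistyped tag fails loudly with a ValueError instead of ending up in cluster.yml.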

I will not let the machines win, so I stepped back into this problem today. I tried upgrading again (both to 1.23.8 and 1.24.2), with the same problems. In digging into the docker container logs on the hosts, I found the smoking gun:

Unit kubepods-burstable.slice not found.

Eh? I can say I have never seen that phrase before. But a quick Google search pointed me towards cgroups and, more generally, the Docker runtime.
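If you would rather not eyeball the logs on every node, the search can be scripted. The sketch below assumes you have already saved the container output to text (for example, from docker logs) and that the error always looks like the single line quoted above; the function name find_missing_slices is hypothetical, and the first sample line is invented purely for the demo.

```python
import re
from typing import Iterable, List

# Pattern built from the one error line seen in this post; other wordings are not covered.
SLICE_NOT_FOUND = re.compile(r"Unit (\S+\.slice) not found")


def find_missing_slices(log_lines: Iterable[str]) -> List[str]:
    """Return each distinct systemd slice name reported as missing, in first-seen order."""
    missing: List[str] = []
    for line in log_lines:
        match = SLICE_NOT_FOUND.search(line)
        if match and match.group(1) not in missing:
            missing.append(match.group(1))
    return missing


# Sample input: the first line is made up, the second is the error quoted in this post.
sample = [
    "some unrelated log line",
    "Unit kubepods-burstable.slice not found.",
]
print(find_missing_slices(sample))
```

An empty result only means this one message was not found, not that the node is healthy.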

Compatibility, you fickle fiend

As it turns out, there is quite a large compatibility matrix between Rancher, RKE, Kubernetes, and Docker itself. My move to 1.23.8 clearly pushed my Kubernetes version past what my Docker version supported (which, if you care, was Docker 20.10.12 on Ubuntu 22.04).
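A check like this could have caught the mismatch before rke up ever ran. The following is only a sketch of the idea: docker_is_supported is a hypothetical helper, and the minimum Docker version in EXAMPLE_MIN_DOCKER is a made-up number for illustration, not a value taken from the real support matrix. Copy the real values from the published matrix before relying on anything like this.

```python
from typing import Dict, Optional, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Turn a plain dotted version such as '1.23.8' or 'v1.23.8' into (1, 23, 8)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def docker_is_supported(
    k8s: str, docker: str, min_docker: Dict[str, str]
) -> Optional[bool]:
    """Return True/False if the Kubernetes version is listed, None if it is not."""
    required = min_docker.get(k8s)
    if required is None:
        return None  # version not in the table, so nothing can be concluded
    return parse_version(docker) >= parse_version(required)


# MADE-UP value for illustration only; NOT from the real support matrix.
EXAMPLE_MIN_DOCKER = {
    "1.23.8": "20.10.15",
}

print(docker_is_supported("1.23.8", "20.10.12", EXAMPLE_MIN_DOCKER))
```

Returning None for unlisted versions is deliberate: a missing row in the table should prompt a look at the documentation, not be treated as a pass.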

I downgraded the non-production cluster once again, then upgraded Docker on those nodes (a simple sudo apt upgrade docker). Then I tried the upgrade, first to v1.23.8 (baby steps, folks).

Success! Kubernetes version was upgraded, and all the pods restarted as expected. Throwing caution to the wind, I upgraded the cluster to 1.24.2. And, this time, no problems.

Lessons Learned

Kubernetes is a great orchestration tool. Coupled with the host of other CNCF tooling available, one can design and build robust application frameworks which let developers focus more on business code than deployment and scaling.

But that robustness comes with a side of complexity. The host of container runtimes, Kubernetes versions, and third-party hosts means cluster administration is, and will continue to be, a full-time job. Just look at the Rancher Manager matrix at SUSE. Thinking about keeping production systems running while juggling the compatibility of all these different pieces makes me glad that I do not have to manage these beasts daily.

Thankfully, cloud providers like Azure, GCP, and AWS provide some respite by simplifying some of this. Their images almost force compatibility, so that one doesn’t run into the issues that I ran into on my home cluster. I am much more confident in their ability to run a production cluster than I am in my own skills as a Kubernetes admin.