I had a bit of a panic this week as routine tasks took me down a rabbit hole in Kubernetes. The more I manage my home lab clusters, the more I realize I do not want to be responsible for bare metal clusters at work.
It was a typical upgrade…
With ArgoCD in place, the contents of my clusters are very neatly defined in my various infrastructure repositories. I even have a small PowerShell script that checks for the latest versions of the tools I have installed and updates my repositories with the latest and greatest.
I ran that script today and noticed a few minor updates to some of my charts, so I figured I would apply those at lunch. Typically this is very smooth, and the updates are applied in a few minutes.
However, after having ArgoCD sync, I realized that the Helm charts I was upgrading were stuck in a “Progressing” state. I checked the pods, and they were in a perpetual “Pending” state.
My typical debugging found nothing: there were no events explaining why the pods could not be scheduled, and there were no logs, since the containers were never even created.
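For context, my usual triage looks something like this (the namespace and pod names here are placeholders):

```sh
# List pods stuck in Pending across all namespaces
kubectl get pods -A | grep Pending

# Events on the pod usually explain a scheduling failure; here they did not
kubectl describe pod my-pod -n my-namespace

# Cluster-wide events, sorted by time, as a second opinion
kubectl get events -A --sort-by=.metadata.creationTimestamp
```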
My first Google searches suggested a problem with persistent volumes/persistent volume claims. So I poked around in those, going so far as to delete them (after backing up the folders on my NFS target), but, well, no luck.
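For what it is worth, that poking around was roughly this (the claim name is a placeholder):

```sh
# Inspect the volumes and claims backing the charts
kubectl get pv
kubectl get pvc -n my-namespace
kubectl describe pvc my-claim -n my-namespace

# Last resort, and only after backing up the underlying NFS folders
kubectl delete pvc my-claim -n my-namespace
```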
And there it was…
Since the pods were apparently unschedulable, I started trying to find the logs for the scheduler. I could not find them. So I logged in to my control plane node to see if the scheduler was even running… It was not.
I checked the logs for that container, and nothing stood out. It just kind of “died.” I tried restarting the Docker container… Nope. I even tried re-running the rke up command for that cluster, to see if Rancher Kubernetes Engine would restart it for me… Nope.
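Since RKE runs the control plane components as plain Docker containers on the node, the checks looked roughly like this (container name as RKE creates it):

```sh
# On the control plane node: is the scheduler container even running?
docker ps -a --filter name=kube-scheduler

# Look for a reason it died
docker logs --tail 100 kube-scheduler

# Try kicking it back to life
docker restart kube-scheduler
```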
So, running out of options, I changed my cluster.yml configuration file to add the control plane role to another node in the cluster and re-ran rke up. And that, finally, worked. At that point, I removed the control plane role from the old control plane node and modified my DNS entries to point my hostnames at the new control plane node. With that, everything came back up.
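For illustration, the relevant part of cluster.yml went from one control plane node to two, roughly like this (the addresses and user are made up):

```yaml
nodes:
  - address: 192.168.1.10       # original control plane node; role removed afterwards
    user: rancher
    role: [controlplane, etcd, worker]
  - address: 192.168.1.11       # existing node, promoted by adding controlplane
    user: rancher
    role: [controlplane, worker]
```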
Monitoring?
I wanted to write an alert in Mimir to catch this so I would know about it before I went digging around. It was at this point I realized that I am not collecting any metrics from the Kubernetes components themselves. And, frankly, I am not sure how. RKE installs metrics-server by default, but that only serves resource usage for things like kubectl top; I have not found a way to scrape metrics from the Kubernetes components.
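If I do crack it, I suspect the answer looks something like a static scrape job aimed at each component's metrics port. Here is a minimal sketch for a Prometheus-compatible scraper, assuming the scheduler's HTTPS metrics endpoint (port 10259 on recent Kubernetes versions) is reachable from the scraper and the token is authorized; RKE may well bind these endpoints to localhost, which could be exactly why I have not found a way yet:

```yaml
scrape_configs:
  - job_name: kube-scheduler
    scheme: https
    tls_config:
      insecure_skip_verify: true        # home lab shortcut; use the real CA otherwise
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    static_configs:
      - targets: ["192.168.1.11:10259"] # control plane node; address is made up
```

With a job like that in place, the alert I wanted in Mimir becomes almost trivial: fire whenever up{job="kube-scheduler"} == 0.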
My Google searches have been fruitless, and it has been a long work week, so this problem will most likely have to wait for a bit. If I come up with something, I will update this post.