Tag: RKE

Moving On: Testing RKE2 Clusters in the Home Lab
I spent the better part of the weekend recovering from crashing my RKE clusters last Friday. This put me on a path towards researching new Kubernetes clusters and determining the best path forward for my home lab.

Intentionally Myopic

Let me be clear: This is a home lab, whose purpose is not to help be build bulletproof, corporate production-ready clusters. I also do not want to run Minikube on a box somewhere. So, when I approached my “research” (you will see later why I put that term in quotes), I wanted to make sure I did not get bogged down in the minutia of different Kubernetes installs or details. I stuck with Rancher Kubernetes Engine (RKE1) for a long time because it was quick to stand up, relatively stable, and easy to manage.

So, when I started looking for alternatives, my first research was into whether Rancher had any updated offerings. And, with that, I found RKE2.

RKE2, aka RKE Government

I already feel safer knowing that RKE2’s alter ego is RKE Government. All joking aside, as I dug into RKE2, it seemed a good mix of RKE1, which I am used to, and K3s, a lightweight implementation of Kubernetes. The RKE2 documentation was, frankly, much more intuitive and easy to navigate than the RKE1 documentation. I am not sure if it is because the documentation is that much better or if because RKE2 is that much easier to configure.

I could spend pages upon pages explaining the experiments I ran over the last few evenings, but the proof is in the pudding, as they say. My provisioning-projects repository has a new Powershell script (Create-Rke2Cluster.ps1) that outlines the steps needed to get a cluster configured. My work, then, came down to how I wanted to configure the cluster.

RKE1 Roles vs RKE2 Server/Agent

RKE1 had a notion of node roles which were divided into three categories:
- controlplane – Nodes with this role hosts the Kubernetes APIs
- etcd – Nodes with this role host the etcd storage containers. There should be an odd number, at least 3 is a good choice.
- worker – Nodes with this role can run workloads within the cluster.
My RKE1 clusters typically have the following setup:
- One node with controlplane, etcd, and worker roles.
- Two nodes with etcd and worker roles.
- If needed, additional nodes with just the worker role.
This seemed to work well: I had proper redundancy with etcd and enough workers to host all of my workloads. Sure, I only had one control plane, so if that node went down, well, the cluster would be in trouble. However, I usually did not have much problem with keeping the nodes running so I left it as it stood.

With RKE2, there is simply a notion of server and agent. The server node runs etcd and the control plane components, while agents run only user defined workloads. So, when I started planning my RKE2 clusters, I figured I would run one server and two agents. The lack of etcd redundancy would not have me losing sleep at night, but I really did not want to run 3 servers and then more agents for my workloads.

As I started down this road, I wondered how I would be able to cycle nodes. I asked the #rke2 channel on rancher-users.slack.com, and got an answer from Brad Davidson: I should always have at least 2 available servers, even when cycling. However, he did mention something that was not immediately apparent: the server can and will run user-defined workloads unless the appropriate taints have been applied. So, in that sense, an RKE2 server acts similarly to my “all roles” node, where it functions as a control plane, etcd, and worker node.

The Verdict?

Once I saw a path forward with RKE2, I really have not looked back. I have put considerable time into my provisioning projects scripts, as well as creating a new API wrapper for Windows DNS management (post to follow).

“But Matt, you haven’t considered Kuberenetes X or Y?”

I know. There are a number of flavors of Kubernetes that can run your bare metal servers. I spent a lot of time and energy in learning RKE1, and I have gotten very good at managing those clusters. RKE2 is familiar, with improvements in all the right places. I can see automating not only machine provisioning, but the entire process of node replacement. I would love nothing more than to come downstairs on a Monday morning and see newly provisioned cluster nodes humming away after my automated process ran.

So, yes, maybe I skipped a good portion of that “research” step, but I am ok with it. After all, it is my home lab: I am more interested in re-visiting Gravitee.io for API management and starting to put some real code services out in the world.
February 8, 2023
Nothing says “Friday Night Fun” like crashing an RKE Cluster!

Yes… I crashed my RKE clusters in a big way yesterday evening, and I spent a lot of time getting them back. I learned a few things in the process, and may have gotten the kickstart I need to investigate new Kubernetes flavors.

It all started with an upgrade…

All I wanted to do was go from Kubernetes 1.24.8 to 1.24.9. It seems a simple ask. I downloaded the new RKE command line tool (version 1.4.2), updated my cluster.yaml file, and ran rke up. The cluster upgraded without errors… but all the pods were in an error state. I detailed my findings in a Github issue, so I will not repeat them here. Thankfully, I was able to downgrade, and things started working.

Sometimes, when I face these types of situations, I’ll stand up a new cluster to test the upgrade/downgrade process. I figured that would be a good idea, so I kicked off a new cluster provisioning script.

Now, in recent upgrades, sometimes an upgrade of the node is required to make the Kubernetes upgrade run smoothly. So, on my internal cluster, I attempted the upgrade to 1.24.9 again, and then upgraded all of my nodes with an apt update && apt upgrade -y. That seemed to work, the pods came back online, so I figured I would try with production… This is where things went sideways.

First, I “flipped” the order, and I upgraded the nodes first. Not only did this put all of the pods in an error state, but the upgrade took me to Docker version 23, which RKE doesn’t support. So there was no way to run rke up, even to downgrade to another version. I was, well, up a creek, as they say.

I lucked out

Luckily, earlier in the day I had provisioned three machines and created a small non-production cluster to test the issue I was seeing in RKE. So I had an empty Kubernetes 1.24.9 cluster running. With Argo, I was able to “transfer” the workloads from production to non-production simply by changing the ApplicationSet/Application target. The only caveat was that I had to copy files around on my NFS to get them in the correct place. I managed to get all this done and only register 1 hour and fifty four minutes of downtime, which, well, is not bad.

Cleaning Up

Now, the nodes for my new “production” cluster were named nonprod, and my OCD would never let that stand. So I provisioned three new nodes, created a new production cluster, and transferred workloads to the new cluster. Since I don’t have auto-prune set, when I changed the ApplicationSet/Application cluster to the new one, the old applications stayed running. This allowed me to get things set up on the new cluster and then cutover on the reverse proxy with no downtime.

There was still the issue of the internal cluster. Sure, the pods were running, but on nodes with Docker 23, which is not supported. I had HOPED that I could provision a new set of nodes, add them to the cluster, and remove the old ones. I had no such luck.

The RKE command line tool will not work on nodes with docker 23. So, using the nodes I provisioned, I created yet another new cluster, and went about the process of transferring my internal tools workloads to it.

This was marginally more difficult, because I had to manually install Nginx Ingress and Argo CD using Helm before I could cutover to the new ArgoCD and let the new one manage the rest of the conversion. However, as all of my resources are declaratively defined in Git repositories, the move was much easier than reinstalling everything from scratch.

Lessons Learned

For me, RKE upgrades have been flaky the last few times. The best way ensure success is to cycle new, fully upgraded nodes with docker 20.10 into the cluster, remove the old ones, and then upgrade. Any other method and I have run into issues.

Also, I will NEVER EVER run apt upgrade on my nodes again. I clearly do not have my application packages pinned correctly, which means in run the risk of getting an invalid version of Docker.

I am going to start investigating other Kubernetes flavors. I like the simplicity that RKE 1 provides, but the response from the community is slow if at all. I may stand up a few small clusters just to see which ones make the most sense for the lab. I need something that is easy to keep updated, and RKE1 is not fitting that bill anymore.

February 4, 2023
Lessons in Managing my Kubernetes Cluster: Man Down!

I had a bit of a panic this week as routine tasks took me down a rabbit hole in Kubernetes. The more I manage my home lab clusters, the more I realize I do not want to be responsible for bare metal clusters at work.

It was a typical upgrade…

With ArgoCD in place, the contents of my clusters is very neatly defined in my various infrastructure repositories. I even have a small Powershell script that checks for the latest versions of the tools I have installed and updates my repositories with the latest and greatest.

I ran that script today and noticed a few minor updates to some of my charts, so I figured I would apply those at lunch. Pretty typically, it is a very smooth application and the updates are applied in a few minutes.

However, after having ArgoCD sync, I realized that the Helm charts I was upgrading was stuck in a “Progressing” state. I checked the pods, and they were in a perpetual “Pending” state.

My typical debugging found nothing: there were no events around the pod being unable to be scheduled for any particular reason, and there was no log file, since the pods were not even being created.

My first Google searches suggested a problem with persistent volumes/persistent volume claims. So I poked around in those, going so far as deleting them (after backing up the folders in my NFS target), but, well, no luck.

And there it was…

As it was “unschedulable,” I started trying to find the logs for the scheduler. I could not find them. So I logged in to my control plane node to see if the scheduler was even running… It was not.

I checked the logs for that container, and nothing stood out. It just kind of “died.” I tried restarting the Docker container…. Nope. I even tried re-running the rke up command for that cluster, to see if Rancher Kubernetes Engine would restart it for me…. Nope.

So, running out of options, I changed my cluster.yml configuration file to add a control plane role to another node in the cluster and re-ran rke up. And that, finally, worked. At that point, I removed the control plane role from the old control plane node and modified the DNS entries to point my hostnames to the new control plane node. With that, everything came back up.

Monitoring?

I wanted to write an alert in Mimir to catch this so I would know about it before I dug around. It was at this point I realized that I am not collecting any metrics from the Kubernetes components themselves. And, frankly, I am not sure how. RKE installs metrics-server by default, but I have not found a way to scrape metrics from Kubernetes components.

My Google searches have been fruitless, and it has been a long work week, so this problem will most likely have to wait for a bit. If I come up with something I will update this post.

November 11, 2022
Hitting for the cycle…
Well, I may not be hitting for the cycle, but I am certainly cycling Kubernetes nodes like it is my job. The recent OpenSSL security patches got me thinking that I need to cycle my cluster nodes.

A Quick Primer

In Kubernetes, a “node” is, well, a machine performing some bit of work. It could be a VM or a physical machine, but, at its core, it is a machine with a container runtime of some kind that runs Kubernetes workloads.

Using the Rancher Kubernetes Engine (RKE), each node can have one or more of the following roles
- worker – The node can host user workloads
- etcd – The node is a member of the etcd storage cluster
- controlplane – The node is a member of the control plane
Cycle? What and why

When I say “cycling” the nodes, I am actually referring to the process of provisioning a new node, adding the new node to the cluster, and removing an old node from the cluster. I use the term “cycling” because, when the process is complete, my cluster is “back where it started” in terms of resourcing, but with a fresh and updated node.

But, why cycle? Why not just run the necessary security patches on my existing nodes? In my view, even at the home lab level, there a few reasons for this method.

A Clean and Consistent Node

As nodes get older, they incur the costs of age. They may have some old container images from previous workloads, or some cached copies of various system packages. In general, they collect stuff, and a fresh node has none of that cost. By provisioning new nodes, we can ensure that the latest provisioning is run and all the necessary updates are installed.

By using newly provisioned nodes each time, it prevents me from applying special configuration to nodes. If I need a particular configuration on a node, well, it has to be done in the provisioning scripts. All my cluster nodes are provisioned the same way, which makes them much more like cattle than pets.

No Downtime or Capacity Loss

Running node maintenance can potentially require a system reboot or service restarts, which can trigger downtime. In order to prevent downtime, it is recommended that nodes be “cordoned” (prevent new workload from being scheduled on that node) and “drained” (remove workloads from the node).

Kubernetes is very good at scheduling and moving workloads between nodes, and there is built-in functionality for cordon and draining of nodes. However, to prevent downtime, I have to remove a node for maintenance, which means I lose some cluster capacity during maintenance.

When you think about it, to do a zero-downtime update, your choices are:
1. Take a node out of the cluster, upgrade it, then add it back
2. Add a new node to the cluster (already upgraded) then remove the old node.
So, applying the “cattle” mentality to our infrastructure, it is preferred to have disposable assets rather than precisely manicured assets.

RKE performs this process for you when you remove a node from the cluster. That means, running rke up will remove old nodes if they have been removed from your cluster.yml.

To Do: Can I automate this?

As of right now, this process is pretty manual, and goes something like this:
1. Provision a new node – This part is automated with Packer, but I really only like doing 1 at a time. I am already running 20+ VMs on my hypervisor, I do not like the thought of spiking to 25+ just to cycle a node.
2. Remove nodes, one role at a time – I have found that RKE is most stable when you only remove one role at a time. What does that mean? It means, if I have a node that is running all three nodes, I need to remove the control plane role, then run rke up. Then remove the etcd role, and run rke up again. Then remove the node completely. I have not had good luck simply removing a node with all three roles.
3. Ingress Changes – I need to change the IPs on my cluster in two places:
  - In my external load balancer, which is a Raspberry Pi running nginx.
  - In my nginx-ingress installation on the cluster. This is done via my GitOps repository.
4. DNS Changes – I have aliases setup for the control plan nodes so that I can swap them in and out easily without changing other configurations. When I cycle a control plane node, I need to update the DNS.
5. Delete Old Node – I have a small Powershell script for this, but it is another step.
It would be wonderful if I could automate this into an Azure DevOps pipeline, but there are some problems with that approach.
1. Packer’s Hyper-V builder has to run on the host machine, which means I need to be able to execute the Packer build commands on my server. I’m not about about put the Azure DevOps agent directly on my server.
2. I have not found a good way to automate DNS changes, outside of using the Powershell module.
3. I need to automate the IP changes in the external proxy and the cluster ingress. Both of which are possible but would require some research on my end.
4. I would need to automate the RKE actions, specifically, adding new nodes and deleting roles/nodes from the cluster, and then running rke up as needed.
None of the above is impossible, however, it would require some effort on my part to research some techniques and roll them up into a proper pipeline. For the time being, though, I have set my “age limit” for nodes at 60 days, and will continue the manual process. Perhaps, after a round or two, I will get frustrated enough to start the automation process.
November 4, 2022
I’ll take ‘Compatibility Matrices’ for $1000, Alex…
I have posted a lot on Kubernetes over the past weeks. I have covered a lot around tools for observability, monitoring, networking, and general usefulness. But what I ran into over the past few weeks is both infuriating and completely avoidable, with the right knowledge and time to keep up to speed.

The problem

I do my best to try and keep things up to date in the lab. Not necessarily bleeding edge, but I make an effort to check for new versions of my tools and update them. This includes the Kubernetes versions of my clusters.

Using RKE, I am limited to what RKE supported. So I have been running 1.23.6 (or, more specifically, v1.23.6-rancher1-1) for at least a month or so.

About two weeks ago, I updated my RKE command line and noticed that v1-23.8-rancher1-1 was available. So I changed the cluster.yml in my non-production environment and ran rke up. No errors, no problems, right?? So I made the change to the internal cluster and started the rke up for that cluster. As that was processing, however, I noticed that my non-production cluster was down. Like, the Kube API was running, so I could get the pod list. But every pod was erroring out. I did not notice anything in the pod events that was even remotely helpful, other than it couldn’t be scheduled. As I didn’t have time to properly diagnose, I attempted to roll the cluster back to 1.23.6. That worked, so I downgraded and left it alone.

I will not let the machines win, so I stepped back into this problem today. I tried upgrading again (both to 1.23.8 and 1.24.2), with the same problems. In digging into the docker container logs on the hosts, I found the smoking gun:
```
Unit kubepods-burstable.slice not found.
```
Eh? I can say I have never seen that phrase before. But a quick Google search pointed me towards cgroup and, more generally, the docker runtime.

Compatibility, you fickle fiend

As it turns out, there is quite a large compatibility matrix between Rancher, RKE, Kubernetes, and Docker itself. My move to 1.23.8 clearly pushed my Kubernetes version past what my Docker version is (which, if you care, was Docker 20.10.12 on Ubuntu 22.04).

I downgraded the non-production cluster once again, then upgraded Docker on those nodes (a simple sudo apt upgrade docker). Then I tried the upgrade, first to v1.23.8 (baby steps, folks).

Success! Kubernetes version was upgraded, and all the pods restarted as expected. Throwing caution to the wind, I upgraded the cluster to 1.24.2. And, this time, no problems.

Lessons Learned

Kubernetes is a great orchestration tool. Coupled with the host of other CNCF tooling available, one can design and build robust application frameworks which let developers focus more on business code than deployment and scaling.

But that robustness comes with a side of complexity. The host of container runtimes, Kubernetes versions, and third party hosts means cluster administration is, and will continue to be, a full time job. Just look at the Rancher Manager matrix at SUSE. Thinking about keeping production systems running while jugging the compatibility of all these different pieces makes me glad that I do not have to manage these beasts daily.

Thankfully, cloud providers like Azure, GCP, and AWS, provide some respite by simplifying some of this. Their images almost force compatibility, so that one doesn’t run into the issues that I ran into on my home cluster. I am much more confident in their ability to run a production cluster than I am in my own skills as a Kubernetes admin.
August 23, 2022
Breaking an RKE cluster in one easy step
With the release of Ubuntu’s latest LTS release (22.04, or “Jammy Jellyfish), I wanted to upgrade my Kubernetes nodes from 20.04 to 22.04. What I had hoped would be an easy endeavor turned out to be a weeks-long process with destroyed clusters and, ultimately, an ETCD issue.

The Hypothesis

As I viewed it, I had two paths to this upgrade: in-place upgrade on the nodes, or bring up new nodes and decommission the old ones. As the latter represents the “Yellowstone” approach, I chose that one. My plan seemed simple:
- Spin up new Ubuntu 22.04 nodes using Packer.
- Add the new nodes to the existing clusters, assigning the new nodes all the necessary roles (I usually have 1 control_plane, 3 etcd, and all are worker
- Remove the control_plane from the old node and verify connectivity
- Remove the old nodes (cordon, drain, and remove)
Internal Cluster

After updating my Packer scripts for 22.04, I spun up new nodes for my internal cluster, which has an ELK stack for log collection. I added the new nodes without a problem, and thought that maybe I could combine the last two steps and just remove all the nodes at the same time.

That ended up with the Rancher CLI getting stuck in checking ETCD health. I may have gotten a little impatient and just killed the Rancher CLI process mid-job. This left me with, well, a dead internal cluster. So, I recovered the cluster (see my note on cluster recovery below) and thought I’d try again with my non-production cluster.

Non-production Cluster

Some say that the definition of insanity is doing the same thing and expecting a different result. My logic, however, was that I made two mistakes the first time through:
- Trying to remove the controlplane alongside of the etcd nodes in the same step
- Killing the RKE CLI command mid-stream
So I spun up a few new non-production nodes, added them to the cluster, and simply removed controlplane from the old node.

Success! My new controlplane node took over, and cluster health seemed good. And, in the interest of only changing one variable at a time, I decided to try and remove just one old node from the cluster.

Kaboom….

Same issue in recovering the etcd volume. So I recovered the cluster and returned to the drawing board.

Testing

At this point, I only had my Rancher/Argo cluster and my production cluster, which houses, among other things, this site. I had no desire for wanton destruction of these clusters, so I setup a test cluster to see if I could replicate my results. I was able to, at which point I turned to the RKE project in Github for help.

After a few days, someone pointed me to a relatively new Rancher issue describing my predicament. If you read through those various issues, you’ll find that etcd 3.5 has an issue where node removal can corrupt its database, causing issues such as mine. The issue was corrected in 3.5.3.

I upgraded my RKE CLI and ran another test with the latest rancher Kubernetes version. This time, finally, success! I was able to remove etcd nodes without crashing the cluster.

Finishing up / Lessons Learned

Before doing anything, I upgraded all of my clusters to the latest supported Kubernetes version. In my case, this is v1.23.6-rancher1-1. Following the steps above, I was, in fact, able to progress through upgrading both my Rancher/Argo cluster and my production cluster without bringing down the clusters.

Lessons learned? Well, patience is key (don’t kill cluster management processes mid-effort), but also, sometimes it is worth a test before you try things. Had any of these clusters NOT been my home lab clusters, this process, seemingly simple, would have incurred downtime in more important systems.

A note on Cluster Recovery

For both the internal and non-production clusters, I could have scrambled to recover the ETCD volume for that cluster and brought it back to life. But I realized that there was no real valuable data in either cluster. The ELK logs are useful real-time but I have not started down the path of analyzing history, so I didn’t mind losing them. And even those are on my SAN, and the PVCs get archived when no longer in use.

Instead of a long, drawn out recovery process, I simply stood up brand new clusters, pointed my instance of Argo to them and updated my Argo applications to deploy to the new cluster. Inside of an hour, my apps were back up and running. This is something of a testament to the benefits of storing a cluster’s state in a repository: recreation was nearly automatic.
May 9, 2022