Do the Right Thing

My home lab clusters have been running fairly stably, but there are still hiccups every now and again. As usual, a little investigation led to a pretty substantial change.

Cluster on Fire

My production and non-production clusters, which mostly host my own projects, have always been pretty stable. Both clusters are set up with 3 nodes as the control plane, since I wanted more than 1 and you need an odd number for quorum. And since I didn’t want to run MORE machines as agents, I just let those nodes host user workloads in addition to the control plane. With 4 vCPUs and 8 GB of RAM per node, well, those clusters had no issues.

My “internal tools” cluster is another matter. Between Mimir, Loki, and Tempo all handling ingestion, there is a lot going on in that cluster. I added a fourth node that serves purely as an agent for that cluster, but I still had some pod stability issues.

I started digging into the node-exporter metrics for the three “control plane + worker” nodes in the internal cluster, and they were, well, on fire. The system load was consistently over 100% (the 15-minute average was something like 4.05 out of 4 on all three). I was clearly crushing those nodes. And, since those nodes hosted the control plane as well as the workloads, instability in the control plane caused instability in the cluster.
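
For reference, a quick way to see that ratio is a query along these lines in Prometheus/Grafana. It’s just a sketch that assumes the standard node-exporter metric names; adjust the labels to match your own scrape config:

# 15-minute load average divided by CPU count, per node;
# anything hovering around 1.0 means the node is saturated
node_load15
  / on (instance)
count by (instance) (node_cpu_seconds_total{mode="idle"})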

Isolating the Control Plane

At that point, I decided that I could not wait any longer. I had to isolate the control plane and etcd from the rest of my workloads. While I know that it is, in fact, best practice, I was hoping to avoid it in the lab, as it causes a slight proliferation of VMs. How so? Let’s do the math:

Right now, all of my clusters have at least 3 nodes, and internal has 4. So that’s 10 VMs with 4 vCPUs and 8 GB of RAM assigned to each, or 40 vCPUs and 80 GB of RAM in total. If I want all of my clusters to have isolated control planes, that means more VMs. But…

Control plane nodes don’t need to be nearly as large if I’m not opening them up to other workloads. And for my non-production cluster, I don’t need the redundancy of multiple control plane nodes. So 4 vCPUs/8 GB RAM becomes 2 vCPUs/4 GB RAM per control plane node, and the non-production cluster can get by with a single one. But what about the actual work? To start, I’ll use two 4 vCPU/8 GB RAM worker nodes each for production and non-production, and three of that same size for the internal cluster.

In case you aren’t keeping a running total, the new plan is as follows:

  • 7 small nodes (2 vCPUs/4 GB RAM) for the control planes across the three clusters: 3 each for internal and production, 1 for non-production
  • 7 medium nodes (4 vCPUs/8 GB RAM) for the worker nodes across the three clusters: 2 each for production and non-production, 3 for internal

So, it’s 14 VMs, up from 10, but the totals only grow from 40 vCPUs/80 GB of RAM to 42 vCPUs/84 GB of RAM: an extra 2 vCPUs and 4 GB. I suppose I can live with that.

Make it so!

Since most of my server creation is scripted, I only had to make a few changes to support this updated structure. I added a taint to the RKE2 configuration for the servers so that only critical items get scheduled on them:

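# /etc/rancher/rke2/config.yaml on the server nodes (the default config location)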
node-taint:
- CriticalAddonsOnly=true:NoExecute

I also removed any server nodes from the tfx-<cluster name> DNS record, since the Nginx pods will only run on agent nodes now.
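
As a quick sanity check afterwards, a lookup like this should now return only the agent node addresses (dig is a real command; the record name below is just the placeholder from above plus whatever domain you use):

dig +short tfx-<cluster name>.<internal domain>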

Once that was done, I just had to provision new agent nodes for each of the clusters, and then replace the current server nodes with newly provisioned nodes that have a smaller footprint and the appropriate taints.
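
For reference, the agent side of that provisioning isn’t much more than a plain RKE2 agent config pointing at the cluster, with no taint so it will happily take user workloads. A rough sketch, with the server address and token as placeholders:

# /etc/rancher/rke2/config.yaml on an agent node
server: https://<existing server node>:9345
token: <cluster join token>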

It’s worth noting that, in order to prevent too much churn, I manually added the CriticalAddonsOnly taint to each existing server node AFTER I had all the agents provisioned but before I started replacing server nodes (note the NoSchedule effect below, rather than the NoExecute in the config above, so existing pods were not evicted immediately). That way, Kubernetes would not attempt to schedule a user workload coming off an old server onto another server node, but would instead force it onto an agent. For your reference and mine, that command looks like this:

kubectl taint nodes <node name> CriticalAddonsOnly=true:NoSchedule
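
And to confirm that nothing besides the critical add-ons is still parked on a tainted server node, a quick check like this does the trick (the node name is a placeholder):

kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node name>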

Results

I would classify this as a success with an asterisk next to it. I need more time to determine if the cluster stability, particularly for the internal cluster, improves with these changes, so I am not willing to declare outright victory.

It has, however, given me a much better view into how much processing I actually need in a cluster. For my non-production cluster, the two agents are averaging under 10% load, which means I could probably lose one agent and still be well under 50% load on that node. The production agents are averaging about 15% load. Sure, I could consolidate, but part of the desire is to have some redundancy, so I’ll stick with two agents in production.

The internal cluster, however, is running pretty hot. I’m running a number of pods for Grafana Mimir/Loki/Tempo ingestion, as well as Prometheus for the cluster itself. So those three nodes are running at about 50-55% average load, with spikes above 100% on the one agent that is running both the Prometheus collector and a copy of the Mimir ingester. I’m going to keep an eye on that and see if the load creeps up. In the meantime, I’ll also be looking to see what, if anything, can be optimized or offloaded. If I find something to fix, you can be sure it’ll make a post.

