Since cycling my cluster nodes is a “fire script and wait” operation, I kicked one off today. I ended up running into an issue that required me to dig a bit into ETCD in RKE2, and could not find direct help, so this is as much my own reference as it is a guide for others.
I broke it…
When provisioning new machines, I still have some odd behaviors when it comes to IP address assignment. I do not set the IP address manually: I use a static MAC address on the VM and then create a fixed IP for that MAC address. About 90% of the time, that works great. Every so often, though, in the provisioning process, the VM picks up an IP address from the DHCP instead of the fixed IP, and that wrecks stuff, especially around ETCD.
This happened today: In standing up a replacement, the new machine picked up a DHCP IP. Unfortunately, I didn’t remove the machine properly, which caused my ETCD cluster to still see the node as a member. When I deleted the node and tried to re-provision, I got ETCD errors because I was trying to add a node name that already exists.
Getting in to ETCD
RKE2’s docs are a little quiet on actually viewing what’s in ETCD. Through some googling, I figured out that I could use
etcdctl to show and manipulate members, but I couldn’t figure out how to actually run the command.
As it turns out, the easiest way to run it is to run it on one of the ETCD pods itself. I came across this bug report in RKE2 that indirectly showed me how to run
etcdctl commands from my machine through the ETCD pods. The member list command is
kubectl -n kube-system exec <etcd_pod_name> -- sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member list"
Note all the credential setting via environment variables. In theory, I could “jump in” to the etcd pod using a simple
sh command and run a session, but keeping it like this forces me to be judicious in my execution of
I found the offending entry and removed it from the list, and was able to run my cycle script again and complete my updates.