Home Lab: Disaster Recovery and time for an Upgrade!

I had the opportunity to travel the U.S. Virgin Islands last week on a “COVID delayed” honeymoon. I have absolutely no complaints: we had amazing weather, explored beautiful beaches, and got a chance to snorkel and scuba in some of the clearest water I have seen outside of a pool.

While I was relaxing on the beach, Hurricane Ida was wrecking the Gulf and dropping rain across the East, including at home. This lead to power outages, which lead to my home server having something of a meltdown. And in this, I learned the value of good neighbors who can reboot my server and the cost of not setting up proper disaster recovery in Kubernetes.

The Fall

I was relaxing in my hotel room when I got text messages from my monitoring solution. And, at first, I figured “The power went out, things will come back up in 30 minutes or so.” But, after about an hour, nothing. So I texted my neighbor and asked if he could reset the server. After he reset, most of my sites came back up, with the exception of some of my SQL-dependent sites. I’ve had some issues with SQL Server’s not starting their service correctly, so some sites were down… but that’s for another day.

A few days later, I got the same monitoring alerts. My parents were house-sitting, so I had my mom reset the server. Again, most of my sites came up. Being in America’s Caribbean Paradise, I promptly forgot all about it, figuring things were fine.

The Realization

Boy, was I wrong. When I sat down at the computer on Sunday, I randomly checked my Rancher instance. Dead. My other clusters were running, but all of the clusters were reporting issues with etcd on one of the nodes. And, quite frankly, I was at a loss. Why?

While I have taken some Pluralsight courses on Kubernetes, I was a little overly dependent on the Rancher UI to manage Kubernetes. With it down, I was struggling a bit.
I did not take the time to find and read the etcd troubleshooting steps for RKE. Looking back, I could most likely have restored the etcd snapshots and been ok. Live and learn, as it were.

Long story short, my hacking attempts to get things running again pretty much killed my rancher cluster and made a mess of my internal cluster. Thankfully, the stage and production clusters were still running, but with a bad etcd node.

The Rebuild

At this point, I decided to throw in the towel and rebuild the Rancher cluster. So I deleted the existing rancher machines and provisioned a new cluster. Before I installed Rancher, though, I realized that my old clusters might still try to connect to the new Rancher instance and cause more issues. I took the time to remove the Rancher components from the stage and production clusters using the documented scripts.

When I did this with my internal tools cluster… well, it made me realize there was a lot of unnecessary junk on that cluster. It was only running ELK (which was not even fully configured) and my Unifi Controller, which I moved to my production box. So, since I was already rebuilding, I decided to decommission my internal tools box and rebuild that as well.

With two brand new clusters and two RKE clusters clean of Rancher components, I installed Rancher and got all the management running.

The Lesson

From this little meltdown and reconstruction I have learned a few important lessons.

Save your etcd snapshots off machine – RKE takes snapshots of your etcd regularly, and there is a process for restoring from a snapshot. Had I known that those snapshots were there, I would have tried this before killing my cluster.
etcd is disk-heavy and “touchy” when it comes to latency – My current setup has my hypervisor using my Synology as an iSCSI disk for all my VMs. With 12 VMs running as Kubernetes nodes, any disruption in either network or I/O lag can cause leader changes. This is minimal during normal operation, but performing deployments or updates can sometimes cause issues. I have a small upgrade planned for the Synology to add a 1 TB SSD Read/Write cache which hopefully improves the issue, but I may end up creating a new subnet for iSCSI traffic to alleviate network hiccups.
Slow and steady wins the race – In my haste to get everything working again, I tried some things that did more harm than good. Had I done a bit more research and found the appropriate articles earlier, I probably could have recovered without rebuilding.