An Impromptu Home Lab Disaster Recovery Session

It has been a rough 90 days for my home lab. We have had a few unexpected power outages which took everything down. And, for the unexpected outages, things came back up. Over the weekend, I was doing some electrical work outside, wiring up outlets and lighting. Being safety conscious, I killed the power to the breaker I was tying into inside, not realizing it was the same breaker that the server was on. My internal dialog went something like this:

“Turn off breaker Basement 2”
** clicks breaker **
** Hears abrupt stop of server fans **
Expletive….

When trying to recover from that last sequence, I ran into a number of issues.

I’m CRUSHING that server when it comes back up: having 20 VMs attempting to start simultaneously is causing a lot of resource contention.
I had to run fsck manually on a few machines to get them back up and running.
Even after getting the machines running, ETCD was broken on two of my four clusters.

Fixing my Resource Contention Issues

I should have done this from the start, but all of my VMs had their Automatic Start Action set to Restart if previously running. That’s great in theory, but, in practice, starting 20 or so VMs on the same hypervisor is not recommended.

Part of Hyper-V’s Automatic Start Action panel is an Automatic Startup Delay. In PowerShell, it is the AutomaticStartDelay property on the VirtualMachine object (what’s returned from a Get-VM call). My ultimate goal is to set that property to stagger the start of my VMs. I could have set it manually and been done in a few minutes. But how do I manage that when I spin up new machines? And can I store some information on the VM to reset that value as I play around with how long each VM needs to start up?

Groups and Offsets

All of my VMs can be grouped based on importance. And it would have been easy enough to start 2-3 VMs in group 1, wait a few minutes, then do group 2, etc. But I wanted to be able to assign offsets within the groups to better address contention. In an ideal world, the machines would come up sequentially to a point, and then 2 or 3 at a time after the main VMs have started. So I created a very simple JSON object to track this:

{
  "startGroup": 1,
  "delayOffset": 120
}

There is a free-text Notes field on the VirtualMachine object, so I used that to set a startGroup and delayOffset for each of my VMs. Using a string of PowerShell commands, I was able to get a tabular output of my custom properties:

get-vm | Select Name, State, AutomaticStartDelay, @{n='ASDMin';e={$_.AutomaticStartDelay / 60}}, @{n='startGroup';e= {(ConvertFrom-Json $_.Notes).startGroup}}, @{n='delayOffset';e= {(ConvertFrom-Json $_.Notes).delayOffset}} | Sort-Object AutomaticStartDelay | format-table

Get-VM – Get a list of all the VMs on the machine
Select Name, ... – The Select statement (alias to Select-Object) pulls values from the object. Two of the calculated properties pull values from the Notes field by parsing it as a JSON object.
Sort-Object – Sort the list by the AutomaticStartDelay property
Format-Table – Format the response as a table.
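
For completeness, here is a minimal sketch of how the JSON could be written into the Notes field in the first place. The helper name New-StartNotes and the VM name dc02 are hypothetical, made up for illustration; Set-VM and its -Notes parameter are the standard Hyper-V cmdlet. Note that -Notes replaces the whole field, so this assumes the notes hold nothing but the JSON.

```powershell
# Hypothetical helper (not from the original post): builds the compact JSON
# string that gets stored in a VM's Notes field.
function New-StartNotes {
    param(
        [Parameter(Mandatory)][int]$StartGroup,
        [int]$DelayOffset = 0
    )
    # [ordered] keeps startGroup ahead of delayOffset in the output.
    [ordered]@{ startGroup = $StartGroup; delayOffset = $DelayOffset } |
        ConvertTo-Json -Compress
}

# Usage on the Hyper-V host ('dc02' is a placeholder VM name):
# Set-VM -Name 'dc02' -Notes (New-StartNotes -StartGroup 1 -DelayOffset 120)
```

Keeping the JSON generation in one small function means the same call could later be reused from a provisioning script.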

At that point, each VM had its startGroup and delayOffset, but how can I set the AutomaticStartDelay based on those? More PowerShell!!

get-vm | Select Name, State, AutomaticStartDelay, @{n='startGroup';e= {(ConvertFrom-Json $_.Notes).startGroup}}, @{n='delayOffset';e= {(ConvertFrom-Json $_.Notes).delayOffset}} |? {$_.startGroup -gt 0} | % { set-vm -name $_.name -AutomaticStartDelay ((($_.startGroup - 1) * 480) + $_.delayOffset) }

The first two commands are the same as the above, but after that:

? {$_.startGroup -gt 0} – Use Where-Object (? alias) to select VMs with a startGroup value
% { set-vm -name ... } – ForEach-Object (% alias) in that group, set the AutomaticStartDelay.

In the command above, I hard-coded the AutomaticStartDelay to the following formula:

((startGroup - 1) * 480) + delayOffset

With this formula, the server will wait 8 minutes (480 seconds) between groups, and add a delay within the group should I choose. As an example, my domain controllers carry the following values:

# Primary DC
{
  "startGroup": 1,
  "delayOffset": 0
}
# Secondary DC
{
  "startGroup": 1,
  "delayOffset": 120
}

The calculated delay for my domain controllers is 0 and 120 seconds, respectively. The next group won’t start until 480 seconds (8 minutes), which gives my DCs 8 minutes on their own to boot up.
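
To make the arithmetic concrete, the formula can be wrapped in a small function and run against a few values. This is only an illustrative sketch: the function name Get-StartDelay is hypothetical, and the group 2 and group 3 inputs are made-up examples rather than real VMs from this lab.

```powershell
# Hypothetical helper mirroring the formula ((startGroup - 1) * 480) + delayOffset.
function Get-StartDelay {
    param(
        [Parameter(Mandatory)][int]$StartGroup,
        [int]$DelayOffset = 0
    )
    (($StartGroup - 1) * 480) + $DelayOffset
}

Get-StartDelay -StartGroup 1 -DelayOffset 0     # primary DC: 0
Get-StartDelay -StartGroup 1 -DelayOffset 120   # secondary DC: 120
Get-StartDelay -StartGroup 2 -DelayOffset 0     # first VM of group 2: 480
Get-StartDelay -StartGroup 3 -DelayOffset 60    # example group 3 VM: 1020
```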

Now, there will most likely be some tuning involved in this process, which is where my complexity becomes helpful: say I can boot 2-3 machines every 3 minutes… I can just re-run the population command with a new formula.
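
One way to make that re-run less error-prone is to turn the group spacing into a parameter and to skip VMs whose Notes are empty or not valid JSON. The sketch below is an assumption-laden variant of the one-liner, not the command used above: Get-StartPlan and -GroupSpacing are hypothetical names, and it only needs objects with Name and Notes properties, which is what Get-VM returns.

```powershell
# Hypothetical helper: turns VM objects (anything with Name and Notes) into
# a list of Name/Delay pairs using a tunable spacing between groups.
function Get-StartPlan {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]$VM,
        [int]$GroupSpacing = 480
    )
    process {
        # Parse the Notes field; leave $cfg as $null if it is empty or not JSON.
        $cfg = $null
        if (-not [string]::IsNullOrWhiteSpace($VM.Notes)) {
            try { $cfg = $VM.Notes | ConvertFrom-Json -ErrorAction Stop } catch { $cfg = $null }
        }
        # Only VMs with a positive startGroup get a calculated delay.
        if ($null -ne $cfg -and $cfg.startGroup -gt 0) {
            [pscustomobject]@{
                Name  = $VM.Name
                Delay = [int]((($cfg.startGroup - 1) * $GroupSpacing) + [int]$cfg.delayOffset)
            }
        }
    }
}

# On the Hyper-V host: preview a 3-minute spacing, then drop -WhatIf to apply.
# Get-VM | Get-StartPlan -GroupSpacing 180 |
#     ForEach-Object { Set-VM -Name $_.Name -AutomaticStartDelay $_.Delay -WhatIf }
```

Separating the calculation from the Set-VM call makes it possible to inspect the planned delays as a table before anything on the host changes.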

Did I over-engineer this? Probably. But the point is, use AutomaticStartDelay if you are running a lot of VMs on a single hypervisor.

Restoring ETCD

Call it fate, but that last power outage ended up causing ETCD issues in two of my servers. I had to run fsck manually on a few of my servers to repair the file system. Even when the servers were up and running, two of my clusters had problems with their ETCD services.

In the past, my solution to this was “nuke the cluster and rebuild,” but I am trying to be a better Kubernetes administrator, so this time, I took the opportunity to actually read the troubleshooting documentation that Rancher provides.

Unfortunately, I could not get past “step one:” ETCD was not running. Knowing that it was most likely a corruption of some kind and that I had a relatively up-to-date ETCD snapshot, I did not burn too much time before going to the restore.

rke etcd snapshot-restore --name snapshot_name_etcd --config cluster.yml

That command worked like a charm, and my clusters were back up and running.

To Do List

I have a few things on my to-do list following this adventure:

Move ETCD snapshots off of the VMs and onto the SAN. I would have had a lot of trouble bringing ETCD back up if those snapshots were not available because the node they were on went down.
Update my Packer provisioning scripts to include writing the startup configuration to the VM notes.
Build an API wrapper that I can run on the server to manage the notes field.

I am somewhat interested in testing how the AutomaticStartDelay changes will affect my server boot time. However, I am planning on doing that on a weekend morning during a planned maintenance, not on a random Thursday afternoon.