Tag: hyperv

  • Speeding up Packer Hyper-V Provisioning

    I spent a considerable amount of time working through the provisioning scripts for my RKE2 nodes. Each node took between 25-30 minutes to provision. I felt like I could do better.

    Check the tires

    A quick evaluation of the process made me realize that most of the time was spent in the full install of Ubuntu. Using the hyperv-iso builder plugin from Packer, the machine would be provisioned from scratch. The installer took about 18-20 minutes to fully provision the VM. After that, the RKE2 install took about 1-2 minutes.

    Speaking with my colleague Justin, it occurred to me that I could probably get away with building out a base image using the hyperv-iso builder and then using the hyperv-vmcx builder to copy that base and create a new machine. In theory, that would cut the 18-20 minutes down to a copy job.

    Test Flight Alpha: Initial Cluster Provisioning

    A quick copy of my existing full build configuration and some judicious editing got me to the point where the hyperv-vmcx builder was running great and producing a VM. I had successfully cut my provisioning time down to under 5 minutes!

    I started editing my Rke2-Provisioning Powershell module to utilize the quick provisioning rather than the full provisioning. I then spun up a test cluster with four nodes (three servers and one agent) to make sure everything came up correctly, and within about 30 minutes, that four-node cluster was humming along: roughly a quarter of the time it had taken me before.

    Test Flight Beta: Node Replacement

    The next bit of testing was to ensure that as I ran the replacement script, new machines were provisioned correctly and old machines were torn down. This is where I ran into a snag, but it was a bit difficult to detect at first.

    During the replacement, the first new node would come up fine, and the old node was properly removed and deleted. So, after the first cycle, I had one new node and one old node removed. However, I was getting somewhat random problems with the second, third, and fourth cycles. Most of the time, the issue was that the ETCD server, during Rancher provisioning, was picking up an IP address from the DHCP range instead of using the fixed IP tied to the MAC address.

    Quick Explanation

    I use the Unifi Controller to run my home network (Unifi Security Gateway and several access points). Through the Unifi APIs, and a wrapper API I wrote, I am able to generate a valid Hyper-V MAC address and associate it with a fixed IP on the Unifi before the Hyper-V VM is ever configured. When I create a new machine, I assign it the MAC address that was generated, and my DHCP server always assigns it the same address. This IP is outside of the allocated DHCP range for normal clients. I am working on publishing the Unifi IP Wrapper in a public repository for consumption.

    Back to it…

    As I was saying, even though I was assigning a MAC address that had an associated fixed IP, VMs provisioned after the first one seemed to be failing to pick up that IP. What was different?

    Well, deleting a node returns its IP to the pool, so the process looks something like this:

    • First new node provisioned (IP .45 assigned)
    • First old node deleted (return IP .25 to the pool)
    • Second new node provisioned (IP .25 assigned)

    My assumption is that the Unifi does not like such a quick reassignment of a static IP to a new MAC Address. To test this, I modified the provisioner to first create ALL the new nodes before deleting nodes.

    In that instance, the nodes provisioned correctly using their newly assigned IPs. However, from a resource perspective, I hate the thought of having to run 2n nodes during provisioning when really all I need is n + 1.

    Test Flight Charlie: Changing IP assignments

    I modified my Unifi Wrapper API to cycle through the IP block I have assigned to my VMs instead of always using the lowest available IP. This allows me to go back to replacing nodes one by one, without worrying about IP/MAC address conflicts on the Unifi.
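
    The wrapper itself is not yet public, but the allocation change is simple enough to sketch. Below is a minimal PowerShell illustration of the idea, with a hypothetical IP block, in-use list, and persisted pointer; it walks the block round-robin from the last allocation instead of always taking the lowest free address:

    # Hypothetical sketch: advance through the block from the last allocation, wrapping around.
    $block     = 40..59 | ForEach-Object { "192.168.1.$_" }   # hypothetical VM IP block
    $inUse     = @("192.168.1.45", "192.168.1.46")            # fixed IPs already assigned
    $lastIndex = 5                                            # persisted between calls

    for ($i = 1; $i -le $block.Count; $i++) {
        $candidate = $block[($lastIndex + $i) % $block.Count]
        if ($inUse -notcontains $candidate) {
            $lastIndex = ($lastIndex + $i) % $block.Count
            Write-Output $candidate   # next IP to pair with the generated MAC address
            break
        }
    }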

    Landing it…

    With this improvement, I have fewer qualms about configuring provisioning to run in the evenings. Most likely, I will build the base Ubuntu image weekly or bi-weekly to ensure I have the latest updates. From there, I can use the replacement scripts to replace old nodes with new nodes in the cluster.

    I have not decided if I’m going to use a simple task scheduler in Windows, or use an Azure DevOps build agent on my provisioner… Given my recent miscue when installing the Azure DevOps Build Agent, I may opt for the former.

  • A big mistake and a bit of bad luck…

    In the Home Lab, things were going good. Perhaps a little too good. A bonehead mistake on my part and hardware failure combined to make another ridiculous weekend. I am beginning to think this blog is becoming “Matt messed up again.”

    Permissions are a dangerous thing

    I wanted to install the Azure DevOps agent on my hypervisor to allow me to automate and schedule provisioning of new machines. That would allow the provisioning to occur overnight and be overall less impactful. And it is always a bonus when things just take care of themselves.

    I installed the agent, but it would not start. It was complaining that it needed permissions to basically the entire drive where it was installed. Before really researching or thinking too much about it, I set about giving the service group access to the root of the drive.

    Now, in retrospect, I could have opened the share on my laptop (\\machinename\c$), right-clicked in the blank area, and chosen Properties from there, which would have gotten me into the security menu. I did not realize that at the time, and I used the Set-ACL Powershell command.

    What I did not realize is that Set-ACL performs a full replacement; it is not additive. So, while I thought I was adding permissions for a group, what I was really doing was REMOVING EVERYONE ELSE’S PRIVILEGES from the drive and replacing them with the group’s access. I realized my error when I simply had no access to the C: drive…
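
    For anyone about to make the same mistake, the additive pattern is to read the existing ACL with Get-Acl, append a rule, and write the combined result back. A minimal sketch; the path and group name below are placeholders:

    # Read the current ACL, add a rule for the service group, and write the combined ACL back.
    $path = "C:\"                                             # placeholder path
    $acl  = Get-Acl -Path $path
    $rule = [System.Security.AccessControl.FileSystemAccessRule]::new(
        "HOMELAB\HyperVServiceGroup",                         # placeholder group
        "Modify",
        "ContainerInherit,ObjectInherit",
        "None",
        "Allow")
    $acl.AddAccessRule($rule)
    Set-Acl -Path $path -AclObject $acl                       # original entries preserved, new rule added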

    I thought I got it back…

    After panicking a bit, I realized that what I had added wasn’t a user, but a group. I was able to get into the Group Policy editor for the server and add the Domain Admins group to that service group, which got my user account access. From there, I started rebuilding permissions on the C drive. Things were looking up.

    I was wrong…

    Then, I decided it would be a good idea to install Windows updates on the server and reboot. That was a huge mistake. The server got into a boot loop, where it would boot, attempt to do the updates, fail, and reboot, starting the process over again. It got worse…

    I stopped the server completely during one of the boot cycles for a hard shutdown/restart. When the server POSTed again, the message said, well, basically, that the cache module in the server was no longer there, so it shut off access to my logical drives… all of them.

    What does that mean, exactly? Long story short, my HP Gen8 server has a Smart Array that had a 1GB write cache card in it. That card is, as best I can tell, dead. However, there was a 512MB write cache card in my backup server. I tried a swap, and it was not recognized either. So, potentially, the cache port itself is dead. Either way, my drives were gone.

    Now what?

    I was pretty much out of options. While my data was safe and secure on the Synology, all of my VMs were down for the count. My only real option was to see if I could get the server to re-mount the drives without the cache and start rebuilding.

    I set up the drives in the same configuration I had previously. I have two 146GB drives and two 1TB drives, so I paired them up into two RAID 1 arrays. I caught a break: the machine recognized the previous drives and I did not lose any data. Now, the C drive was, well, toast: I believe my Set-ACL snafu had put that Windows install out of commission. But all of my VMs were on the D drive.

    So I re-installed Hyper-V Server 2019 on the server and got to work attempting to import and start VMs. Once I got connected to the server, I was able to re-import all of my Ubuntu VMs, which are my RKE2 nodes. They started up, and everything was good to go.

    There was a catch…

    Not everything came back. Specifically, ALL of my Windows VMs would not boot. They imported fine, but when it came time to boot, I got a “File not found” exception. I honestly have no idea why. I even had a backup of my Domain Controller, taken using Active Backup for Business on the Synology. I was able to restore it; however, it would not start, throwing the same error.

    My shot in the dark is that it comes down to the way the machines were built: I had provisioned the Windows machines manually, while the Ubuntu machines were built with Packer. I am wondering if the export/import process that is part of the Packer build may have moved some vital files that I lost, because those actions do not occur with a manual provision.

    At this point, I’m rebuilding my Windows machines (domain controllers and SQL servers). Once that is done, I will spend some time experimenting on a few test machines to make sure my backups are working… I suppose that’s what disaster recovery tests are for.

  • Home Lab – No More iSCSI – Backup Plans

    This post is part of a short series on migrating my home hypervisor off of iSCSI.

    It is worth noting (and quite ironic) that I went through a fire drill last week when I crashed my RKE clusters. That event gave me some fresh eyes on the data that is important to me.

    How much redundancy do I need?

    I have been relying primarily on the redundancy of the Synology for a bit too long. The volume has the capability to lose a disk and the Synology has been very stable, but that does not mean I should leave things as they are.

    There are many layers of redundancy, and for a home lab, it is about making decisions as to how much you are willing to pay and what you are willing to lose.

    No Copy, Onsite Copy, Offsite Copy

    I prefer not to spend a ton of time thinking about all of this, so I created three “buckets” for data priority:

    • No Backup: Synology redundancy is sufficient. If I lose it, I lose it.
    • Onsite Copy: Create another copy of the data somewhere at home. For this, I am going to attach a USB enclosure with a 2TB disk to my Synology and setup USB copy tasks on the Diskstation Manager (DSM).
    • Offsite Copy: Ship the data offsite for safety. I have been using Backblaze B2 buckets and the DSM’s Cloud Sync for personal documents for years, but the time has come to scale up a bit.

    It is worth noting that some things may be bucketed into both Onsite and Offsite, depending on how critical the data is. With the inventory I took over the last few weeks, I had some decisions to make.

    • Domain Controllers -> Onsite copy for sure. I am not yet sure if I want to add an Offsite copy, though: the domain is small enough that it could be rebuilt quickly, and there are really only a handful of machines on it. It just makes managing Windows Servers much easier.
    • Kubernetes NFS Data -> I use nfs-subdir-external-provisioner to provide persistent storage for my Kubernetes clusters. I will certainly do Onsite copies of this data, but for the most important pieces (such as this blog), I will also set up an offsite transfer.
    • SQL Server Data -> The SQL Server data is being stored on an iSCSI LUN, but I configured regular backups to go to a file share on the Synology. From there, Onsite backups should be sufficient.
    • Personal Stuff -> I have a lot of personal data (photos, financial data, etc.) stored on the Synology. That data is already encrypted and sent to Backblaze, but I may add another layer of redundancy and do an Onsite copy of them as well.

    Solutioning

    Honestly, I thought this would be harder, but Synology’s DSM and available packages really made it easy.

    1. VM Backups with Active Backup for Business: Installed Active Backup for Business, set up a connection to my Hyper-V server, and picked the machines I wanted to back up… It really was that simple. I should test a recovery, but on a test VM.
    2. Onsite Copies with USB Copy: I plugged an external HD into the Synology, which was immediately recognized, and a file share was created for it. I installed the USB Copy package and started configuring tasks. Basically, I can set up copy tasks to move data from the Synology to the USB drive as desired, with various settings such as incremental or versioned backups, triggers, and file filters.
    3. SQL Backups: I had to refresh my memory on scheduling SQL backups in SQL Server. Once I had that done, I just made sure they were written to a share on the Synology; from there, USB Copy took care of the rest (a hedged sketch of the backup command follows this list).
    4. Offsite: As I mentioned, I have had Cloud Sync running to Backblaze B2 buckets for a while. All I did was expand my copying. Cloud Sync offers some of the same flexibility as USB Copy, but having well-structured file shares for your data makes it easier to select and push data as you want it.
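
    For the SQL piece, the scheduling lives in SQL Server itself, but the backup is essentially a one-liner. A hedged sketch using the SqlServer PowerShell module; the instance, database, and share names are placeholders:

    # Back up a database straight to a share on the Synology (names are placeholders).
    Import-Module SqlServer
    Backup-SqlDatabase -ServerInstance "SQL01" `
        -Database "HomeLabDb" `
        -BackupFile "\\synology\sqlbackups\HomeLabDb.bak"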

    Results and What’s next

    My home lab refresh took me about 2 weeks, though only a few evenings across that time span. What I am left with is a much more performant server. While I still store data on the Synology via NFS and iSCSI, it’s only smaller pieces that are less reliant on fast access. The VM disks live on an SSD RAID array on the server, which gives me added stability and less thrashing of the Synology and its SSD cache. Nothing makes this more evident than the fact that my average daily SSD temperature has gone down 12°F over the last two weeks.

    What’s next? I will be taking a look at alternatives to Rancher Kubernetes Engine. I am hoping to find something a bit more stable and secure to manage.

  • Home Lab – No More iSCSI: Transfer, Shutdown, and Rebuild

    This post is part of a short series on migrating my home hypervisor off of iSCSI.

    • Home Lab – No More iSCSI: Prep and Planning
    • Home Lab – No More iSCSI: Transfer, Shutdown, and Rebuild (this post)
    • Home Lab – No More iSCSI: Backup plans (coming soon)

    Observations – Migrating Servers

    The focus of my hobby time over the last few days has been moving production assets to the temporary server. Most of it is fairly vanilla, but I have a few observations worth noting.

    • I forgot how easy it is to replicate and fail over VMs with Hyper-V. Sure, I could have tried a live migration, but creating a replica, shutting down the machine, and failing over was painless (a hedged sketch of the commands follows this list).
    • Do not forget to provision an external virtual switch on your Hyper-V servers. Yes, it sounds stupid, but I dove right into setting the temporary server up as a replication server and, upon trying to fail over, realized that the machine on the new server did not have a network connection.
    • I moved my Minio instance to the Synology: I originally had my Minio server running on an Ubuntu VM on my hypervisor, but decided moving the storage application closer to the storage medium was generally a good idea.
    • For my Kubernetes nodes, it was easier to provision new nodes on the temp server than it was to do a live migration or planned failover. I followed my normal process for provisioning new nodes and decommissioning old ones, and voilà, my production cluster was on the temporary server. I will simply reverse the process for the transfer back.
    • I am getting noticeably better performance on the temporary server, which has far less compute and RAM, but the VMs are on local disks. While the Synology has been rock solid, I think I have been throwing too much at it, and it can slow down from time to time.
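
    Here is a hedged sketch of that replica-and-failover flow, including the external switch I initially forgot on the temporary host. The server, VM, and adapter names are placeholders, and the replica server must already be configured to accept incoming replication:

    # On the temporary host: make sure an external virtual switch exists first.
    New-VMSwitch -Name "External" -NetAdapterName "Ethernet" -AllowManagementOS $true

    # On the original host: enable replication to the temporary server and seed it.
    Enable-VMReplication -VMName "DC01" -ReplicaServerName "TEMPHOST" `
        -ReplicaServerPort 80 -AuthenticationType Kerberos
    Start-VMInitialReplication -VMName "DC01"

    # When ready to move: shut down, prepare the planned failover, then fail over on the replica.
    Stop-VM -Name "DC01"
    Start-VMFailover -VMName "DC01" -Prepare     # run on the primary
    Start-VMFailover -VMName "DC01"              # run on the replica (TEMPHOST)
    Complete-VMFailover -VMName "DC01"           # run on the replica
    Start-VM -Name "DC01"                        # run on the replica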

    Let me be clear: My network storage is by no means bad, and it will be utilized. But storing the primary vhdx files for my VMs on the hypervisor provides much better performance.

    Shut It Down!

    After successfully moving my production assets over to the temporary server, it was time to shut it down. I shut down the VMs that remained on the original hypervisor and attempted to copy the VMs to a network drive on the Synology. That was a giant mistake.

    Those VM files already live on the Synology as part of an iSCSI volume. By trying to pull those files off of the iSCSI drive and copy them back to the Synology, I was basically doing a huge file copy (like, 600+ GB huge) without the systems really knowing it was a copy. As you can imagine, the performance was terrible.

    I found a 600GB SAS drive that I was able to plug into the old hypervisor, and I used that as a temporary location for the copy. Even with that change, the copy took a while (I think about 3 hours).

    Upgrade and Install

    I mounted my new SSDs (Samsung EVO 1TB) in some drive trays and plugged them into the server. A quick boot into the Smart Storage Administrator let me set up a new drive array. While I thought about using RAID 0 and giving myself 2TB of space, I went with the safe option and used RAID 1.

    Having already configured the temporary server with Hyper-V Server 2019, the process of doing it again was, well, pretty standard. I booted to the USB stick I created earlier for Hyper-V 2019 and went through the paces. My domain controller was still live (thanks, temporary server!), so I was able to add the machine to the domain and then perform all of the management via the Server Manager tool on my laptop.

    Moving back in

    I have the server back up with a nice new 1TB drive for my VMs. That’s a far cry from the 4 TB of storage I had allocated on the SAN target on the Synology, so I have to be more careful with my storage.

    Now, if I set a Hyper-V disk to, say, 100GB, Hyper-V does not actually provision a file that is 100GB: the dynamically expanding vhdx file grows as data is written. But that does not mean I should just mindlessly provision disk space on my VMs.
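
    In PowerShell terms, the dynamically expanding behavior is easy to see in the gap between the provisioned size and the space actually consumed on disk. A quick illustration; the path is a placeholder:

    # Create a dynamically expanding 100 GB disk; the .vhdx starts out only a few MB on disk.
    New-VHD -Path "D:\VMs\test.vhdx" -SizeBytes 100GB -Dynamic

    # Compare provisioned capacity to the actual file size.
    Get-VHD -Path "D:\VMs\test.vhdx" |
        Select-Object VhdType,
            @{ n = 'ProvisionedGB'; e = { $_.Size / 1GB } },
            @{ n = 'OnDiskGB';      e = { [math]::Round($_.FileSize / 1GB, 2) } }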

    For my Kubernetes nodes, looking at my usage, 50GB is more than enough for those disks. All persistent storage for those workloads is handled by an NFS provisioner which configures shares on the Synology. As for the domain controllers, I am able to run with minimal storage because, well, it is a tiny domain.

    The problem children are Minio and my SQL Server Databases. Minio I covered above, moving it to the Synology directly. SQL Server, however, is a different animal.

    Why be you, when you can be new!

    I already had my production SQL instance running on another server. Rather than move it around and then mess with storage, I felt the safer solution was to provision a new SQL Server instance and migrate my databases. I only have 4 databases on that server, so moving databases is not a monumental task.

    A new server affords me two things:

    1. Latest and greatest version of Windows and SQL server.
    2. Minimal storage on the hypervisor disk itself. I provisioned only about 80 GB for the main virtual disk. This worked fine, except that I ran into a storage compatibility issue that needed a small workaround.

    SMB 3.0, but only certain ones

    My original intent was to create a virtual disk on a network share on the Synology, and mount that disk to the new SQL Server VM. That way, to the SQL Server, the storage is local, but the SQL data would be on the Synology.

    Hyper-V did not like this. I was able to create a vhdx file on a share just fine, but when I tried to add it to a VM using Add-VMHardDiskDrive, I got the following error:

    Remote SMB share does not support resiliency.
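
    For reference, this is roughly the sequence that triggers it; the share path, VM name, and size are placeholders:

    # Creating the disk on the Synology share works fine...
    New-VHD -Path "\\synology\vmstore\sqldata.vhdx" -SizeBytes 300GB -Dynamic

    # ...but attaching it to the VM fails with the resiliency error.
    Add-VMHardDiskDrive -VMName "SQL01" -Path "\\synology\vmstore\sqldata.vhdx"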

    A quick Google search turned up this Spiceworks question, where the only answer suggests that the Synology’s SMB 3.0 implementation is Linux-based, while Hyper-V expects the Windows-based implementation, and that some features are missing from the Linux one.

    While I am usually not one to take one answer and call it fact, I also didn’t want to spend too much time getting into the nitty gritty. I knew it was a possibility that this wasn’t going to work, and, in the interest of time, I went back to my old pal iSCSI. I provisioned a small iSCSI LUN (300 GB) and mounted it directly in the virtual machine. So now my SQL Server has a data drive that uses the Synology for storage.
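
    Connecting that LUN from inside the guest only takes a few commands. A hedged sketch; the Synology address, drive letter, and volume label are placeholders:

    # Inside the SQL Server VM: start the initiator, connect to the Synology target, and bring the disk online.
    Start-Service MSiSCSI
    New-IscsiTargetPortal -TargetPortalAddress "192.168.1.10"   # placeholder Synology IP
    Get-IscsiTarget | Connect-IscsiTarget -IsPersistent $true

    # Initialize and format the new disk as the SQL data drive.
    Get-Disk | Where-Object PartitionStyle -eq 'RAW' |
        Initialize-Disk -PartitionStyle GPT -PassThru |
        New-Partition -DriveLetter E -UseMaximumSize |
        Format-Volume -FileSystem NTFS -NewFileSystemLabel "SQLData"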

    And we’re back!

    Moves like this provide an opportunity for consolidation, updates, and improvements, and I seized some of those opportunities:

    • I provisioned new Active Directory Domain controllers on updated operating systems, switched over, and deleted the old one.
    • I moved Minio to my Synology, and moved Hashicorp Vault to my Kubernetes cluster (using Minio as a storage backend). This removed 2 virtual machines from the hypervisor.
    • I provisioned a new SQL Server and migrated my production databases to it.
    • Compared to the rat’s nest of network configuration I had before, the networking on the hypervisor is much simpler (a hedged sketch follows this list):
      • 1 standard NIC with a static IP so that I can get in and out of the hypervisor itself.
      • 1 teamed NIC with a static IP attached to the Hyper-V Virtual Switch.
    • For the moment, I did not bring back my “non-production” cluster. It was only running test/stage environments of some of my home projects. For the time being, I will most likely move these workloads to my internal cluster.
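
    A hedged sketch of that simplified layout; adapter names and addresses are placeholders, and the teaming approach shown is one of several that would work:

    # Management: one standard adapter with a static IP for reaching the hypervisor itself.
    New-NetIPAddress -InterfaceAlias "Ethernet 1" -IPAddress 192.168.1.20 `
        -PrefixLength 24 -DefaultGateway 192.168.1.1

    # VM traffic: team two adapters, bind the virtual switch to the team, and give the
    # host-facing vEthernet adapter its own static IP.
    New-NetLbfoTeam -Name "VMTeam" -TeamMembers "Ethernet 2", "Ethernet 3" `
        -TeamingMode SwitchIndependent
    New-VMSwitch -Name "External" -NetAdapterName "VMTeam" -AllowManagementOS $true
    New-NetIPAddress -InterfaceAlias "vEthernet (External)" -IPAddress 192.168.1.21 -PrefixLength 24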

    I was able to shut down the temporary server, meaning, at least in my mind, I am back to where I was. However, now that I have things on the hypervisor itself, my next step is to ensure I am appropriately backing things up. I will finish this series with a post on my backup configuration.

  • Home Lab – No More iSCSI, Prep and Planning

    This post is part of a short series on migrating my home hypervisor off of iSCSI.

    • Home Lab – No More iSCSI: Prep and Planning (this post)
    • Home Lab – No More iSCSI: Shutdown and Provisioning (coming soon)
    • Home Lab – No More iSCSI: Backup plans (coming soon)

    I realized today that my home lab setup, by technology standards, is old. Sure, my overall setup has gotten some incremental upgrades, including an SSD cache for the Synology, a new Unifi Security Gateway, and some other new accessories. The base Hyper-V server, however, has remained untouched, outside of the requisite updates.

    Why no upgrades? Well, first, it is my home lab. I am a software engineer by trade, and the lab is meant for me to experiment not with operating systems or network configurations, but with application development and deployment procedures, tools, and techniques. And for that, it has worked extremely well over the last five years.

    That said, my initial setup had some flaws, and I am seeing some stability issues that I would like to correct now, before I wake up one morning with nothing working. With that in mind, I have come up with a plan.

    Setup a Temporary Server

    I am quite sure you’re thinking to yourself “What do you mean, temporary server?” Sure, I could shut everything down, copy it off the server onto the Synology, and then re-install the OS. And while this is a home lab and supposedly “throw away,” there are some things running that I consider production. For example:

    1. Unifi Controller – I do not yet have the luxury of running a Unifi Dream Machine Pro, but it is on my wish list. In the meantime, I run an instance of the Unifi controller in my “production” cluster.
    2. Home Assistant – While I am still rocking an ISY994i as an Insteon interface, I moved most of my home automation to a Home Assistant instance in the cluster.
    3. Node Red – I have been using Node-Red with the Home Assistant palette to script my automations.
    4. Windows Domain Controller – I am still rocking a Windows Domain at home. It is the easiest way to manage the hypervisor, as I am using the “headless” version of Windows Hyper-V Server 2019.
    5. Mattgerega.com – Yup, this site runs in my cluster.

    Thankfully, my colleague Justin happened to have an old server lying around that he has not powered on in a while, and has graciously allowed me to borrow it so that I can transfer my production assets over and keep things going.

    We’re gonna change the way we run…

    My initial setup put the bulk of my storage on the Synology via iSCSI, so much so that I had to put an SSD cache in the Synology just to keep up. At the beginning, that made sense. I was running mostly Windows VMs, and my vital data was stored on the VM itself. I did not have a suitable backup plan, so having all that data on the Synology meant I had at least some drive redundancy.

    Times have changed. My primary mechanism for running applications is via Kubernetes clusters. Those nodes typically contain no data at all, as I use an NFS provisioner and storage class to create persistent storage volumes via NFS on the Synology. And while I still have a few VMs with data on them that will need to be backed up, I really want to get away from iSCSI.

    The server I have, an old HP Proliant DL380 Gen8, has 8 2.5″ drive bays. My original impression was that I needed to buy SAS drives for it, but Justin said he has had luck running SATA SSDs in his.

    Requirements

    Even with a home lab move, it is always good to have some clear requirements.

    1. Upgrade my Hyper-V Server to 2019.
    2. No more iSCSI disks on the server: Rely on NFS and proper backup procedures.
    3. Fix my networking: I had originally teamed 4 of the 6 NIC ports on the server together. While I may still do that, I need to clean up that implementation, as I have learned a lot in the last few years.
    4. Keep it simple.

    Could I explore VMware or Proxmox? I could, but, frankly, I want to learn more about Kubernetes and how I can use it in application architecture to improve delivery speed and reliability. I do not really care what the virtualization technology is, as long as I can run Kubernetes. Additionally, I have a LOT of automation around building Hyper-V machines, and I do not want to rebuild it.

    Since this is my home lab, I do not have a lot of time to burn on it, hence the KISS method. Switching virtualization stacks means more time converting images. Going from Hyper-V to Hyper-V means, for production VMs, I can set up replication and just move them to the temp server and back again.

    Prior Proper Planning

    With my requirements set, I created a plan:

    • Configure the temporary server and get production servers moved. This includes consolidating “production” databases into a single DB server, which is a matter of moving one or two DBs.
    • Shut down all other VMs and copy them over to a fileshare on the Synology.
    • Fresh installation of Windows Server 2019 Hyper-V.
    • Add 2 1TB SSDs into the hypervisor in a RAID 1 array.
    • Replicate the VMs from the temporary server to the new hypervisor.
    • Copy the rest of the VMs to the new server and start them up.
    • Create some backup procedures for data stored on the hypervisor (i.e., if it is on a VM’s drive, it needs to be put on the Synology somewhere).
    • Delete my iSCSI LUN from the Synology.

    So, what’s done?

    I am, quite literally, still on step one. I got the temporary server up and running with replication, and I am starting to move production images. Once my temporary production environment is running, I will get started on the new server. I will post some highlights of that process in the days to come.

  • Managing Hyper-V VM Startup Times with .Net Minimal APIs

    In a previous post, I had a to-do list that included managing my Hyper-V VMs so that they did not all start at once. I realized today that I never explained what I was able to do or post the code for my solution. So today, you get both.

    And, for the impatient among you, my repository for this API is on GitHub.

    Managing VMs with Powershell

    My plan of attack was something like this:

    • Organize Virtual Machines into “Startup Groups” which can be used to set Automatic Start Delays
    • Using the group and an offset within the group, calculate a start delay and set that value on the VM.

    Powershell’s Hyper-V module is a powerful and pretty easy way to interact with the Hyper-V services on a particular machine. The module itself had all of the functionality I needed to implement my plan, including the ability to modify the Notes of a VM. I am storing JSON in the Notes field to denote the start group and offset within the group, and Powershell has the built-in JSON conversion necessary to make quick work of retrieving this data from the VM’s Notes field and converting it into an object.
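
    As an illustration of the idea (the VM name and values are placeholders), writing and reading those settings looks something like this:

    # Store the start group and offset as JSON in the VM's Notes field...
    $settings = @{ startGroup = 1; delayOffset = 120 }
    Set-VM -Name "DC01" -Notes ($settings | ConvertTo-Json -Compress)

    # ...and read it back as an object.
    $config = (Get-VM -Name "DC01").Notes | ConvertFrom-Json
    $config.startGroup
    $config.delayOffset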

    Creating the API

    For the API, this seemed an appropriate time to try out the Minimal APIs in ASP.NET Core 6. Minimal APIs are Microsoft’s approach to building APIs fast, without all the boilerplate code that sometimes comes with .Net projects. This project has only three endpoints (and maybe some test/debug ones) and a few services, so it seemed a good candidate.

    Without getting into the details, I was pleased with the approach, although scaling this type of approach requires implementing some standards that, in the end, would have you re-designing the notion of Controllers as it exists in a typical API project. So, while it is great for small, agile APIs, if you expect your API to grow, stick with the Controller-structured APIs.

    Hosting the API

    The server I am using as a Hyper-V hypervisor is running a version of Windows Hyper-V Server, which means it has a limited feature set that does not include Internet Information Services (IIS). Even if it did, I want to keep the hypervisor focused on running VMs. However, in order to manage the VMs, the easiest path is to put the API on the hypervisor.

    With that in mind, I went about configuring this API to run within a Windows Service. That allowed me to ensure the API was running through standard service management (instead of as a console application) but still avoid the need for a heavy IIS install.

    I installed the service using one of the methods described in How to: Install and uninstall Windows services.  However, for proper access, the service needs to run as a user with Powershell access and rights to modify the VMs.  

    I created a new domain user and granted it the ability to perform a service log on via local security policy.  See Enable Service Logon for details.
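
    Put together, the install looked roughly like this. The service name, path, and account are placeholders:

    # Register the published API executable as a Windows service running under the new domain account.
    $cred = Get-Credential "HOMELAB\svc-vmapi"   # account with Hyper-V rights and "Log on as a service"
    New-Service -Name "HyperVStartupApi" `
        -BinaryPathName "C:\Services\HyperVStartupApi\HyperVStartupApi.exe" `
        -DisplayName "Hyper-V Startup Delay API" `
        -StartupType Automatic `
        -Credential $cred
    Start-Service -Name "HyperVStartupApi"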

    Prepping the VMs

    The API does not, at the moment, pre-populate the Notes field with JSON settings. So I went through my VM List and added the following JSON snippet:

    {
        "startGroup": 0,
        "delayOffset": 0
    }

    I chose a startGroup value based on the VM’s importance (Domain Controllers first, then data servers, then Kubernetes nodes, etc), and then used the delayOffset to further stagger the start times.

    All this for an API call

    Once each VM had the initialization data, I made a call to /vm/refreshdelay and voilà! Each VM’s AutomaticStartDelay gets set based on its startGroup and delayOffset.

    There is more to do (see my to-do list in my previous post for other next steps), but since I do not typically provision many machines, this one usually falls to a lower spot on the priority list. So, well, I apologize in advance if you do not see more work on this for another six months.

  • An Impromptu Home Lab Disaster Recovery Session

    It has been a rough 90 days for my home lab. We have had a few unexpected power outages which took everything down, and, for those unexpected outages, things came back up. Over the weekend, I was doing some electrical work outside, wiring up outlets and lighting. Being safety conscious, I killed the power to the breaker I was tying into inside, not realizing it was the same breaker that the server was on. My internal dialog went something like this:

    • “Turn off breaker Basement 2”
    • ** clicks breaker **
    • ** Hears abrupt stop of server fans **
    • Expletive….

    When trying to recover from that last sequence, I ran into a number of issues.

    • I’m CRUSHING that server when it comes back up: having 20 VMs attempting to start simultaneously is causing a lot of resource contention.
    • I had to run fsck manually on a few machines to get them back up and running.
    • Even after getting the machines running, ETCD was broken on two of my four clusters.

    Fixing my Resource Contention Issues

    I should have done this from the start, but all of my VMs had their Automatic Start Action set to Restart if previously running. That’s great in theory, but, in practice, starting 20 or so VMs on the same hypervisor is not recommended.

    Part of Hyper-V’s Automatic Start Action panel is an Automatic Startup Delay. In Powershell, it is the AutomaticStartDelay property on the VirtualMachine object (what’s returned from a Get-VM call). My ultimate goal is to set that property to stagger start my VMs. And I could have manually done that and been done in a few minutes. But, how do I manage that when I spin up new machines? And can I store some information on the VM to reset that value as I play around with how long each VM needs to start up?

    Groups and Offsets

    All of my VMs can be grouped based on importance. And it would have been easy enough to start 2-3 VMs in group 1, wait a few minutes, then do group 2, etc. But I wanted to be able to assign offsets within the groups to better address contention. In an ideal world, the machines would come up sequentially to a point, and then 2 or 3 at a time after the main VMs have started. So I created a very simple JSON object to track this:

    {
      "startGroup": 1,
      "delayOffset": 120
    }

    There is a free-text Notes field on the VirtualMachine object, so I used that to set a startGroup and delayOffset for each of my VMs. Using a string of Powershell commands, I was able to get a tabular output of my custom properties:

    get-vm | Select Name, State, AutomaticStartDelay, @{n='ASDMin';e={$_.AutomaticStartDelay / 60}}, @{n='startGroup';e= {(ConvertFrom-Json $_.Notes).startGroup}}, @{n='delayOffset';e= {(ConvertFrom-Json $_.Notes).delayOffset}} | Sort-Object AutomaticStartDelay | format-table
    • Get-VM – Get a list of all the VMs on the machine
    • Select Name, ... – The Select statement (alias to Select-Object) pulls values from the object. There are two calculated properties that pull values from the Notes field as a JSON object.
    • Sort-Object – Sort the list by the AutomaticStartDelay property
    • Format-Table – Format the response as a table.

    At that point, the VM had its startGroup and delayOffset, but how can I set the AutomaticStartDelay based on those? More Powershell!!

    get-vm | Select Name, State, AutomaticStartDelay, @{n='startGroup';e= {(ConvertFrom-Json $_.Notes).startGroup}}, @{n='delayOffset';e= {(ConvertFrom-Json $_.Notes).delayOffset}} |? {$_.startGroup -gt 0} | % { set-vm -name $_.name -AutomaticStartDelay ((($_.startGroup - 1) * 480) + $_.delayOffset) }

    The first two commands are the same as the above, but after that:

    • ? {$_.startGroup -gt 0} – Use Where-Object (? alias) to select VMs with a startGroup value
    • % { set-vm -name ... } – ForEach-Object (% alias): for each VM in that group, set the AutomaticStartDelay.

    In the command above, I hard-coded the AutomaticStartDelay to the following formula:

    ((startGroup - 1) * 480) + delayOffset

    With this formula, the server will wait 4 minutes between groups, and add a delay within the group should I choose. As an example, my domain controllers carry the following values:

    # Primary DC
    {
      "startGroup": 1,
      "delayOffset": 0
    }
    # Secondary DC
    {
      "startGroup": 1,
      "delayOffset": 120
    }

    The calculated delay for my domain controllers is 0 and 120 seconds, respectively. The next group won’t start until 480 seconds (4 minutes), which gives my DCs 4 minutes on their own to boot up.

    Now, there will most likely be some tuning involved in this process, which is where my complexity becomes helpful: say I can boot 2-3 machines every 3 minutes… I can just re-run the population command with a new formula.

    Did I over-engineer this? Probably. But the point is, use AutomaticStartDelay if you are running a lot of VMs on a Hypervisor.

    Restoring ETCD

    Call it fate, but that last power outage ended up causing ETCD issues. I had to run fsck manually on a few of my servers to repair the file system, and even when the servers were up and running, two of my clusters had problems with their ETCD services.

    In the past, my solution to this was “nuke the cluster and rebuild,” but I am trying to be a better Kubernetes administrator, so this time, I took the opportunity to actually read the troubleshooting documentation that Rancher provides.

    Unfortunately, I could not get past “step one”: ETCD was not running. Knowing that it was most likely a corruption of some kind and that I had a relatively up-to-date ETCD snapshot, I did not burn too much time before going to the restore.

    rke etcd snapshot-restore --name snapshot_name_etcd --config cluster.yml

    That command worked like a charm, and my clusters were back up and running.

    To Do List

    I have a few things on my to-do list following this adventure:

    1. Move ETCD snapshots off of the VMs and onto the SAN. I would have had a lot of trouble bringing ETCD back up if those snapshots were not available because the node they were on went down.
    2. Update my Packer provisioning scripts to include writing the startup configuration to the VM notes.
    3. Build an API wrapper that I can run on the server to manage the notes field.

    I am somewhat interested in testing how the AutomaticStartDelay changes will affect my server boot time. However, I am planning on doing that on a weekend morning during a planned maintenance, not on a random Thursday afternoon.