• Building Software Longevity

The “Ship of Theseus” thought experiment is an interesting way to start fights with historians, but in software, replacing old parts with new ones is exactly how longevity is achieved. Designing software so that every piece can be replaced is vital to building software for the future.

    The Question

    The Wikipedia article presents the experiment as follows:

    According to legend, Theseus, the mythical Greek founder-king of Athens, rescued the children of Athens from King Minos after slaying the minotaur and then escaped onto a ship going to Delos. Each year, the Athenians commemorated this legend by taking the ship on a pilgrimage to Delos to honor Apollo. A question was raised by ancient philosophers: After several centuries of maintenance, if each individual part of the Ship of Theseus was replaced, one at a time, was it still the same ship?

    Ship of Theseus – Wikipedia

    The Software Equivalent

Consider Microsoft Word. Released in 1983, Word is approaching its 40th anniversary. And, while I do not have visibility into its internal workings, I am willing to bet that most, if not all, of the 1983 code has since been replaced by updated modules. So, while it is still called Word, its parts are much newer than the original 1983 iteration. I am sure that if I sat here long enough, I could identify other applications with similar histories. The branding does not change, the core functionality does not change; only the code changes.

    Like the wood on the Ship of Theseus, software rots. And it rots fast. Frameworks and languages evolve quickly to take advantage of hardware updates, and the software that uses those must do the same.

    Design for Replacement

We use the term technical debt to categorize the conscious choices we make to prioritize time to market over perfect code. It is worthwhile to consider that software has a “half life” or “depreciation factor” as well: while your code may work today, chances are that, without appropriate maintenance, it will rot into something that can no longer serve the needs of your customers.

If I had a “one size fits all” solution to software rot, I would probably be a millionaire. The truth is, product managers, engineers, and architects must be aware of software rot. Not only must we invest the time to fix the rot, but we must design our products so that every piece of the application can be replaced, even as the software still stands.

    This is especially critical in Software as a Service (SaaS) offerings, where we do not have the luxury of large downtimes for upgrades or replacements. To our customers, we should operate continuously, with upgrades happening almost invisibly. This requires the foresight to build replaceable components and the commitment to fixing and replacing components regularly. If you cannot replace components of your software, there will come a day where your software will no longer function for your customers.

  • Speeding up Packer Hyper-V Provisioning

I spent a considerable amount of time working through the provisioning scripts for my RKE2 nodes. Each node took between 25 and 30 minutes to provision. I felt like I could do better.

    Check the tires

A quick evaluation of the process made me realize that most of the time was spent in the full install of Ubuntu. Using Packer's hyperv-iso builder, each machine was provisioned from scratch. The installer took about 18 to 20 minutes to provision the VM fully. After that, the RKE2 install took about 1 to 2 minutes.

Speaking with my colleague Justin, it occurred to me that I could probably get away with building a base image using the hyperv-iso builder and then using the hyperv-vmcx builder to copy that base and create a new machine. In theory, that would cut the 18 to 20 minutes down to a copy job.
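As a sketch of the idea, the second stage might look something like this in a Packer HCL template. The paths, names, and credentials here are placeholders for illustration, not my actual configuration:

```hcl
# Hypothetical hyperv-vmcx source that clones a pre-built base image
# instead of running the full Ubuntu install from the ISO.
# clone_from_vmcx_path points at the output of the hyperv-iso base build.
source "hyperv-vmcx" "rke2-node" {
  clone_from_vmcx_path = "D:/packer-output/ubuntu-base"
  vm_name              = "rke2-node"
  switch_name          = "External"
  communicator         = "ssh"
  ssh_username         = "ubuntu"
  ssh_password         = "<placeholder>"
}

build {
  sources = ["source.hyperv-vmcx.rke2-node"]

  # Only the fast part remains: install RKE2 on the cloned VM.
  provisioner "shell" {
    script = "./scripts/install-rke2.sh"
  }
}
```

The expensive OS install happens once, in the base build; every node after that is a copy plus a 1-2 minute RKE2 install.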

    Test Flight Alpha: Initial Cluster Provisioning

A quick copy of my existing full provisioner and some judicious editing got me to the point where the hyperv-vmcx builder was running great and producing a VM. I had successfully cut my provisioning time down to under 5 minutes!

I started editing my Rke2-Provisioning Powershell module to use the quick provisioning rather than the full provisioning. Then I spun up a test cluster with four nodes (three servers and one agent) to make sure everything came up correctly. Within about 30 minutes, that four-node cluster was humming along, in a quarter of the time it had taken me before.

    Test Flight Beta: Node Replacement

    The next bit of testing was to ensure that as I ran the replacement script, new machines were provisioned correctly and old machines were torn down. This is where I ran into a snag, but it was a bit difficult to detect at first.

During the replacement, the first new node would come up fine, and the old node was properly removed and deleted. So, after the first cycle, I had one new node, and one old node had been removed. However, I was getting somewhat random problems with the second, third, and fourth cycles. Most of the time, the etcd server, during Rancher provisioning, was picking up an IP address from the DHCP range instead of using the fixed IP tied to the MAC address.

    Quick Explanation

I use the Unifi Controller to run my home network (Unifi Security Gateway and several access points). Through the Unifi APIs and a wrapper API I wrote, I am able to generate a valid Hyper-V MAC address and associate it with a fixed IP on the Unifi before the Hyper-V VM is ever configured. When I create a new machine, I assign it the generated MAC address, and my DHCP server always hands it the same IP. This IP is outside of the DHCP range allocated to normal clients. I am working on publishing the Unifi IP Wrapper in a public repository for consumption.
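For illustration, the MAC-generation half of that flow can be as simple as the sketch below. The function name is mine, not the wrapper's actual code; the real API also registers the fixed IP on the Unifi. Microsoft's Hyper-V addresses use the 00:15:5D prefix.

```shell
#!/usr/bin/env bash
# Sketch: produce a random MAC address inside Microsoft's Hyper-V
# prefix (00:15:5D). A wrapper like the one described above would pair
# this with a call that reserves a fixed IP on the Unifi controller.
gen_hyperv_mac() {
  printf '00:15:5D:%02X:%02X:%02X' \
    $((RANDOM % 256)) $((RANDOM % 256)) $((RANDOM % 256))
}

gen_hyperv_mac   # e.g. 00:15:5D:4A:0B:7F (random suffix)
```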

Back to it…

    As I was saying, even though I was assigning a MAC address that had an associated fixed IP, VMs provisioned after the first one seemed to be failing to pick up that IP. What was different?

    Well, deleting a node returns its IP to the pool, so the process looks something like this:

    • First new node provisioned (IP .45 assigned)
    • First old node deleted (return IP .25 to the pool)
    • Second new node provisioned (IP .25 assigned)

    My assumption is that the Unifi does not like such a quick reassignment of a static IP to a new MAC Address. To test this, I modified the provisioner to first create ALL the new nodes before deleting nodes.

In that instance, the nodes provisioned correctly using their newly assigned IPs. However, from a resource perspective, I hate the thought of having to run 2n nodes during provisioning when really all I need is n + 1.

    Test Flight Charlie: Changing IP assignments

    I modified my Unifi Wrapper API to cycle through the IP block I have assigned to my VMs instead of simply always using the lowest available IP. This allows me to go back to replacement one by one, without worrying about IP/MAC Address conflicts on the Unifi.
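The allocation change is easy to sketch. Assuming a small block of host offsets reserved for VMs, the allocator walks forward from the last offset it handed out instead of always taking the lowest free one, so a just-released offset is not immediately reattached to a new MAC. The names and the block range here are illustrative, not the actual wrapper code:

```shell
#!/usr/bin/env bash
# Round-robin IP allocation sketch: keep moving forward through the
# block, wrapping at the end, so a just-freed offset is the *last*
# candidate to be reassigned rather than the first.
BLOCK_START=40
BLOCK_END=49

next_ip() {
  local last=$1; shift          # last offset handed out
  local in_use=("$@")           # offsets currently assigned
  local size=$((BLOCK_END - BLOCK_START + 1))
  local candidate=$last
  for _ in $(seq 1 "$size"); do
    candidate=$((candidate + 1))
    if ((candidate > BLOCK_END)); then candidate=$BLOCK_START; fi
    local taken=0
    for ip in "${in_use[@]}"; do
      if ((ip == candidate)); then taken=1; break; fi
    done
    if ((taken == 0)); then echo "$candidate"; return 0; fi
  done
  return 1                      # block exhausted
}

next_ip 45 41 42 45   # prints 46, not the just-freed lower offsets
```

With the old lowest-available strategy, deleting .41 would hand .41 straight to the next VM; here the walk continues upward and only wraps around once the top of the block is reached.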

    Landing it…

    With this improvement, I have fewer qualms about configuring provisioning to run in the evenings. Most likely, I will build the base Ubuntu image weekly or bi-weekly to ensure I have the latest updates. From there, I can use the replacement scripts to replace old nodes with new nodes in the cluster.

    I have not decided if I’m going to use a simple task scheduler in Windows, or use an Azure DevOps build agent on my provisioner… Given my recent miscue when installing the Azure DevOps Build Agent, I may opt for the former.

  • A big mistake and a bit of bad luck…

In the Home Lab, things were going well. Perhaps a little too well. A bonehead mistake on my part and a hardware failure combined to make another ridiculous weekend. I am beginning to think this blog is becoming “Matt messed up again.”

    Permissions are a dangerous thing

    I wanted to install the Azure DevOps agent on my hypervisor to allow me to automate and schedule provisioning of new machines. That would allow the provisioning to occur overnight and be overall less impactful. And it is always a bonus when things just take care of themselves.

I installed the agent, but it would not start. It was complaining that it needed permissions to basically the entire drive where it was installed. Before really researching or thinking too much about it, I set about giving the service group access to the root of the drive.

Now, in retrospect, I could have opened the share from my laptop (\\machinename\c$), right-clicked in the blank area, and chosen Properties, which would have gotten me into the security menu. I did not realize that at the time, so I used the Set-ACL Powershell command.

What I did not realize is that Set-ACL performs a full replacement; it is not additive. So, while I thought I was adding permissions for a group, what I was really doing was REMOVING EVERYONE ELSE’S PRIVILEGES from the drive and replacing them with that group’s access. I realized my error when I suddenly had no access to the C: drive…

    I thought I got it back…

    After panicking a bit, I realized that what I had added wasn’t a user, but a group. I was able to get into the Group Policy editor for the server and add the Domain Admins group to that service group, which got my user account access. From there, I started rebuilding permissions on the C drive. Things were looking up.

    I was wrong…

    Then, I decided it would be a good idea to install Windows updates on the server and reboot. That was a huge mistake. The server got into a boot loop, where it would boot, attempt to do the updates, fail, and reboot, starting the process over again. It got worse…

I stopped the server completely during one of the boot cycles for a hard shutdown/restart. When the server POSTed again, it reported, well, basically, that the cache module in the server was no longer present, so it shut off access to my logical drives… all of them.

    What does that mean, exactly? Long story short, my HP Gen8 server has a Smart Array that had a 1GB write cache card in it. That card is, as best I can tell, dead. However, there was a 512MB write cache card in my backup server. I tried a swap, and it was not recognized either. So, potentially, the cache port itself is dead. Either way, my drives were gone.

    Now what?

I was pretty much out of options. While my data was safe and secure on the Synology, all of my VMs were down for the count. My only real option was to see if I could get the server to re-mount the drives without the cache and start rebuilding.

I set up the drives in the same configuration I had previously: I have two 146GB drives and two 1TB drives, so I paired them into two RAID1 arrays. I caught a break: the machine recognized the previous arrays and I did not lose any data. Now, the C drive was, well, toast: I believe my Set-ACL snafu put that Windows install out of commission. But all of my VMs were on the D drive.

So I re-installed Hyper-V Server 2019 on the machine and got to work importing and starting VMs. Once I connected to the server, I was able to re-import all of my Ubuntu VMs, which are my RKE2 nodes. They started up, and everything was good to go.

    There was a catch…

    Not everything came back. Specifically, ALL of my Windows VMs would not boot. They imported fine, but when it came time to boot, I got a “File not found” exception. I honestly have no idea why. I even had a backup of my Domain Controller, taken using Active Business Backup on the Synology. I was able to restore it, however, it would not start, throwing the same error.

My shot in the dark is that the difference lies in how the machines were built: I had provisioned the Windows machines manually, while the Ubuntu machines use Packer. I wonder if the export/import process that is part of the Packer workflow produces some vital files, files my manually provisioned machines were missing because those actions never occurred.

At this point, I’m rebuilding my Windows machines (domain controllers and SQL servers). Once that is done, I will spend some time experimenting on a few test machines to make sure my backups are working… I suppose that’s what disaster recovery tests are for.

  • Tech Tips – Moving away from k8s-at-home

Much of what I learned about Helm charting and running workloads in Kubernetes I credit to the contributors over at k8s-at-home. Their expansive chart collection helped me jump into Kubernetes.

    Last year, they announced they were deprecating their repositories. I am not surprised: the sheer volume of charts they had meant they had to keep up to date with the latest releases from a number of vendors. If a vendor changed an image or configuration, well, someone had to fix it. That’s a lot of work for a small group with no real benefit other than “doing good for others.”

    Thankfully, one of their contributors, Bernd Schorgers, continues to maintain a library chart that can be used as a basis for most of the charts I use.

    Wanting to move off of the k8s-at-home charts for good, I spent some time this week migrating to Bernd’s library chart. I created new images for the following charts.

    Hopefully one or more of these templates can help move you off of the k8s-at-home charts.

    A Huge Thanks

    I cannot stress this enough: I owe a huge thanks to the k8s-at-home folks. Their work allowed me to jump into Helm by examining what they had done to understand where I could go. I appreciate their contributions to the community: past, present, and future.

  • Automated RKE2 Cluster Management

    One of the things I like about cloud-hosted Kubernetes solutions is that they take the pain out of node management. My latest home lab goal was to replicate some of that functionality with RKE2.

Did I do it? Yes. Is there room for improvement? Of course; it’s a software project.

    The Problem

With RKE1, I have a documented and very manual process for replacing nodes in my clusters. It shapes up like this:

1. Provision a new VM.
2. Add a DNS entry for the new VM.
3. Edit the cluster.yml file for that cluster, adding the new VM with the appropriate roles to match the outgoing node.
4. Run rke up.
5. Edit the cluster.yml file for that cluster to remove the old VM.
6. Run rke up.
7. Modify the cluster’s ingress-nginx settings, adding the new external IP and removing the old one.
8. Modify my reverse proxy to reflect the IP changes.
9. Delete the old VM and its DNS entries.

    Repeat the above process for every node in the cluster. Additionally, because the nodes could have slightly different docker versions or updates, I often found myself provisioning a whole set of VMs at a time and going through this process for all the existing nodes at once. The process was fraught with problems, not the least of which is me remembering things that I had to do.

    A DNS Solution

    I wrote a wrapper API to manage Windows DNS settings, and built calls to that wrapper into my Unifi Controller API so that, when I provision a new machine or remove an old one, it will add or remove the fixed IP from Unifi AND add or remove the appropriate DNS record for the machine.

    Since I made DNS entries easier to manage, I also came up with a DNS naming scheme to help manage cluster traffic:

    1. Every control plane node gets an A record with cp-<cluster name>.gerega.net. This lets my kubeconfig files remain unchanged, and traffic is distributed across the control plane nodes via round robin DNS.
    2. Every node gets an A record with tfx-<cluster name>.gerega.net. This allows me to configure my external reverse proxy to use this hostname instead of an individual IP list. See below for more on this from a reverse proxy perspective.
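Expressed as BIND-style records for illustration (my DNS actually lives in Windows DNS, and the IPs here are made up), the scheme looks like this: several A records share one name, and round-robin DNS rotates through them.

```
; control-plane name: three server nodes answer for one record
cp-cluster1   IN A  10.1.2.50
cp-cluster1   IN A  10.1.2.51
cp-cluster1   IN A  10.1.2.52

; traffic name: every node in the cluster, for the reverse proxy
tfx-cluster1  IN A  10.1.2.50
tfx-cluster1  IN A  10.1.2.51
tfx-cluster1  IN A  10.1.2.52
tfx-cluster1  IN A  10.1.2.53
```

Replacing a node then becomes an A-record swap under an unchanged name, so neither the kubeconfig nor the reverse proxy ever has to learn a new hostname.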

    That solved most of my DNS problems, but I still had issues with the various rke up runs and compatibility worries.

    Automating with RKE2

The provisioning process for RKE2 is much simpler than that for RKE1. I was able to shift the cluster configuration into the Packer provisioning scripts, which allowed me to do more within the associated Powershell scripts. This, coupled with the DNS standards above, means that I can run one script and end up with a completely provisioned RKE2 cluster.

I quickly realized that adding and removing nodes to/from the RKE2 clusters was equally easy. Adding a node simply meant provisioning a new VM with the appropriate scripting to install RKE2 and join the existing control plane. Removing a node was simple:

    1. Drain the node (kubectl drain)
2. Delete the node from the cluster (kubectl delete node/<node name>).
    3. Delete the VM (and its associated DNS).

    As long as I had at least one node with the server role running at all times, things worked fine.

    With RKE2, though, I decided to abandon my ingress-nginx installations in favor of using RKE2’s built-in Nginx Ingress. This allows me to skip managing the cluster’s external IPs, as the RKE cluster’s installer handles that for me.

    Proxying with Nginx

A little over a year ago I posted my updated network diagram, which introduced a hardware proxy in the form of a Raspberry Pi running Nginx. That little box is a workhorse, and plans are in place for a much-needed upgrade. In the meantime, though, it works.

    My configuration was heavily IP based: I would configure upstream blocks with each cluster node’s IP set, and then my sites would be configured to proxy to those IPs. Think something like this:

    upstream cluster1 {
      server 10.1.2.50:80;
      server 10.1.2.51:80;
      server 10.1.2.52:80;
    }
    
    server {
       ## server settings
    
       location / {
         proxy_pass http://cluster1;
         # proxy settings
       }
    }

The issue here is that every time I add or remove a cluster node, I have to touch this file. My DNS server is set up for round-robin DNS, which means I should be able to add new A records with the same host name, and DNS will cycle through the different servers.

My worry, though, was the Nginx reverse proxy: if I configure the reverse proxy against a single DNS name, will it cache the IP? Nothing to do but test, right? So I changed my configuration as follows:

    upstream cluster1 {
      server tfx-cluster1.gerega.net:80;
    }
    
    server {
       ## server settings
    
       location / {
         proxy_pass http://cluster1;
         # proxy settings
       }
    }
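For what it's worth, my understanding of stock Nginx is that a hostname in an upstream block is resolved once, when the configuration is loaded, with all returned A records used for load balancing; the name is not re-resolved until a reload. If that ever becomes a problem, a resolver directive plus a variable in proxy_pass forces periodic re-resolution. A sketch, where the resolver IP is a placeholder for your own DNS server:

```nginx
server {
   ## server settings

   # re-resolve the name every 30s instead of only at config load
   resolver 10.1.2.1 valid=30s;

   location / {
     set $backend "tfx-cluster1.gerega.net";
     proxy_pass http://$backend:80;
     # proxy settings
   }
}
```

The trade-off is losing the upstream block (and its balancing options), so for my setup a reload after node changes is the simpler answer.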

    Everything seemed to work, but how can I know it worked? For that, I dug into my Prometheus metrics.

    Finding where my traffic is going

I spent a bit of time figuring out which metrics would best show the number of requests coming through each Nginx controller. As luck would have it, I always put a ServiceMonitor on my Nginx applications to make sure Prometheus collects their data.

I dug around in the Nginx metrics and found nginx_ingress_controller_requests. With some experimentation, I arrived at this query:

    sum(rate(nginx_ingress_controller_requests{cluster="internal"}[2m])) by (instance)

Looks easy, right? Basically: take the sum, by instance, of the rate of incoming requests over a time window. I could clean this up a little and add some rounding, but I did not really care about the exact numbers: I wanted to make sure that requests were balanced effectively across the instances. I was not disappointed:

Rate of Incoming Requests

    Each line is an Nginx controller pod in my internal cluster. Visually, things look to be balanced quite nicely!

    Yet Another Migration

    With the move to RKE2, I made more work for myself: I need to migrate my clusters from RKE1 to RKE2. With Argo, the migration should be pretty easy, but still, more home lab work.

    I also came out of this with a laundry list of tech tips and other long form posts… I will be busy over the next few weeks.

  • D-N-S Ja!

    With all this talk of home lab cluster provisioning, you might be wondering if I am actually doing any software development at home. As a matter of fact, I am. Just because it is in support of my home lab provisioning does not mean it is not software development!

    Keeping the Lab Tidy

One of the things that has bothered me about my home lab is DNS management. As I provision and remove Linux VMs, having appropriate DNS records makes the machines easy to find, and in general it makes for a tidier environment: I have a list of my machines and their IPs in one place. I have a small Powershell module that uses the DnsServer module in Windows, but what I wanted was an API that would let me manage my DNS.

Now, taking a cue from my Hyper-V wrapper, I created a small API that uses the DnsServer module to manage DNS entries. It was fairly easy to build, and it works quite well on my own machine, which has the DnsServer module available because I have the Remote Server Administration Tools (RSAT) installed.

    Location, Location, Location

When I started looking at where I could host this service, I realized that I could not host it on my hypervisor as I did with the Hyper-V service. My server is running Windows Server 2019 Hyper-V edition, which is a stripped-down version of Windows Server meant for hypervisors. That means I am unable to install the DNS Server role on it. Admittedly, I did not try installing RSAT on it, but I have a tendency to believe that would not work.

    Since the DnsServer module would be installed by default on my domain controller, I made the decision to host the DNS API on that server. I went about creating an appropriate service account and installed it as a service. Just like the Hyper-V API, the Windows DNS API is available on Github.

    Return to API Management

At this point, I have APIs hosted on a few different machines, plus the APIs hosted in my home lab clusters. This has forced me to revisit installing an API Management solution at home. Sure, no one else uses my lab, but that is not the point. Right now, I have a “service discovery” problem: where are my APIs, how do I call them, what are their authentication mechanisms, etc. This is part of what API Management can solve: a single place to locate and call my APIs. Over the next few weeks I may delve back into Gravitee.io in an effort to re-establish a proper API Management service.

    Going Public, Going Github

    While it may seem like I am “burying the headline,” I am going to start making an effort to go public with more of my code. Why? Well, I have a number of different repositories that might be of use to some folks, even as reference. Plus, well, it keeps me honest: Going public with my code means I have to be good about my own security practices. Look for posts on migration updates as I get to them.

    Going public will most likely mean going Github. Yes, I have some public repositories out in Bitbucket, but Github provides a bit more community and visibility for my work. I am sure I will still keep some repositories in Bitbucket, but for the projects that I want public feedback on, I will shift to Github.

    Pop Culture Reference Section

    The title is a callout to Pitch Perfect 2. You are welcome.

  • Moving On: Testing RKE2 Clusters in the Home Lab

    I spent the better part of the weekend recovering from crashing my RKE clusters last Friday. This put me on a path towards researching new Kubernetes clusters and determining the best path forward for my home lab.

    Intentionally Myopic

Let me be clear: this is a home lab, whose purpose is not to help me build bulletproof, corporate production-ready clusters. I also do not want to run Minikube on a box somewhere. So, when I approached my “research” (you will see later why I put that term in quotes), I wanted to make sure I did not get bogged down in the minutiae of different Kubernetes installs or details. I stuck with Rancher Kubernetes Engine (RKE1) for a long time because it was quick to stand up, relatively stable, and easy to manage.

    So, when I started looking for alternatives, my first research was into whether Rancher had any updated offerings. And, with that, I found RKE2.

    RKE2, aka RKE Government

I already feel safer knowing that RKE2’s alter ego is RKE Government. All joking aside, as I dug into RKE2, it seemed a good mix of RKE1, which I am used to, and K3s, a lightweight implementation of Kubernetes. The RKE2 documentation was, frankly, much more intuitive and easier to navigate than the RKE1 documentation. I am not sure if that is because the documentation is that much better or because RKE2 is that much easier to configure.

    I could spend pages upon pages explaining the experiments I ran over the last few evenings, but the proof is in the pudding, as they say. My provisioning-projects repository has a new Powershell script (Create-Rke2Cluster.ps1) that outlines the steps needed to get a cluster configured. My work, then, came down to how I wanted to configure the cluster.

    RKE1 Roles vs RKE2 Server/Agent

    RKE1 had a notion of node roles which were divided into three categories:

• controlplane – Nodes with this role host the Kubernetes APIs.
• etcd – Nodes with this role host the etcd storage containers. There should be an odd number of them; at least three is a good choice.
• worker – Nodes with this role run workloads within the cluster.

    My RKE1 clusters typically have the following setup:

    • One node with controlplane, etcd, and worker roles.
    • Two nodes with etcd and worker roles.
    • If needed, additional nodes with just the worker role.

    This seemed to work well: I had proper redundancy with etcd and enough workers to host all of my workloads. Sure, I only had one control plane, so if that node went down, well, the cluster would be in trouble. However, I usually did not have much problem with keeping the nodes running so I left it as it stood.

With RKE2, there is simply a notion of server and agent. Server nodes run etcd and the control plane components, while agents run only user-defined workloads. So, when I started planning my RKE2 clusters, I figured I would run one server and two agents. The lack of etcd redundancy would not have me losing sleep at night, but I really did not want to run three servers and then more agents for my workloads.

    As I started down this road, I wondered how I would be able to cycle nodes. I asked the #rke2 channel on rancher-users.slack.com, and got an answer from Brad Davidson: I should always have at least 2 available servers, even when cycling. However, he did mention something that was not immediately apparent: the server can and will run user-defined workloads unless the appropriate taints have been applied. So, in that sense, an RKE2 server acts similarly to my “all roles” node, where it functions as a control plane, etcd, and worker node.
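For reference, my understanding from the RKE2 docs is that dedicating a server to control-plane duty is done with a node taint in the server's config file. A sketch, where the server URL and token are placeholders:

```yaml
# /etc/rancher/rke2/config.yaml on an additional server node.
# Without the node-taint entry, this server will also schedule
# user workloads, acting like my old "all roles" RKE1 node.
server: https://cp-cluster1.gerega.net:9345
token: <shared-cluster-token>
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"
```

Leaving the taint off is what makes a small one-server, two-agent cluster viable: the server quietly pulls worker duty too.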

    The Verdict?

    Once I saw a path forward with RKE2, I really have not looked back. I have put considerable time into my provisioning projects scripts, as well as creating a new API wrapper for Windows DNS management (post to follow).

“But Matt, you haven’t considered Kubernetes X or Y?”

    I know. There are a number of flavors of Kubernetes that can run your bare metal servers. I spent a lot of time and energy in learning RKE1, and I have gotten very good at managing those clusters. RKE2 is familiar, with improvements in all the right places. I can see automating not only machine provisioning, but the entire process of node replacement. I would love nothing more than to come downstairs on a Monday morning and see newly provisioned cluster nodes humming away after my automated process ran.

    So, yes, maybe I skipped a good portion of that “research” step, but I am ok with it. After all, it is my home lab: I am more interested in re-visiting Gravitee.io for API management and starting to put some real code services out in the world.

  • Home Lab – No More iSCSI – Backup Plans

    This post is part of a short series on migrating my home hypervisor off of iSCSI.

It is worth noting (and quite ironic) that I went through a fire drill last week when I crashed my RKE clusters. That event gave me some fresh eyes into the data that is important to me.

    How much redundancy do I need?

I have been relying primarily on the redundancy of the Synology for a bit too long. The volume can lose a disk without data loss, and the Synology has been very stable, but that does not mean I should leave things as they are.

    There are many layers of redundancy, and for a home lab, it is about making decisions as to how much you are willing to pay and what you are willing to lose.

    No Copy, Onsite Copy, Offsite Copy

    I prefer not to spend a ton of time thinking about all of this, so I created three “buckets” for data priority:

    • No Backup: Synology redundancy is sufficient. If I lose it, I lose it.
• Onsite Copy: Create another copy of the data somewhere at home. For this, I am going to attach a USB enclosure with a 2TB disk to my Synology and set up USB copy tasks in the Diskstation Manager (DSM).
    • Offsite Copy: Ship the data offsite for safety. I have been using Backblaze B2 buckets and the DSM’s Cloud Sync for personal documents for years, but the time has come to scale up a bit.

It is worth noting that some things may be bucketed into both Onsite and Offsite, depending on how critical the data is. With the inventory I took over the last few weeks, I had some decisions to make.

• Domain Controllers -> Onsite copy for sure. I am not yet sure if I want to add an Offsite copy, though: the domain is small enough that it can be rebuilt quickly, and there are really only a handful of machines on it. It just makes managing Windows Servers much easier.
• Kubernetes NFS Data -> I use nfs-subdir-external-provisioner to provide persistent storage for my Kubernetes clusters. I will certainly do Onsite copies of this data, but for the most important pieces (such as this blog), I will also set up an Offsite transfer.
• SQL Server Data -> The SQL Server data is being stored on an iSCSI LUN, but I configured regular backups to go to a file share on the Synology. From there, Onsite backups should be sufficient.
• Personal Stuff -> I have a lot of personal data (photos, financial data, etc.) stored on the Synology. That data is already encrypted and sent to Backblaze, but I may add another layer of redundancy and do an Onsite copy of it as well.

    Solutioning

    Honestly, I thought this would be harder, but Synology’s DSM and available packages really made it easy.

1. VM Backups with Active Backup for Business: Installed Active Backup for Business, set up a connection to my Hyper-V server, picked the machines I wanted to back up… It really was that simple. I should still test a recovery, but on a test VM.
2. Onsite Copies with USB Copy: I plugged an external HD into the Synology, which was immediately recognized, and a file share was created. I installed the USB Copy package and started configuring tasks. Basically, I can set up copy tasks to move data from the Synology to the USB drive as desired, and they include various settings, such as incremental or versioned backups, triggers, and file filters.
    3. SQL Backups: I had to refresh my memory on scheduling SQL backups in SQL Server. Once I had that done, I just made sure to back them up to a share on the Synology. From there, USB Copy took care of the rest.
    4. Offsite: As I mentioned, I have had Cloud Sync running to Backblaze B2 buckets for a while. All I did was expand my copying. Cloud Sync offers some of the same flexibility as USB Copy, but having well-structured file shares for your data makes it easier to select and push data as you want it.

    Results and What’s next

My home lab refresh took me about two weeks, although the work happened during a few evenings across that span. What I am left with is a much more performant server. While I still store data on the Synology via NFS and iSCSI, it is only smaller pieces that are less reliant on fast access. The VM disks live on an SSD RAID array on the server, which gives me added stability and less thrashing of the Synology and its SSD cache. Nowhere is this more evident than in my average daily SSD temperature, which has dropped 12°F over the last two weeks.

    What’s next? I will be taking a look at alternatives to Rancher Kubernetes Engine. I am hoping to find something a bit more stable and secure to manage.

  • Nothing says “Friday Night Fun” like crashing an RKE Cluster!

    Yes… I crashed my RKE clusters in a big way yesterday evening, and I spent a lot of time getting them back. I learned a few things in the process, and may have gotten the kickstart I need to investigate new Kubernetes flavors.

    It all started with an upgrade…

All I wanted to do was go from Kubernetes 1.24.8 to 1.24.9. It seemed a simple ask. I downloaded the new RKE command line tool (version 1.4.2), updated my cluster.yaml file, and ran rke up. The cluster upgraded without errors… but all the pods were in an error state. I detailed my findings in a GitHub issue, so I will not repeat them here. Thankfully, I was able to downgrade, and things started working.
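For the curious, the whole upgrade attempt amounts to bumping one line and re-running rke up. The fragment below is a hedged sketch: a real cluster file also defines nodes, the network plugin, and so on, and the exact -rancher suffix on the version string is an assumption.

```shell
# Minimal stand-in for the real cluster.yaml (which is much longer).
cat > /tmp/cluster.yaml <<'EOF'
kubernetes_version: v1.24.8-rancher1-1
EOF

# Bump the patch version in place...
sed -i 's/v1\.24\.8-rancher1-1/v1.24.9-rancher1-1/' /tmp/cluster.yaml
grep kubernetes_version /tmp/cluster.yaml

# ...then apply it (commented out here: it needs the live nodes).
# rke up --config /tmp/cluster.yaml
```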

    Sometimes, when I face these types of situations, I’ll stand up a new cluster to test the upgrade/downgrade process. I figured that would be a good idea, so I kicked off a new cluster provisioning script.

Now, with recent releases, an upgrade of the node OS is sometimes required to make the Kubernetes upgrade run smoothly. So, on my internal cluster, I attempted the upgrade to 1.24.9 again, and then upgraded all of my nodes with an apt update && apt upgrade -y. That seemed to work: the pods came back online. So I figured I would try with production… This is where things went sideways.

First, I “flipped” the order and upgraded the nodes first. Not only did this put all of the pods in an error state, but the upgrade also took me to Docker version 23, which RKE does not support. So there was no way to run rke up, even to downgrade to another version. I was, well, up a creek, as they say.

    I lucked out

Luckily, earlier in the day I had provisioned three machines and created a small non-production cluster to test the issue I was seeing in RKE. So I had an empty Kubernetes 1.24.9 cluster running. With Argo CD, I was able to “transfer” the workloads from production to non-production simply by changing the ApplicationSet/Application target. The only caveat was that I had to copy files around on my NFS share to get them in the correct place. I managed to get all of this done with only one hour and fifty-four minutes of downtime, which, well, is not bad.
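The “transfer” sounds more dramatic than it is: because everything is declarative, it is mostly a matter of repointing spec.destination in the Argo CD Application (or the ApplicationSet template) at the other cluster. The fragment below is illustrative only; the server URL and namespace are made up.

```shell
# Illustrative Application fragment; in practice this edit happens in the
# Git repo that Argo CD watches, not in a temp file.
cat > /tmp/app-destination.yaml <<'EOF'
spec:
  destination:
    server: https://nonprod-k8s.lab.local:6443   # was the production API endpoint
    namespace: blog
EOF
grep -c 'server:' /tmp/app-destination.yaml
```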

    Cleaning Up

Now, the nodes for my new “production” cluster were named nonprod, and my OCD would never let that stand. So I provisioned three new nodes, created a new production cluster, and transferred workloads to it. Since I do not have auto-prune set, when I changed the ApplicationSet/Application cluster to the new one, the old applications stayed running. This allowed me to get things set up on the new cluster and then cut over on the reverse proxy with no downtime.

    There was still the issue of the internal cluster. Sure, the pods were running, but on nodes with Docker 23, which is not supported. I had HOPED that I could provision a new set of nodes, add them to the cluster, and remove the old ones. I had no such luck.

The RKE command line tool will not work on nodes with Docker 23. So, using the nodes I had provisioned, I created yet another new cluster and went about transferring my internal tools workloads to it.

This was marginally more difficult, because I had to manually install Nginx Ingress and Argo CD using Helm before I could cut over and let the new Argo CD manage the rest of the conversion. However, as all of my resources are declaratively defined in Git repositories, the move was much easier than reinstalling everything from scratch.
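The manual bootstrap looked roughly like the commands below, using the upstream Helm charts. The release names and namespaces are my assumptions here, not a prescription.

```shell
# Add the upstream chart repositories.
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update

# Install just enough to let Argo CD take over the rest.
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace
helm install argocd argo/argo-cd \
  --namespace argocd --create-namespace
```

Once Argo CD was up and pointed at my Git repositories, it reconciled everything else on its own.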

    Lessons Learned

For me, RKE upgrades have been flaky the last few times. The best way to ensure success is to cycle new, fully upgraded nodes with Docker 20.10 into the cluster, remove the old ones, and then upgrade. With any other method, I have run into issues.

Also, I will NEVER EVER run a blanket apt upgrade on my nodes again. I clearly do not have my application packages pinned correctly, which means I run the risk of getting an unsupported version of Docker.
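A pin file (or an apt-mark hold) is enough to keep a blanket upgrade from dragging Docker forward. Sketch below; the package names are the standard Docker ones, and the version glob is an assumption about how the repository names its 20.10 builds.

```shell
# Written to /tmp here for illustration; on a real node this belongs in
# /etc/apt/preferences.d/. Priority > 1000 forces apt to stay on 20.10.
cat > /tmp/docker-pin.pref <<'EOF'
Package: docker-ce docker-ce-cli
Pin: version 5:20.10.*
Pin-Priority: 1001
EOF
grep '^Pin:' /tmp/docker-pin.pref

# The simpler, coarser alternative:
# apt-mark hold docker-ce docker-ce-cli containerd.io
```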

I am going to start investigating other Kubernetes flavors. I like the simplicity that RKE 1 provides, but the response from the community is slow, if it comes at all. I may stand up a few small clusters just to see which ones make the most sense for the lab. I need something that is easy to keep updated, and RKE 1 is not fitting that bill anymore.

  • Home Lab – No More iSCSI: Transfer, Shutdown, and Rebuild

    This post is part of a short series on migrating my home hypervisor off of iSCSI.

    • Home Lab – No More iSCSI: Prep and Planning
    • Home Lab – No More iSCSI: Transfer, Shutdown, and Rebuild (this post)
    • Home Lab – No More iSCSI: Backup plans (coming soon)

    Observations – Migrating Servers

The focus of my hobby time over the past few days has been moving production assets to the temporary server. Most of it is fairly vanilla, but I have a few observations worth noting.

    • I forgot how easy it was to replicate and failover VMs with Hyper-V. Sure, I could have tried a live migration, but creating a replica, shutting down the machine, and failing over was painless.
• Do not forget to provision an external virtual switch on your Hyper-V servers. Yes, it sounds obvious, but I dove right into setting up the temporary server as a replication server and, upon trying to fail over, realized that the machine on the new server did not have a network connection.
• I moved my MinIO instance to the Synology: I originally had my MinIO server running on an Ubuntu VM on my hypervisor, but decided that moving the storage application closer to the storage medium was generally a good idea.
• For my Kubernetes nodes, it was easier to provision new nodes on the temp server than it was to do a live migration or planned failover. I followed my normal process for provisioning new nodes and decommissioning old ones, and voilà, my production cluster is on the temporary server. I will simply reverse the process for the transfer back.
• I am getting noticeably better performance on the temporary server, which has far less compute and RAM, but the VMs are on local disks. While the Synology has been rock solid, I think I have been throwing too much at it, and it can slow down from time to time.

    Let me be clear: My network storage is by no means bad, and it will be utilized. But storing the primary vhdx files for my VMs on the hypervisor provides much better performance.

    Shut It Down!

After successfully moving my production assets over to the temporary server, it was time to shut down the original hypervisor. I shut down the VMs that remained on it and attempted to copy the VMs to a network drive on the Synology. That was a giant mistake.

Those VM files already lived on the Synology as part of an iSCSI volume. By pulling those files off of the iSCSI drive and copying them back to the Synology, I was basically doing a huge file copy (like, 600+ GB huge) without the systems really knowing it was a copy. As you can imagine, the performance was terrible.

I found a 600 GB SAS drive that I was able to plug into the old hypervisor, and I used that as a temporary location for the copy. Even with that change, the copy took a while (about 3 hours, I think).

    Upgrade and Install

I mounted my new SSDs (Samsung EVO 1TB) in some drive trays and plugged them into the server. A quick boot into the Smart Storage administrator let me set up a new drive array. While I thought about using RAID 0 and giving myself 2 TB of space, I went with the safe option and used RAID 1.

Having configured the temporary server with Windows Server Hyper-V 2019, the process of doing it again was, well, pretty standard. I booted to the USB stick I had created earlier for Hyper-V 2019 and went through the paces. My domain controller was still live (thanks, temporary server!), so I was able to add the machine to the domain and then perform all of the management via the Server Manager tool on my laptop.

    Moving back in

    I have the server back up with a nice new 1TB drive for my VMs. That’s a far cry from the 4 TB of storage I had allocated on the SAN target on the Synology, so I have to be more careful with my storage.

Now, if I set a Hyper-V disk to, say, 100 GB, Hyper-V does not actually provision a file that is 100 GB: the vhdx file grows over time. But that does not mean I should mindlessly provision disk space on my VMs.

For my Kubernetes nodes, looking at my usage, 50 GB is more than enough for those disks. All persistent storage for those workloads is handled by an NFS provisioner, which configures shares on the Synology. As for the domain controllers, I am able to run with minimal storage because, well, it is a tiny domain.
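For anyone curious, that provisioner is a single Helm install pointed at the NAS. The chart and repo are from the nfs-subdir-external-provisioner project; the server name and export path below are placeholders for my actual Synology values.

```shell
# Add the project's chart repository and install, pointing at the NAS.
helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=synology.lab.local \
  --set nfs.path=/volume1/k8s
```

Every PersistentVolumeClaim then gets its own subdirectory under that export, which is what makes the backup story on the Synology side so simple.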

The problem children are MinIO and my SQL Server databases. MinIO I covered above, moving it to the Synology directly. SQL Server, however, is a different animal.

    Why be you, when you can be new!

I already had my production SQL instance running on another server. Rather than move it around and then mess with storage, I felt the safer solution was to provision a new SQL Server instance and migrate my databases. I only have four databases on that server, so moving them is not a monumental task.

    A new server affords me two things:

1. Latest and greatest versions of Windows and SQL Server.
    2. Minimal storage on the hypervisor disk itself. I provisioned only about 80 GB for the main virtual disk. This worked fine, except that I ran into a storage compatibility issue that needed a small workaround.

    SMB 3.0, but only certain ones

    My original intent was to create a virtual disk on a network share on the Synology, and mount that disk to the new SQL Server VM. That way, to the SQL Server, the storage is local, but the SQL data would be on the Synology.

    Hyper-V did not like this. I was able to create a vhdx file on a share just fine, but when I tried to add it to a VM using Add-VMHardDiskDrive, I got the following error:

    Remote SMB share does not support resiliency.

A quick Google search turned up this Spiceworks question, where the only answer suggests that the Synology SMB 3.0 implementation is Linux-based, while Hyper-V expects the Windows-based implementation, and that features Hyper-V relies on (such as SMB resiliency) are missing from the Linux one.

While I am usually not one to take a single answer and call it fact, I also did not want to spend too much time getting into the nitty-gritty. I knew it was a possibility that this was not going to work, and, in the interest of time, I went back to my old pal iSCSI. I provisioned a small iSCSI LUN (300 GB) and mounted it directly in the virtual machine. So now my SQL Server has a data drive that uses the Synology for storage.

    And we’re back!

    Moves like this provide an opportunity for consolidation, updates, and improvements, and I seized some of those opportunities:

• I provisioned new Active Directory domain controllers on updated operating systems, switched over, and deleted the old ones.
• I moved MinIO to my Synology, and moved HashiCorp Vault to my Kubernetes cluster (using MinIO as a storage backend). This removed two virtual machines from the hypervisor.
    • I provisioned a new SQL Server and migrated my production databases to it.
• Compared to the rat's nest of network configuration I had before, the networking on the hypervisor is much simpler:
      • 1 standard NIC with a static IP so that I can get in and out of the hypervisor itself.
      • 1 teamed NIC with a static IP attached to the Hyper-V Virtual Switch.
• For now, I did not bring back my “non-production” cluster. It was only running test/stage environments of some of my home projects, and I will most likely move those workloads to my internal cluster.

    I was able to shut down the temporary server, meaning, at least in my mind, I am back to where I was. However, now that I have things on the hypervisor itself, my next step is to ensure I am appropriately backing things up. I will finish this series with a post on my backup configuration.