• A big mistake and a bit of bad luck…

    In the Home Lab, things were going well. Perhaps a little too well. A bonehead mistake on my part and a hardware failure combined to make another ridiculous weekend. I am beginning to think this blog is becoming “Matt messed up again.”

    Permissions are a dangerous thing

    I wanted to install the Azure DevOps agent on my hypervisor to allow me to automate and schedule provisioning of new machines. That would allow the provisioning to occur overnight and be overall less impactful. And it is always a bonus when things just take care of themselves.

    I installed the agent, but it would not start. It was complaining that it needed permissions to basically the entire drive where it was installed. Before really researching or thinking too much about it, I set about giving the service group access to the root of the drive.

    Now, in retrospect, I could have opened the share on my laptop (\\machinename\c$), right-clicked in the blank area, and chosen Properties from there, which would have gotten me into the security menu. I did not realize that at the time, and I used the Set-Acl PowerShell cmdlet.

    What I did not realize is that Set-Acl performs a full replacement; it is not additive. So, while I thought I was adding permissions for a group, what I was really doing was REMOVING EVERYONE ELSE’S PRIVILEGES from the drive and replacing them with group access. I realized my error when I simply had no access to the C: drive…
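
    In hindsight, the additive pattern is to read the existing ACL with Get-Acl, append a rule, and write the whole thing back. A minimal sketch, assuming a hypothetical service group name and path:

    # Read the current ACL, add an access rule for the service group, and write it back.
    # The group name is a placeholder, not my actual setup.
    $acl  = Get-Acl -Path "C:\"
    $rule = New-Object System.Security.AccessControl.FileSystemAccessRule("LAB\DevOpsAgentGroup", "ReadAndExecute", "ContainerInherit,ObjectInherit", "None", "Allow")
    $acl.AddAccessRule($rule)             # additive: existing entries stay in place
    Set-Acl -Path "C:\" -AclObject $acl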

    I thought I got it back…

    After panicking a bit, I realized that what I had added wasn’t a user, but a group. I was able to get into the Group Policy editor for the server and add the Domain Admins group to that service group, which got my user account access. From there, I started rebuilding permissions on the C drive. Things were looking up.

    I was wrong…

    Then, I decided it would be a good idea to install Windows updates on the server and reboot. That was a huge mistake. The server got into a boot loop, where it would boot, attempt to do the updates, fail, and reboot, starting the process over again. It got worse…

    I stopped the server completely during one of the boot cycles for a hard shutdown/restart. When the server posted again, the post said, well, basically, that the cache module in the server was no longer there, so it shut off access to my logical drives…. All of them.

    What does that mean, exactly? Long story short, my HP Gen8 server has a Smart Array that had a 1GB write cache card in it. That card is, as best I can tell, dead. However, there was a 512MB write cache card in my backup server. I tried a swap, and it was not recognized either. So, potentially, the cache port itself is dead. Either way, my drives were gone.

    Now what?

    I was pretty much out of options. While my data was safe and secure on the Synology, all of my VMs were down for the count. My only real option was to see if I could get the server to re-mount the drives without the cache and start rebuilding.

    I set up the drives in the same configuration I had previously. I have two 146GB drives and two 1TB drives, so I paired them up into two RAID 1 arrays. I caught a break: the machine recognized the previous drives and I did not lose any data. Now, the C drive was, well, toast: I believe my Set-Acl snafu put that Windows install out of commission. But all of my VMs were on the D drive.

    So I re-installed Hyper-V Server 2019 on the server and got to work attempting to import and start VMs. Once I got connected to the server, I was able to re-import all of my Ubuntu VMs, which are my RKE2 nodes. They started up, and everything was good to go.

    There was a catch…

    Not everything came back. Specifically, ALL of my Windows VMs would not boot. They imported fine, but when it came time to boot, I got a “File not found” exception. I honestly have no idea why. I even had a backup of my Domain Controller, taken using Active Backup for Business on the Synology. I was able to restore it; however, it would not start, throwing the same error.

    My shot in the dark is the way the machines were built: I had provisioned the Windows machines manually, while the Ubuntu machines use Packer. I am wondering if the export/import steps that are part of the Packer process pull some vital files along with the VM, files the manually provisioned machines lost because those steps never happen for them.

    At this point, I’m rebuilding my Windows machines (domain controllers and SQL servers). Once that is done, I will spend some time experimenting on a few test machines to make sure my backups are working… I suppose that’s what disaster recovery tests are for.

  • Tech Tips – Moving away from k8s-at-home

    Much of what I learned about Helm charting and running workloads in Kubernetes I credit to the contributors over at k8s-at-home. Their expansive chart collection helped me jump into Kubernetes.

    Last year, they announced they were deprecating their repositories. I am not surprised: the sheer volume of charts they had meant they had to keep up to date with the latest releases from a number of vendors. If a vendor changed an image or configuration, well, someone had to fix it. That’s a lot of work for a small group with no real benefit other than “doing good for others.”

    Thankfully, one of their contributors, Bernd Schorgers, continues to maintain a library chart that can be used as a basis for most of the charts I use.

    Wanting to move off of the k8s-at-home charts for good, I spent some time this week migrating to Bernd’s library chart. I created new images for the following charts.

    Hopefully one or more of these templates can help move you off of the k8s-at-home charts.

    A Huge Thanks

    I cannot stress this enough: I owe a huge thanks to the k8s-at-home folks. Their work allowed me to jump into Helm by examining what they had done to understand where I could go. I appreciate their contributions to the community: past, present, and future.

  • Automated RKE2 Cluster Management

    One of the things I like about cloud-hosted Kubernetes solutions is that they take the pain out of node management. My latest home lab goal was to replicate some of that functionality with RKE2.

    Did I do it? Yes. Is there room for improvement? Of course, it’s a software project.

    The Problem

    With RKE1, I have a documented and very manual process for replacing nodes in my clusters. It shapes up like this:

    1. Provision a new VM.
    2. Add a DNS Entry for the new VM.
    3. Edit the cluster.yml file for that cluster, adding the new VM with the appropriate roles to match the outgoing node.
    4. Run rke up
    5. Edit the cluster.yml file for that cluster to remove the old VM.
    6. Run rke up
    7. Modify the cluster’s ingress-nginx settings, adding the new external IP and removing the old one.
    8. Modify my reverse proxy to reflect the IP Changes
    9. Delete the old VM and its DNS entries.

    Repeat the above process for every node in the cluster. Additionally, because the nodes could have slightly different Docker versions or updates, I often found myself provisioning a whole set of VMs at a time and going through this process for all the existing nodes at once. The process was fraught with problems, not the least of which was having to remember everything I needed to do.

    A DNS Solution

    I wrote a wrapper API to manage Windows DNS settings, and built calls to that wrapper into my Unifi Controller API so that, when I provision a new machine or remove an old one, it will add or remove the fixed IP from Unifi AND add or remove the appropriate DNS record for the machine.

    Since I made DNS entries easier to manage, I also came up with a DNS naming scheme to help manage cluster traffic:

    1. Every control plane node gets an A record named cp-<cluster name>.gerega.net. This lets my kubeconfig files remain unchanged, and traffic is distributed across the control plane nodes via round-robin DNS.
    2. Every node gets an A record named tfx-<cluster name>.gerega.net. This allows me to configure my external reverse proxy to use this hostname instead of an individual IP list. See below for more on this from a reverse proxy perspective.
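
    Creating those records amounts to a couple of cmdlet calls per node; a sketch with hypothetical node IPs (multiple A records under the same name is what gives the round-robin behavior):

    # Hypothetical IPs for two control plane nodes in a cluster named "internal".
    Add-DnsServerResourceRecordA -ZoneName "gerega.net" -Name "cp-internal" -IPv4Address "10.1.2.50"
    Add-DnsServerResourceRecordA -ZoneName "gerega.net" -Name "cp-internal" -IPv4Address "10.1.2.51"
    # Every node, control plane or not, also gets a tfx-<cluster name> record for the reverse proxy.
    Add-DnsServerResourceRecordA -ZoneName "gerega.net" -Name "tfx-internal" -IPv4Address "10.1.2.50"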

    That solved most of my DNS problems, but I still had issues with the various rke up runs and compatibility worries.

    Automating with RKE2

    The provisioning process for RKE2 is much simpler than that for RKE1. I was able to shift the cluster configuration into the Packer provisioning scripts, which allowed me to do more within the associated PowerShell scripts. This, coupled with the DNS standards above, meant that I could run one script and end up with a completely provisioned RKE2 cluster.

    I quickly realized that adding and removing nodes to/from the RKE2 clusters was equally easy. Adding nodes to the cluster simply meant provisioning a new VM with the appropriate scripting to install RKE2 and add it to the existing control plane. Removing nodes from the cluster was simple (see the sketch after this list):

    1. Drain the node (kubectl drain)
    2. Delete the node from the cluster (kubectl delete node/<node name>).
    3. Delete the VM (and its associated DNS).
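
    In practice, the drain usually needs a couple of extra flags to get past DaemonSet pods and emptyDir volumes. A rough sketch, with a hypothetical node name:

    # Hypothetical node name; cordon and drain it, remove it from the cluster, then delete the VM and its DNS records.
    kubectl drain rke2-agent-01.gerega.net --ignore-daemonsets --delete-emptydir-data
    kubectl delete node rke2-agent-01.gerega.net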

    As long as I had at least one node with the server role running at all times, things worked fine.

    With RKE2, though, I decided to abandon my ingress-nginx installations in favor of using RKE2’s built-in Nginx Ingress. This allows me to skip managing the cluster’s external IPs, as the RKE cluster’s installer handles that for me.

    Proxying with Nginx

    A little over a year ago I posted my updated network diagram, which introduced a hardware proxy in the form of a Raspberry Pi running Nginx. That little box is a workhorse, and plans are in place for a much-needed upgrade. In the meantime, however, it works.

    My configuration was heavily IP based: I would configure upstream blocks with each cluster node’s IP set, and then my sites would be configured to proxy to those IPs. Think something like this:

    upstream cluster1 {
      server 10.1.2.50:80;
      server 10.1.2.51:80;
      server 10.1.2.52:80;
    }
    
    server {
       ## server settings
    
       location / {
         proxy_pass http://cluster1;
         # proxy settings
       }
    }

    The issue here is that every time I add or remove a cluster node, I have to mess with this file. My DNS server is set up for round-robin DNS, which means I should be able to add new A records with the same hostname, and the DNS will cycle through the different servers.

    My worry, though, was the Nginx reverse proxy. If I configure the reverse proxy to a single DNS name, will it cache that IP? Nothing to do but test, right? So I changed my configuration as follows:

    upstream cluster1 {
      server tfx-cluster1.gerega.net:80;
    }
    
    server {
       ## server settings
    
       location / {
         proxy_pass http://cluster1;
         # proxy settings
       }
    }

    Everything seemed to work, but how can I know it worked? For that, I dug into my Prometheus metrics.

    Finding where my traffic is going

    I spent a bit of time trying to figure out which metrics made the most sense to see the number of requests coming through each Nginx controller. As luck would have it, I always put a ServiceMonitor on my Nginx applications to make sure Prometheus is collecting data.
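
    For reference, a ServiceMonitor is only a few lines of YAML. A minimal sketch, where the namespaces and labels are assumptions about a typical ingress-nginx metrics Service rather than my exact setup:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: nginx-ingress
      namespace: monitoring
    spec:
      namespaceSelector:
        matchNames:
          - kube-system
      selector:
        matchLabels:
          app.kubernetes.io/name: rke2-ingress-nginx
      endpoints:
        - port: metrics
          interval: 30s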

    I dug around in the Nginx metrics and found nginx_ingress_controller_requests. With some experimentation, I found this query:

    sum(rate(nginx_ingress_controller_requests{cluster="internal"}[2m])) by (instance)

    Looks easy, right? Basically, look at the sum of the rate of incoming requests by instance over a given window. Now, I could clean this up a little and add some rounding and such, but I really did not care about the exact number: I wanted to make sure that the requests across the instances were balanced effectively. I was not disappointed:

    (Chart: Rate of Incoming Requests)

    Each line is an Nginx controller pod in my internal cluster. Visually, things look to be balanced quite nicely!
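
    If I ever want a tidier readout, the rounding I mentioned is just a wrapper around the same query (same labels assumed):

    round(sum(rate(nginx_ingress_controller_requests{cluster="internal"}[2m])) by (instance), 0.01)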

    Yet Another Migration

    With the move to RKE2, I made more work for myself: I need to migrate my clusters from RKE1 to RKE2. With Argo, the migration should be pretty easy, but still, more home lab work.

    I also came out of this with a laundry list of tech tips and other long form posts… I will be busy over the next few weeks.

  • D-N-S Ja!

    With all this talk of home lab cluster provisioning, you might be wondering if I am actually doing any software development at home. As a matter of fact, I am. Just because it is in support of my home lab provisioning does not mean it is not software development!

    Keeping the Lab Tidy

    One of the things that has bothered me about my home lab is DNS management. As I provision and remove Linux VMs, having appropriate DNS records makes them easy to find. Generally it makes for a tidier environment, as I have a list of my machines and their IPs in one place. I have a small PowerShell module that uses the DnsServer module in Windows. What I wanted was an API that would allow me to manage my DNS.

    So, taking a cue from my Hyper-V wrapper, I created a small API that uses the DnsServer module to manage DNS entries. It was fairly easy, and it works quite well on my own machine, which has the DnsServer module available because I have the Remote Server Administration Tools (RSAT) installed.
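
    Under the hood, the API endpoints do little more than wrap the DnsServer cmdlets. A rough sketch of the kinds of calls involved, with hypothetical zone, record, and server names:

    # Add and remove an A record on a remote DNS server (all names are placeholders).
    Add-DnsServerResourceRecordA -ComputerName "dc01" -ZoneName "gerega.net" -Name "node-01" -IPv4Address "10.1.2.60"
    Remove-DnsServerResourceRecord -ComputerName "dc01" -ZoneName "gerega.net" -Name "node-01" -RRType "A" -Force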

    Location, Location, Location

    When I started looking at where I could host this service, I realized that I could not host it on my hypervisor as I did with the Hyper-V service. My server is running Windows Server 2019 Hyper-V edition, which is a stripped-down version of Windows Server meant for hypervisors. That means I am unable to install the DNS Server role on it. Admittedly, I did not try installing RSAT on it, but I have a tendency to believe that would not work.

    Since the DnsServer module would be installed by default on my domain controller, I made the decision to host the DNS API on that server. I went about creating an appropriate service account and installed it as a service. Just like the Hyper-V API, the Windows DNS API is available on Github.

    Return to API Management

    At this point, I have APIs hosted on a few different machines, plus the APIs hosted in my home lab clusters. This has forced me to revisit installing an API Management solution at home. Sure, no one else uses my lab, but that is not the point. Right now, I have a “service discovery” problem: where are my APIs, how do I call them, what is their authentication mechanism, and so on. This is part of what API Management can solve: I can have a single place to locate and call my APIs. Over the next few weeks I may delve back into Gravitee.io in an effort to re-establish a proper API Management service.

    Going Public, Going Github

    While it may seem like I am “burying the headline,” I am going to start making an effort to go public with more of my code. Why? Well, I have a number of different repositories that might be of use to some folks, even as reference. Plus, well, it keeps me honest: Going public with my code means I have to be good about my own security practices. Look for posts on migration updates as I get to them.

    Going public will most likely mean going Github. Yes, I have some public repositories out in Bitbucket, but Github provides a bit more community and visibility for my work. I am sure I will still keep some repositories in Bitbucket, but for the projects that I want public feedback on, I will shift to Github.

    Pop Culture Reference Section

    The title is a callout to Pitch Perfect 2. You are welcome.

  • Moving On: Testing RKE2 Clusters in the Home Lab

    I spent the better part of the weekend recovering from crashing my RKE clusters last Friday. This put me on a path towards researching new Kubernetes clusters and determining the best path forward for my home lab.

    Intentionally Myopic

    Let me be clear: this is a home lab, whose purpose is not to help me build bulletproof, corporate production-ready clusters. I also do not want to run Minikube on a box somewhere. So, when I approached my “research” (you will see later why I put that term in quotes), I wanted to make sure I did not get bogged down in the minutiae of different Kubernetes installs or details. I stuck with Rancher Kubernetes Engine (RKE1) for a long time because it was quick to stand up, relatively stable, and easy to manage.

    So, when I started looking for alternatives, my first research was into whether Rancher had any updated offerings. And, with that, I found RKE2.

    RKE2, aka RKE Government

    I already feel safer knowing that RKE2’s alter ego is RKE Government. All joking aside, as I dug into RKE2, it seemed a good mix of RKE1, which I am used to, and K3s, a lightweight implementation of Kubernetes. The RKE2 documentation was, frankly, much more intuitive and easier to navigate than the RKE1 documentation. I am not sure if that is because the documentation is that much better or because RKE2 is that much easier to configure.

    I could spend pages upon pages explaining the experiments I ran over the last few evenings, but the proof is in the pudding, as they say. My provisioning-projects repository has a new PowerShell script (Create-Rke2Cluster.ps1) that outlines the steps needed to get a cluster configured. My work, then, came down to how I wanted to configure the cluster.

    RKE1 Roles vs RKE2 Server/Agent

    RKE1 had a notion of node roles which were divided into three categories:

    • controlplane – Nodes with this role host the Kubernetes API
    • etcd – Nodes with this role host the etcd storage containers. There should be an odd number of them; at least three is a good choice.
    • worker – Nodes with this role can run workloads within the cluster.

    My RKE1 clusters typically have the following setup:

    • One node with controlplane, etcd, and worker roles.
    • Two nodes with etcd and worker roles.
    • If needed, additional nodes with just the worker role.

    This seemed to work well: I had proper redundancy with etcd and enough workers to host all of my workloads. Sure, I only had one control plane node, so if it went down, the cluster would be in trouble. However, I usually did not have much trouble keeping the nodes running, so I left it as it stood.

    With RKE2, there is simply a notion of server and agent. Server nodes run etcd and the control plane components, while agents run only user-defined workloads. So, when I started planning my RKE2 clusters, I figured I would run one server and two agents. The lack of etcd redundancy would not have me losing sleep at night, but I really did not want to run three servers and then more agents for my workloads.

    As I started down this road, I wondered how I would be able to cycle nodes. I asked the #rke2 channel on rancher-users.slack.com, and got an answer from Brad Davidson: I should always have at least 2 available servers, even when cycling. However, he did mention something that was not immediately apparent: the server can and will run user-defined workloads unless the appropriate taints have been applied. So, in that sense, an RKE2 server acts similarly to my “all roles” node, where it functions as a control plane, etcd, and worker node.
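
    For what it is worth, keeping user workloads off of a server node appears to come down to a taint in that server’s RKE2 config; a sketch based on my reading of the RKE2 docs, not something I have in place:

    # /etc/rancher/rke2/config.yaml on a server node
    node-taint:
      - "CriticalAddonsOnly=true:NoExecute"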

    The Verdict?

    Once I saw a path forward with RKE2, I have not looked back. I have put considerable time into my provisioning-projects scripts, as well as creating a new API wrapper for Windows DNS management (post to follow).

    “But Matt, you haven’t considered Kubernetes X or Y?”

    I know. There are a number of flavors of Kubernetes that can run on your bare metal servers. I spent a lot of time and energy learning RKE1, and I have gotten very good at managing those clusters. RKE2 is familiar, with improvements in all the right places. I can see automating not only machine provisioning, but the entire process of node replacement. I would love nothing more than to come downstairs on a Monday morning and see newly provisioned cluster nodes humming away after my automated process ran.

    So, yes, maybe I skipped a good portion of that “research” step, but I am ok with it. After all, it is my home lab: I am more interested in re-visiting Gravitee.io for API management and starting to put some real code services out in the world.

  • Home Lab – No More iSCSI – Backup Plans

    This post is part of a short series on migrating my home hypervisor off of iSCSI.

    It is worth noting (and quite ironic) that I went through a fire drill last week when I crashed my RKE clusters. That event gave me some fresh eyes on the data that is important to me.

    How much redundancy do I need?

    I have been relying primarily on the redundancy of the Synology for a bit too long. The volume can survive losing a disk, and the Synology has been very stable, but that does not mean I should leave things as they are.

    There are many layers of redundancy, and for a home lab, it is about making decisions as to how much you are willing to pay and what you are willing to lose.

    No Copy, Onsite Copy, Offsite Copy

    I prefer not to spend a ton of time thinking about all of this, so I created three “buckets” for data priority:

    • No Backup: Synology redundancy is sufficient. If I lose it, I lose it.
    • Onsite Copy: Create another copy of the data somewhere at home. For this, I am going to attach a USB enclosure with a 2TB disk to my Synology and set up USB copy tasks in Diskstation Manager (DSM).
    • Offsite Copy: Ship the data offsite for safety. I have been using Backblaze B2 buckets and the DSM’s Cloud Sync for personal documents for years, but the time has come to scale up a bit.

    It is worth noting that some things may be bucketed into both Onsite and Offsite, depending on how critical the data is. With the inventory I took over the last few weeks, I had some decisions to make.

    • Domain Controllers -> Onsite copy for sure. I am not yet sure if I want to add an Offsite copy, though: there is not much in the domain that cannot be rebuilt quickly, and there are really only a handful of machines on it. It just makes managing Windows Servers much easier.
    • Kubernetes NFS Data -> I use nfs-subdir-external-provisioner to provide persistent storage for my Kubernetes clusters. I will certainly do OnSite copies of this data, but for the most important ones (such as this blog), I will also setup an offsite transfer.
    • SQL Server Data -> The SQL Server data is being stored on an iSCSI LUN, but I configured regular backups to go to a file share on the Synology. From there, OnSite backups should be sufficient.
    • Personal Stuff -> I have a lot of personal data (photos, financial data, etc.) stored on the Synology. That data is already encrypted and sent to Backblaze, but I may add another layer of redundancy and do an Onsite copy of them as well.

    Solutioning

    Honestly, I thought this would be harder, but Synology’s DSM and available packages really made it easy.

    1. VM Backups with Active Backup for Business: Installed Active Backup for Business, set up a connection to my Hyper-V server, picked the machines I wanted to back up… It really was that simple. I should still test a recovery, but on a test VM.
    2. Onsite Copies with USB Copy: I plugged an external HD into the Synology, which was immediately recognized and given a file share. I installed the USB Copy package and started configuring tasks. Basically, I can set up copy tasks to move data from the Synology to the USB drive as desired, with various settings such as incremental or versioned backups, triggers, and file filters.
    3. SQL Backups: I had to refresh my memory on scheduling SQL backups in SQL Server. Once I had that done, I just made sure to back them up to a share on the Synology. From there, USB Copy took care of the rest.
    4. Offsite: As I mentioned, I have had Cloud Sync running to Backblaze B2 buckets for a while. All I did was expand my copying. Cloud Sync offers some of the same flexibility as USB Copy, but having well-structured file shares for your data makes it easier to select and push data as you want it.

    Results and What’s next

    My home lab refresh took me about two weeks, albeit during a few evenings across that time span. What I am left with is a much more performant server. While I still store data on the Synology via NFS and iSCSI, it is only smaller pieces that are less reliant on fast access. The VM disks live on an SSD RAID array on the server, which gives me added stability and less thrashing of the Synology and its SSD cache. Nothing makes that more evident than my average daily SSD temperature dropping 12°F over the last two weeks.

    What’s next? I will be taking a look at alternatives to Rancher Kubernetes Engine. I am hoping to find something a bit more stable and secure to manage.

  • Nothing says “Friday Night Fun” like crashing an RKE Cluster!

    Yes… I crashed my RKE clusters in a big way yesterday evening, and I spent a lot of time getting them back. I learned a few things in the process, and may have gotten the kickstart I need to investigate new Kubernetes flavors.

    It all started with an upgrade…

    All I wanted to do was go from Kubernetes 1.24.8 to 1.24.9. It seems a simple ask. I downloaded the new RKE command line tool (version 1.4.2), updated my cluster.yaml file, and ran rke up. The cluster upgraded without errors… but all the pods were in an error state. I detailed my findings in a Github issue, so I will not repeat them here. Thankfully, I was able to downgrade, and things started working.

    Sometimes, when I face these types of situations, I’ll stand up a new cluster to test the upgrade/downgrade process. I figured that would be a good idea, so I kicked off a new cluster provisioning script.

    Now, in recent upgrades, sometimes an upgrade of the node is required to make the Kubernetes upgrade run smoothly. So, on my internal cluster, I attempted the upgrade to 1.24.9 again, and then upgraded all of my nodes with an apt update && apt upgrade -y. That seemed to work: the pods came back online, so I figured I would try it with production… This is where things went sideways.

    First, I “flipped” the order and upgraded the nodes first. Not only did this put all of the pods in an error state, but the upgrade took me to Docker version 23, which RKE does not support. So there was no way to run rke up, even to downgrade to another version. I was, well, up a creek, as they say.

    I lucked out

    Luckily, earlier in the day I had provisioned three machines and created a small non-production cluster to test the issue I was seeing in RKE. So I had an empty Kubernetes 1.24.9 cluster running. With Argo, I was able to “transfer” the workloads from production to non-production simply by changing the ApplicationSet/Application target. The only caveat was that I had to copy files around on my NFS shares to get them in the correct place. I managed to get all this done and register only one hour and fifty-four minutes of downtime, which, well, is not bad.
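
    For context, the “transfer” amounts to editing the destination on the Argo CD Application (or the template in the ApplicationSet). A minimal excerpt, with hypothetical cluster and namespace names:

    # Excerpt of an Argo CD Application; switching spec.destination.name moves where the workload is deployed.
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: blog
    spec:
      destination:
        name: nonprod        # was: production
        namespace: blog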

    Cleaning Up

    Now, the nodes for my new “production” cluster were named nonprod, and my OCD would never let that stand. So I provisioned three new nodes, created a new production cluster, and transferred workloads to it. Since I do not have auto-prune set, when I changed the ApplicationSet/Application cluster to the new one, the old applications stayed running. This allowed me to get things set up on the new cluster and then cut over on the reverse proxy with no downtime.

    There was still the issue of the internal cluster. Sure, the pods were running, but on nodes with Docker 23, which is not supported. I had HOPED that I could provision a new set of nodes, add them to the cluster, and remove the old ones. I had no such luck.

    The RKE command line tool will not work on nodes with docker 23. So, using the nodes I provisioned, I created yet another new cluster, and went about the process of transferring my internal tools workloads to it.

    This was marginally more difficult, because I had to manually install Nginx Ingress and Argo CD using Helm before I could cut over to the new Argo CD and let it manage the rest of the conversion. However, as all of my resources are declaratively defined in Git repositories, the move was much easier than reinstalling everything from scratch.

    Lessons Learned

    For me, RKE upgrades have been flaky the last few times. The best way to ensure success is to cycle new, fully upgraded nodes with Docker 20.10 into the cluster, remove the old ones, and then upgrade. Any other method and I have run into issues.

    Also, I will NEVER EVER run apt upgrade on my nodes again. I clearly do not have my application packages pinned correctly, which means I run the risk of getting an unsupported version of Docker.
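
    One way to guard against that, assuming Docker was installed from the standard docker-ce packages, is to put those packages on hold so apt upgrade skips them:

    # Prevent apt upgrade from moving Docker to an unsupported version (package names depend on how Docker was installed).
    sudo apt-mark hold docker-ce docker-ce-cli containerd.io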

    I am going to start investigating other Kubernetes flavors. I like the simplicity that RKE1 provides, but the response from the community is slow, if it comes at all. I may stand up a few small clusters just to see which ones make the most sense for the lab. I need something that is easy to keep updated, and RKE1 is not fitting that bill anymore.

  • Home Lab – No More iSCSI: Transfer, Shutdown, and Rebuild

    This post is part of a short series on migrating my home hypervisor off of iSCSI.

    • Home Lab – No More iSCSI: Prep and Planning
    • Home Lab – No More iSCSI: Transfer, Shutdown, and Rebuild (this post)
    • Home Lab – No More iSCSI: Backup plans (coming soon)

    Observations – Migrating Servers

    The focus of my hobby time over the last few days has been moving production assets to the temporary server. Most of it is fairly vanilla, but I have a few observations worth noting.

    • I forgot how easy it was to replicate and failover VMs with Hyper-V. Sure, I could have tried a live migration, but creating a replica, shutting down the machine, and failing over was painless.
    • Do not forget to provision an external virtual switch on your Hyper-V servers. Yes, it sounds stupid, but I dove right into setting the temporary server up as a replication server, and upon trying to fail over, realized that the machine on the new server did not have a network connection.
    • I moved my Minio instance to the Synology: I originally had my Minio server running on an Ubuntu VM on my hypervisor, but decided moving the storage application closer to the storage medium was generally a good idea.
    • For my Kubernetes nodes, it was easier to provision new nodes on the temp server than it was to do a live migration or planned failover. I followed my normal process for provisioning new nodes and decommissioning old ones, and voilà, my production cluster is on the temporary server. I will simply reverse the process for the transfer back.
    • I am getting noticeably better performance on the temporary server, which has far less compute and RAM, but the VMs are on local disks. While the Synology has been rock solid, I think I have been throwing too much at it, and it can slow down from time to time.

    Let me be clear: My network storage is by no means bad, and it will be utilized. But storing the primary vhdx files for my VMs on the hypervisor provides much better performance.

    Shut It Down!

    After successfully moving my production assets over to the temporary server, it was time to shut it down. I shut down the VMs that remained on the original hypervisor and attempted to copy the VMs to a network drive on the Synology. That was a giant mistake.

    Those VM files already live on the Synology as part of an iSCSI volume. By trying to pull those files off of the iSCSI drive and copy them back to the Synology, I was basically doing a huge file copy (like, 600+ GB huge) without the systems really knowing it was a copy. As you can imagine, the performance was terrible.

    I found a 600GB SAS drive that I was able to plug into the old hypervisor, and I used that as a temporary location for the copy. Even with that change, the copy took a while (I think about 3 hours).

    Upgrade and Install

    I mounted my new SSDs (Samsung EVO 1TB) in some drive trays and plugged them into the server. A quick boot into the Smart Storage Administrator let me set up a new drive array. While I thought about using RAID 0 and giving myself 2TB of space, I went with the safe option and used RAID 1.

    Having configured the temporary server with Windows Server Hyper-V 2019, the process of doing it again was, well, pretty standard. I booted to the USB stick I created earlier for Hyper-V 2019 and went through the paces. My domain controller was still live (thanks, temporary server!), so I was able to add the machine to the domain and then perform all of the management via the Server Manager tool on my laptop.

    Moving back in

    I have the server back up with a nice new 1TB drive for my VMs. That’s a far cry from the 4 TB of storage I had allocated on the SAN target on the Synology, so I have to be more careful with my storage.

    Now, if I set a Hyper-V disk to, say, 100GB, Hyper-V does not actually provision a file that is 100GB: the vhdx file grows over time. But that does not mean I should just mindlessly provision disk space on my VMs.
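
    For reference, this is the dynamically expanding disk type, which is what New-VHD creates unless you ask for a fixed disk. A quick sketch with a hypothetical path:

    # Creates a dynamically expanding VHDX: the file starts small and grows toward the 100GB cap as data is written.
    New-VHD -Path "D:\VMs\node01.vhdx" -SizeBytes 100GB -Dynamic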

    For my Kubernetes nodes, looking at my usage, 50GB is more than enough for those disks. All persistent storage for those workloads is handled by an NFS provisioner which configures shares on the Synology. As for the domain controllers, I am able to run with minimal storage because, well, it is a tiny domain.

    The problem children are Minio and my SQL Server Databases. Minio I covered above, moving it to the Synology directly. SQL Server, however, is a different animal.

    Why be you, when you can be new!

    I already had my production SQL instance running on another server. Rather than move it around and then mess with storage, I felt the safer solution was to provision a new SQL Server instance and migrate my databases. I only have 4 databases on that server, so moving databases is not a monumental task.

    A new server affords me two things:

    1. Latest and greatest version of Windows and SQL server.
    2. Minimal storage on the hypervisor disk itself. I provisioned only about 80 GB for the main virtual disk. This worked fine, except that I ran into a storage compatibility issue that needed a small workaround.

    SMB 3.0, but only certain ones

    My original intent was to create a virtual disk on a network share on the Synology, and mount that disk to the new SQL Server VM. That way, to the SQL Server, the storage is local, but the SQL data would be on the Synology.

    Hyper-V did not like this. I was able to create a vhdx file on a share just fine, but when I tried to add it to a VM using Add-VMHardDiskDrive, I got the following error:

    Remote SMB share does not support resiliency.
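
    For reference, the attempt looked roughly like this (hypothetical share path and VM name):

    # Create a VHDX on an SMB share on the Synology, then try to attach it to the VM.
    New-VHD -Path "\\synology\vmdisks\sqldata.vhdx" -SizeBytes 300GB -Dynamic
    Add-VMHardDiskDrive -VMName "SQL02" -Path "\\synology\vmdisks\sqldata.vhdx"   # this is the call that fails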

    A quick Google search turned up this Spiceworks question, where the only answer suggests that the Synology’s SMB 3.0 implementation is Linux-based, that Hyper-V is looking for features of the Windows-based implementation, and that those features are missing on the Linux side.

    While I am usually not one to take a single answer and call it fact, I also did not want to spend too much time getting into the nitty gritty. I knew it was a possibility that this was not going to work, and, in the interest of time, I went back to my old pal iSCSI. I provisioned a small iSCSI LUN (300 GB) and mounted it directly in the virtual machine. So now my SQL Server has a data drive that uses the Synology for storage.

    And we’re back!

    Moves like this provide an opportunity for consolidation, updates, and improvements, and I seized some of those opportunities:

    • I provisioned new Active Directory Domain controllers on updated operating systems, switched over, and deleted the old one.
    • I moved Minio to my Synology, and moved Hashicorp Vault to my Kubernetes cluster (using Minio as a storage backend). This removed 2 virtual machines from the hypervisor.
    • I provisioned a new SQL Server and migrated my production databases to it.
    • Compared to the rat’s nest of network configuration I had, the networking on the hypervisor is much simpler:
      • 1 standard NIC with a static IP so that I can get in and out of the hypervisor itself.
      • 1 teamed NIC with a static IP attached to the Hyper-V Virtual Switch.
    • For the moment, I did not bring back my “non-production” cluster. It was only running test/stage environments of some of my home projects. For the time being, I will most likely move these workloads to my internal cluster.

    I was able to shut down the temporary server, meaning, at least in my mind, I am back to where I was. However, now that I have things on the hypervisor itself, my next step is to ensure I am appropriately backing things up. I will finish this series with a post on my backup configuration.

  • Installing Minio on a Synology Diskstation with Nginx SSL

    In an effort to get rid of a virtual machine on my hypervisor, I wanted to move my Minio instance to my Synology. Keeping the storage interface close to the storage itself helps with latency and is, well, one less thing I have to worry about in my home lab.

    There are a few guides out there for installing Minio on a Synology. Jaroensak Yodkantha walks you through the full process of setting up the Synology and Minio using a docker command line. The folks over at BackupAssist show you how to configure Minio through the Diskstation Manager web portal. I used the BackupAssist article to get myself started, but found myself tweaking the setup because I want to have SSL communication available through my Nginx reverse proxy.

    The Basics

    Prep Work

    I went into the Shared Folder section of the DSM control panel and created a new shared folder called minio. The settings on this share are pretty much up to you, but I did this so that all of my Minio data would be in a known location.

    Within the minio folder, I created a data folder and a blank text file called minio. Inside the minio file, I set up my Minio configuration:

    # MINIO_ROOT_USER and MINIO_ROOT_PASSWORD sets the root account for the MinIO server.
    # This user has unrestricted permissions to perform S3 and administrative API operations on any resource in the deployment.
    # Omit to use the default values 'minioadmin:minioadmin'.
    # MinIO recommends setting non-default values as a best practice, regardless of environment
    
    MINIO_ROOT_USER=myadmin
    MINIO_ROOT_PASSWORD=myadminpassword
    
    # MINIO_VOLUMES sets the storage volume or path to use for the MinIO server.
    
    MINIO_VOLUMES="/mnt/data"
    
    # MINIO_SERVER_URL sets the hostname of the local machine for use with the MinIO Server
    # MinIO assumes your network control plane can correctly resolve this hostname to the local machine
    
    # Uncomment the following line and replace the value with the correct hostname for the local machine.
    
    MINIO_SERVER_URL="https://s3.mattsdatacenter.net"
    MINIO_BROWSER_REDIRECT_URL="https://storage.mattsdatacenter.net"

    It is worth noting the URLs: I want to put this system behind my Nginx reverse proxy and let it do SSL termination, and in order to do that, I found it easiest to use two domains: one for the API and one for the Console. I will get into more details on that later.

    Also, as always, change your admin username and password!

    Setup the Container

    Following the BackupAssist article, I installed the Docker package on to my Synology and opened it up. From the Registry menu, I searched for minio and found the minio/minio image:

    Click on the row to highlight it, and click on the Download button. You will be prompted for the label to download; I chose latest. Once the image is downloaded (you can check the Image tab for progress), go to the Container tab and click Create. This will open the Create wizard and get you started.

    • On the Image screen, select the minio/minio:latest image.
    • On the Network screen, select the bridge network that is defaulted. If you have a custom network configuration, you may have some work here.
    • On the General Settings screen, you can name the container whatever you like. I enabled the auto-restart option to keep it running. On this screen, click on the Advanced Settings button
      • In the Environment tab, change MINIO_CONFIG_ENV_FILE to /etc/config.env
      • In the Execution Command tab, change the execution command to minio server --console-address :9090
      • Click Save to close Advanced Settings
    • On the Port Settings screen, add the following mappings:
      • Local Port 39000 -> Container Port 9000 – Type TCP
      • Local Port 39090 -> Container Port 9090 – Type TCP
    • On the Volume Settings Screen, add the following mappings:
      • Click Add File, select the minio file created above, and set the mount path to /etc/config.env
      • Click Add Folder, select the data folder created above, and set the mount path to /mnt/data

    At that point, you can view the Summary and then create the container. Once the container starts, you can access your Minio instance at http://<synology_ip_or_hostname>:39090 and log in with the credentials saved in your config file.
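
    For reference, the wizard settings above boil down to something like the following docker run command. This is a sketch; the /volume1/minio paths are an assumption about where DSM placed the shared folder:

    docker run -d --name minio \
      --restart unless-stopped \
      -p 39000:9000 -p 39090:9090 \
      -v /volume1/minio/minio:/etc/config.env \
      -v /volume1/minio/data:/mnt/data \
      -e MINIO_CONFIG_ENV_FILE=/etc/config.env \
      minio/minio:latest server --console-address :9090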

    What Just Happened?

    The above steps should have created a Docker container running Minio on your Synology. Minio has two separate ports: one for the API and one for the Console. Per Minio’s documentation, the --console-address parameter in the container execution command is now required, and it sets the container port for the console. In our case, we set it to 9090. The API port defaults to 9000.

    However, I wanted to run on non-standard ports, so I mapped ports 39090 and 39000 to ports 9090 and 9000, respectively. That means traffic coming in on 39090 and 39000 gets routed to my Minio container on ports 9090 and 9000, respectively.

    Securing traffic with Nginx

    I like the ability to have SSL communication whenever possible, even if it is just within my home network. Most systems today default to expecting SSL, and sometimes it can be hard to find that switch to let them work with insecure connections.

    I was hoping to get the console and the API behind the same domain, but with SSL, that just isn’t in the cards. So, I chose s3.mattsdatacenter.net as the domain for the API, and storage.mattsdatacenter.net as the domain for the Console. No, those aren’t the real domain names.

    With that, I added the following sites to my Nginx configuration:

    storage.mattsdatacenter.net
      map $http_upgrade $connection_upgrade {
          default Upgrade;
          ''      close;
      }
    
      server {
          server_name storage.mattsdatacenter.net;
          client_max_body_size 0;
          ignore_invalid_headers off;
          location / {
              proxy_pass http://10.0.0.23:39090;
              proxy_set_header Host $host;
              proxy_set_header X-Real-IP $remote_addr;
              proxy_set_header X-Forwarded-proto $scheme;
              proxy_set_header X-Forwarded-port $server_port;
              proxy_set_header X-Forwarded-for $proxy_add_x_forwarded_for;
    
              proxy_set_header Upgrade $http_upgrade;
              proxy_set_header Connection $connection_upgrade;
    
              proxy_http_version 1.1;
              proxy_read_timeout 900s;
              proxy_buffering off;
          }
    
        listen 443 ssl; # managed by Certbot
        allow 10.0.0.0/24;
        deny all;
    
        ssl_certificate /etc/letsencrypt/live/mattsdatacenter.net/fullchain.pem; # managed by Certbot
        ssl_certificate_key /etc/letsencrypt/live/mattsdatacenter.net/privkey.pem; # managed by Certbot
    }
    s3.mattsdatacenter.net
      map $http_upgrade $connection_upgrade {
          default Upgrade;
          ''      close;
      }
    
      server {
          server_name s3.mattsdatacenter.net;
          client_max_body_size 0;
          ignore_invalid_headers off;
          location / {
              proxy_pass http://10.0.0.23:39000;
              proxy_set_header Host $host;
              proxy_set_header X-Real-IP $remote_addr;
              proxy_set_header X-Forwarded-proto $scheme;
              proxy_set_header X-Forwarded-port $server_port;
              proxy_set_header X-Forwarded-for $proxy_add_x_forwarded_for;
    
              proxy_set_header Upgrade $http_upgrade;
              proxy_set_header Connection $connection_upgrade;
    
              proxy_http_version 1.1;
              proxy_read_timeout 900s;
              proxy_buffering off;
          }
    
        listen 443 ssl; # managed by Certbot
        allow 10.0.0.0/24;
        deny all;
    
        ssl_certificate /etc/letsencrypt/live/mattsdatacenter.net/fullchain.pem; # managed by Certbot
        ssl_certificate_key /etc/letsencrypt/live/mattsdatacenter.net/privkey.pem; # managed by Certbot
    }

    This configuration allows me to access the API and Console via their domains, with SSL terminated on the proxy. Configuring Minio is pretty easy: set MINIO_BROWSER_REDIRECT_URL to the URL of your console (in my case, the domain the proxy sends to port 39090) and MINIO_SERVER_URL to the URL of your API (the domain proxied to port 39000).

    This configuration allows me to address Minio for S3 in two ways:

    1. Use https://s3.mattsdatacenter.net for secure connectivity through the reverse proxy.
    2. Use http://<synology_ip_or_hostname>:39000 for insecure connectivity directly to the instance.

    I have not had the opportunity to test the performance difference between option 1 and option 2, but it is nice to have both available. For now, I will most likely lean towards the SSL path until I notice degradation in connection quality or speed.

    And, with that, my Minio instance is now running on my Diskstation, which means fewer VMs to manage and back up on my hypervisor.

  • Home Lab – No More iSCSI, Prep and Planning

    This post is part of a short series on migrating my home hypervisor off of iSCSI.

    • Home Lab – No More iSCSI: Prep and Planning (this post)
    • Home Lab – No More iSCSI: Shutdown and Provisioning (coming soon)
    • Home Lab – No More iSCSI: Backup plans (coming soon)

    I realized today that my home lab setup, by technology standards, is old. Sure, my overall setup has gotten some incremental upgrades, including an SSD cache for the Synology, a new Unifi Security Gateway, and some other new accessories. The base Hyper-V server, however, has remained untouched, outside of the requisite updates.

    Why no upgrades? Well, first, it is my home lab. I am a software engineer by trade, and the lab is meant for me to experiment not with operating systems or network configurations, but with application development and deployment procedures, tools, and techniques. And for that, it has worked extremely well over the last five years.

    That said, my initial setup had some flaws, and I am seeing some stability issues that I would like to correct now, before I wake up one morning with nothing working. With that in mind, I have come up with a plan.

    Setup a Temporary Server

    I am quite sure you’re thinking to yourself “What do you mean, temporary server?” Sure, I could shut everything down, copy it off the server onto the Synology, and then re-install the OS. And while this is a home lab and supposedly “throw away,” there are some things running that I consider production. For example:

    1. Unifi Controller – I do not yet have the luxury of running a Unifi Dream Machine Pro, but it is on my wish list. In the meantime, I run an instance of the Unifi controller in my “production” cluster.
    2. Home Assistant – While I am still rocking an ISY994i as an Insteon interface, I moved most of my home automation to a Home Assistant instance in the cluster.
    3. Node Red – I have been using Node-Red with the Home Assistant palette to script my automations.
    4. Windows Domain Controller – I am still rocking a Windows Domain at home. It is the easiest way to manage the hypervisor, as I am using the “headless” version of Windows Hyper-V Server 2019.
    5. Mattgerega.com – Yup, this site runs in my cluster.

    Thankfully, my colleague Justin happened to have an old server lying around that he has not powered on in a while, and has graciously allowed me to borrow it so that I can transfer my production assets over and keep things going.

    We’re gonna change the way we run…

    My initial setup put the bulk of my storage on the Synology via iSCSI, so much so that I had to put an SSD cache in the Synology just to keep up. At the beginning, that made sense. I was running mostly Windows VMs, and my vital data was stored on the VM itself. I did not have a suitable backup plan, so having all that data on the Synology meant I had at least some drive redundancy.

    Times have changed. My primary mechanism for running applications is via Kubernetes clusters. Those nodes typically contain no data at all, as I use an NFS provisioner and storage class to create persistent storage volumes via NFS on the Synology. And while I still have a few VMs with data on them that will need to be backed up, I really want to get away from iSCSI.

    The server I have, an old HP Proliant DL380 Gen8, has 8 2.5″ drive bays. My original impression was that I needed to buy SAS drives for it, but Justin said he has had luck running SATA SSDs in his.

    Requirements

    Even with a home lab move, it is always good to have some clear requirements.

    1. Upgrade my Hyper-V Server to 2019.
    2. No more iSCSI disks on the server: Rely on NFS and proper backup procedures.
    3. Fix my networking: I had originally teamed 4 of the 6 NIC ports on the server together. While I may still do that, I need to clean up that implementation, as I have learned a lot in the last few years.
    4. Keep it simple.

    Could I explore VMware or Proxmox? I could, but, frankly, I want to learn more about Kubernetes and how I can use it in application architecture to speed delivery and reliability. I do not really care what the virtualization technology is, as long as I can run Kubernetes. Additionally, I have a LOT of automation around building Hyper-V machines, and I do not want to rebuild it.

    Since this is my home lab, I do not have a lot of time to burn on it, hence the KISS method. Switching virtualization stacks means more time converting images. Going from Hyper-V to Hyper-V means that, for production VMs, I can set up replication and just move them to the temp server and back again.

    Prior Proper Planning

    With my requirements set, I created a plan:

    • Configure the temporary server and get production servers moved. This includes consolidating “production” databases into a single DB server, which is a matter of moving one or two DBs.
    • Shut down all other VMs and copy them over to a fileshare on the Synology.
    • Fresh installation of Windows Server 2019 Hyper-V.
    • Add 2 1TB SSDs into the hypervisor in a RAID 1 array.
    • Replicate the VMs from the temporary server to the new hypervisor.
    • Copy the rest of the VMs to the new server and start them up.
    • Create some backup procedures for data stored on the hypervisor (i.e., if it is on a VM’s drive, it needs to be put on the Synology somewhere).
    • Delete my iSCSI LUN from the Synology.

    So, what’s done?

    I am, quite literally, still on step one. I got the temporary server up and running with replication, and I am starting to move production images. Once my temporary production environment is running, I will get started on the new server. I will post some highlights of that process in the days to come.