Category: Technology

  • Home Lab – No More iSCSI – Backup Plans

    This post is part of a short series on migrating my home hypervisor off of iSCSI.

It is worth noting (and quite ironic) that I went through a fire drill last week when I crashed my RKE clusters. That event gave me a fresh look at the data that is important to me.

    How much redundancy do I need?

I have been relying primarily on the redundancy of the Synology for a bit too long. The volume can survive the loss of a disk, and the Synology has been very stable, but that does not mean I should leave things as they are.

    There are many layers of redundancy, and for a home lab, it is about making decisions as to how much you are willing to pay and what you are willing to lose.

    No Copy, Onsite Copy, Offsite Copy

    I prefer not to spend a ton of time thinking about all of this, so I created three “buckets” for data priority:

    • No Backup: Synology redundancy is sufficient. If I lose it, I lose it.
• Onsite Copy: Create another copy of the data somewhere at home. For this, I am going to attach a USB enclosure with a 2TB disk to my Synology and set up USB Copy tasks in DiskStation Manager (DSM).
    • Offsite Copy: Ship the data offsite for safety. I have been using Backblaze B2 buckets and the DSM’s Cloud Sync for personal documents for years, but the time has come to scale up a bit.

It is worth noting that some things may be bucketed into both Onsite and Offsite, depending on how critical the data is to me. With the inventory I took over the last few weeks, I had some decisions to make.

• Domain Controllers -> Onsite copy for sure. I am not yet sure if I want to add an Offsite copy, though: the domain is small enough that it could be rebuilt quickly, and there are really only a handful of machines on it. It just makes managing Windows Servers much easier.
• Kubernetes NFS Data -> I use nfs-subdir-external-provisioner to provide persistent storage for my Kubernetes clusters. I will certainly do Onsite copies of this data, but for the most important workloads (such as this blog), I will also set up an offsite transfer.
• SQL Server Data -> The SQL Server data is being stored on an iSCSI LUN, but I configured regular backups to go to a file share on the Synology. From there, Onsite backups should be sufficient.
    • Personal Stuff -> I have a lot of personal data (photos, financial data, etc.) stored on the Synology. That data is already encrypted and sent to Backblaze, but I may add another layer of redundancy and do an Onsite copy of them as well.

    Solutioning

    Honestly, I thought this would be harder, but Synology’s DSM and available packages really made it easy.

1. VM Backups with Active Backup for Business: Installed Active Backup for Business, set up a connection to my Hyper-V server, picked the machines I wanted to back up… It really was that simple. I should test a recovery, though, using a test VM.
2. Onsite Copies with USB Copy: I plugged an external HD into the Synology, which was immediately recognized, and a file share was created. I installed the USB Copy package and started configuring tasks. Basically, I can set up copy tasks to move data from the Synology to the USB drive as desired, and each task includes various settings, such as incremental or versioned backups, triggers, and file filters.
3. SQL Backups: I had to refresh my memory on scheduling SQL backups in SQL Server. Once I had that done, I just made sure the backups were written to a share on the Synology (a sketch of the backup statement follows this list). From there, USB Copy took care of the rest.
    4. Offsite: As I mentioned, I have had Cloud Sync running to Backblaze B2 buckets for a while. All I did was expand my copying. Cloud Sync offers some of the same flexibility as USB Copy, but having well-structured file shares for your data makes it easier to select and push data as you want it.
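For reference, the core of that scheduled SQL job is just a BACKUP DATABASE statement pointed at a UNC path on the Synology. This is a minimal sketch with a hypothetical database name, server, and share path; I run it from a SQL Server Agent job on the Windows host rather than by hand:

sqlcmd -S localhost -Q "BACKUP DATABASE [MyAppDb] TO DISK = N'\\synology\sqlbackups\MyAppDb.bak' WITH COMPRESSION, INIT"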

    Results and What’s next

My home lab refresh took about 2 weeks, albeit spread over a few evenings in that span. What I am left with is a much more performant server. While I still store data on the Synology via NFS and iSCSI, it is only the smaller pieces that are less reliant on fast access. The VM disks live on an SSD RAID array on the server, which gives me added stability and less thrashing of the Synology and its SSD cache. Nothing makes that more evident than the fact that my average daily SSD temperature has dropped 12°F over the last 2 weeks.

    What’s next? I will be taking a look at alternatives to Rancher Kubernetes Engine. I am hoping to find something a bit more stable and secure to manage.

  • Nothing says “Friday Night Fun” like crashing an RKE Cluster!

    Yes… I crashed my RKE clusters in a big way yesterday evening, and I spent a lot of time getting them back. I learned a few things in the process, and may have gotten the kickstart I need to investigate new Kubernetes flavors.

    It all started with an upgrade…

All I wanted to do was go from Kubernetes 1.24.8 to 1.24.9. It seemed a simple ask. I downloaded the new RKE command line tool (version 1.4.2), updated my cluster.yml file, and ran rke up. The cluster upgraded without errors… but all the pods were in an error state. I detailed my findings in a GitHub issue, so I will not repeat them here. Thankfully, I was able to downgrade, and things started working.
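For anyone following along, the change itself is tiny. This is a rough sketch rather than my exact file: the kubernetes_version value has to come from the list of versions that particular RKE release supports.

# List the Kubernetes versions the downloaded RKE binary supports
rke config --list-version --all

# Bump the version in the cluster config, e.g.:
#   kubernetes_version: v1.24.9-rancher1-1
# then reconcile the cluster
rke up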

    Sometimes, when I face these types of situations, I’ll stand up a new cluster to test the upgrade/downgrade process. I figured that would be a good idea, so I kicked off a new cluster provisioning script.

Now, in recent upgrades, updating the node OS itself has sometimes been required to make the Kubernetes upgrade run smoothly. So, on my internal cluster, I attempted the upgrade to 1.24.9 again and then upgraded all of my nodes with an apt update && apt upgrade -y. That seemed to work and the pods came back online, so I figured I would try production… This is where things went sideways.

    First, I “flipped” the order, and I upgraded the nodes first. Not only did this put all of the pods in an error state, but the upgrade took me to Docker version 23, which RKE doesn’t support. So there was no way to run rke up, even to downgrade to another version. I was, well, up a creek, as they say.

    I lucked out

Luckily, earlier in the day I had provisioned three machines and created a small non-production cluster to test the issue I was seeing in RKE. So I had an empty Kubernetes 1.24.9 cluster running. With Argo, I was able to “transfer” the workloads from production to non-production simply by changing the ApplicationSet/Application target. The only caveat was that I had to copy files around on my NFS share to get them in the correct place. I managed to get all this done with only 1 hour and 54 minutes of downtime, which, well, is not bad.

    Cleaning Up

    Now, the nodes for my new “production” cluster were named nonprod, and my OCD would never let that stand. So I provisioned three new nodes, created a new production cluster, and transferred workloads to the new cluster. Since I don’t have auto-prune set, when I changed the ApplicationSet/Application cluster to the new one, the old applications stayed running. This allowed me to get things set up on the new cluster and then cutover on the reverse proxy with no downtime.

    There was still the issue of the internal cluster. Sure, the pods were running, but on nodes with Docker 23, which is not supported. I had HOPED that I could provision a new set of nodes, add them to the cluster, and remove the old ones. I had no such luck.

    The RKE command line tool will not work on nodes with docker 23. So, using the nodes I provisioned, I created yet another new cluster, and went about the process of transferring my internal tools workloads to it.

This was marginally more difficult, because I had to manually install Nginx Ingress and Argo CD using Helm before I could cut over to the new Argo CD and let it manage the rest of the conversion. However, as all of my resources are declaratively defined in Git repositories, the move was much easier than reinstalling everything from scratch.

    Lessons Learned

For me, RKE upgrades have been flaky the last few times. The best way to ensure success is to cycle new, fully upgraded nodes running Docker 20.10 into the cluster, remove the old ones, and then upgrade. With any other method, I have run into issues.

Also, I will NEVER EVER run apt upgrade on my nodes again. I clearly do not have my application packages pinned correctly, which means I run the risk of getting an unsupported version of Docker.
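Something like the following would have saved me. This is a sketch of two ways to pin Docker on Ubuntu, assuming the packages come from Docker’s own apt repository (docker-ce rather than docker.io); adjust the package names to whatever your provisioning scripts actually install.

# Option 1: freeze the Docker packages at their currently installed versions
sudo apt-mark hold docker-ce docker-ce-cli containerd.io

# Option 2: pin Docker to the 20.10 series with an apt preference
cat <<'EOF' | sudo tee /etc/apt/preferences.d/docker-ce
Package: docker-ce docker-ce-cli
Pin: version 5:20.10.*
Pin-Priority: 1001
EOF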

I am going to start investigating other Kubernetes flavors. I like the simplicity that RKE 1 provides, but responses from the community are slow, if they come at all. I may stand up a few small clusters just to see which ones make the most sense for the lab. I need something that is easy to keep updated, and RKE1 is not fitting that bill anymore.

  • Home Lab – No More iSCSI: Transfer, Shutdown, and Rebuild

    This post is part of a short series on migrating my home hypervisor off of iSCSI.

    • Home Lab – No More iSCSI: Prep and Planning
    • Home Lab – No More iSCSI: Transfer, Shutdown, and Rebuild (this post)
    • Home Lab – No More iSCSI: Backup plans (coming soon)

    Observations – Migrating Servers

The focus of my hobby time over the last few days has been moving production assets to the temporary server. Most of it is fairly vanilla, but I have a few observations worth noting.

    • I forgot how easy it was to replicate and failover VMs with Hyper-V. Sure, I could have tried a live migration, but creating a replica, shutting down the machine, and failing over was painless.
• Do not forget to provision an external virtual switch on your Hyper-V servers. Yes, it sounds stupid, but I dove right into setting up the temporary server as a replication server, and upon trying to fail over, realized that the machine on the new server did not have a network connection.
    • I moved my Minio instance to the Synology: I originally had my Minio server running on an Ubuntu VM on my hypervisor, but decided moving the storage application closer to the storage medium was generally a good idea.
• For my Kubernetes nodes, it was easier to provision new nodes on the temp server than it was to do a live migration or planned failover. I followed my normal process for provisioning new nodes and decommissioning old ones, and voilà, my production cluster is on the temporary server. I will simply reverse the process for the transfer back.
• I am getting noticeably better performance on the temporary server, which has far less compute and RAM, but the VMs are on local disks. While the Synology has been rock solid, I think I have been throwing too much at it, and it can slow down from time to time.

    Let me be clear: My network storage is by no means bad, and it will be utilized. But storing the primary vhdx files for my VMs on the hypervisor provides much better performance.

    Shut It Down!

    After successfully moving my production assets over to the temporary server, it was time to shut it down. I shut down the VMs that remained on the original hypervisor and attempted to copy the VMs to a network drive on the Synology. That was a giant mistake.

Those VM files already live on the Synology as part of an iSCSI volume. By trying to pull those files off of the iSCSI drive and copy them back to the Synology, I was basically doing a huge file copy (like, 600+ GB huge) without the systems really knowing it was a copy. As you can imagine, the performance was terrible.

I found a 600GB SAS drive that I was able to plug into the old hypervisor, and I used that as a temporary location for the copy. Even with that change, the copy took a while (I think about 3 hours).

    Upgrade and Install

I mounted my new SSDs (Samsung EVO 1TB) in some drive trays and plugged them into the server. A quick boot into the Smart Storage Administrator let me set up a new drive array. While I thought about just using RAID 0 and giving myself 2 TB of space, I went with the safe option and used RAID 1.

Having configured the temporary server with Windows Hyper-V Server 2019, the process of doing it again was, well, pretty standard. I booted to the USB stick I created earlier for Hyper-V 2019 and went through the paces. My domain controller was still live (thanks, temporary server!), so I was able to add the machine to the domain and then perform all of the management via the Server Manager tool on my laptop.

    Moving back in

    I have the server back up with a nice new 1TB drive for my VMs. That’s a far cry from the 4 TB of storage I had allocated on the SAN target on the Synology, so I have to be more careful with my storage.

Now, if I set a Hyper-V disk to, say, 100 GB, Hyper-V does not actually provision a file that is 100 GB: the vhdx file grows over time. But that does not mean I should just mindlessly provision disk space on my VMs.

For my Kubernetes nodes, 50 GB is more than enough for those disks based on my usage. All persistent storage for those workloads is handled by an NFS provisioner, which configures shares on the Synology. As for the domain controllers, I am able to run with minimal storage because, well, it is a tiny domain.
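For anyone curious, the provisioner is a single Helm chart pointed at an NFS export on the Synology. A minimal sketch; the server address and export path here are placeholders, not my actual values:

helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=10.0.0.23 \
  --set nfs.path=/volume1/k8s-pv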

    The problem children are Minio and my SQL Server Databases. Minio I covered above, moving it to the Synology directly. SQL Server, however, is a different animal.

    Why be you, when you can be new!

    I already had my production SQL instance running on another server. Rather than move it around and then mess with storage, I felt the safer solution was to provision a new SQL Server instance and migrate my databases. I only have 4 databases on that server, so moving databases is not a monumental task.

    A new server affords me two things:

1. The latest and greatest versions of Windows and SQL Server.
    2. Minimal storage on the hypervisor disk itself. I provisioned only about 80 GB for the main virtual disk. This worked fine, except that I ran into a storage compatibility issue that needed a small workaround.

    SMB 3.0, but only certain ones

    My original intent was to create a virtual disk on a network share on the Synology, and mount that disk to the new SQL Server VM. That way, to the SQL Server, the storage is local, but the SQL data would be on the Synology.

    Hyper-V did not like this. I was able to create a vhdx file on a share just fine, but when I tried to add it to a VM using Add-VMHardDiskDrive, I got the following error:

    Remote SMB share does not support resiliency.

A quick Google search turned up this Spiceworks question, where the only answer suggests that the Synology SMB 3.0 implementation is Linux-based, while Hyper-V is looking for the Windows-based implementation, and that some features are missing on the Linux side.

While I am usually not one to take one answer and call it fact, I also didn’t want to spend too much time getting into the nitty gritty. I knew it was a possibility that this wasn’t going to work, and, in the interest of time, I went back to my old pal iSCSI. I provisioned a small iSCSI LUN (300 GB) and mounted it directly in the virtual machine. So now my SQL Server has a data drive that uses the Synology for storage.

    And we’re back!

    Moves like this provide an opportunity for consolidation, updates, and improvements, and I seized some of those opportunities:

• I provisioned new Active Directory Domain Controllers on updated operating systems, switched over, and deleted the old ones.
    • I moved Minio to my Synology, and moved Hashicorp Vault to my Kubernetes cluster (using Minio as a storage backend). This removed 2 virtual machines from the hypervisor.
    • I provisioned a new SQL Server and migrated my production databases to it.
• Compared to the rat’s nest of network configuration I had, the networking on the hypervisor is much simpler:
      • 1 standard NIC with a static IP so that I can get in and out of the hypervisor itself.
      • 1 teamed NIC with a static IP attached to the Hyper-V Virtual Switch.
    • For the moment, I did not bring back my “non-production” cluster. It was only running test/stage environments of some of my home projects. For the time being, I will most likely move these workloads to my internal cluster.

    I was able to shut down the temporary server, meaning, at least in my mind, I am back to where I was. However, now that I have things on the hypervisor itself, my next step is to ensure I am appropriately backing things up. I will finish this series with a post on my backup configuration.

  • Installing Minio on a Synology Diskstation with Nginx SSL

    In an effort to get rid of a virtual machine on my hypervisor, I wanted to move my Minio instance to my Synology. Keeping the storage interface close to the storage container helps with latency and is, well, one less thing I have to worry about in my home lab.

    There are a few guides out there for installing Minio on a Synology. Jaroensak Yodkantha walks you through the full process of setting up the Synology and Minio using a docker command line. The folks over at BackupAssist show you how to configure Minio through the Diskstation Manager web portal. I used the BackupAssist article to get myself started, but found myself tweaking the setup because I want to have SSL communication available through my Nginx reverse proxy.

    The Basics

    Prep Work

I went into the Shared Folder section of the DSM control panel and created a new shared folder called minio. The settings on this share are pretty much up to you, but I did this so that all of my Minio data was in a known location.

Within the minio folder, I created a data folder and a blank text file called minio. Inside the minio file, I set up my Minio configuration:

    # MINIO_ROOT_USER and MINIO_ROOT_PASSWORD sets the root account for the MinIO server.
    # This user has unrestricted permissions to perform S3 and administrative API operations on any resource in the deployment.
    # Omit to use the default values 'minioadmin:minioadmin'.
    # MinIO recommends setting non-default values as a best practice, regardless of environment
    
    MINIO_ROOT_USER=myadmin
    MINIO_ROOT_PASSWORD=myadminpassword
    
    # MINIO_VOLUMES sets the storage volume or path to use for the MinIO server.
    
    MINIO_VOLUMES="/mnt/data"
    
    # MINIO_SERVER_URL sets the hostname of the local machine for use with the MinIO Server
    # MinIO assumes your network control plane can correctly resolve this hostname to the local machine
    
    # Uncomment the following line and replace the value with the correct hostname for the local machine.
    
    MINIO_SERVER_URL="https://s3.mattsdatacenter.net"
    MINIO_BROWSER_REDIRECT_URL="https://storage.mattsdatacenter.net"

    It is worth noting the URLs: I want to put this system behind my Nginx reverse proxy and let it do SSL termination, and in order to do that, I found it easiest to use two domains: one for the API and one for the Console. I will get into more details on that later.

    Also, as always, change your admin username and password!

    Setup the Container

    Following the BackupAssist article, I installed the Docker package on to my Synology and opened it up. From the Registry menu, I searched for minio and found the minio/minio image:

Click on the row to highlight it, and click on the Download button. You will be prompted for the label to download; I chose latest. Once the image is downloaded (you can check the Image tab for progress), go to the Container tab and click Create. This will open the Create Wizard and get you started.

    • On the Image screen, select the minio/minio:latest image.
• On the Network screen, select the default bridge network. If you have a custom network configuration, you may have some work to do here.
• On the General Settings screen, you can name the container whatever you like. I enabled the auto-restart option to keep it running. On this screen, click on the Advanced Settings button:
      • In the Environment tab, change MINIO_CONFIG_ENV_FILE to /etc/config.env
      • In the Execution Command tab, change the execution command to minio server --console-address :9090
      • Click Save to close Advanced Settings
    • On the Port Settings screen, add the following mappings:
      • Local Port 39000 -> Container Port 9000 – Type TCP
      • Local Port 39090 -> Container Port 9090 – Type TCP
    • On the Volume Settings Screen, add the following mappings:
      • Click Add File, select the minio file created above, and set the mount path to /etc/config.env
      • Click Add Folder, select the data folder created above, and set the mount path to /mnt/data

At that point, you can view the Summary and then create the container. Once the container starts, you can access your Minio instance at http://<synology_ip_or_hostname>:39090 and log in with the credentials saved in your config file.
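If you prefer the command line (or just want to see what the wizard amounts to), the container it builds is roughly equivalent to the docker run below. The host paths are assumptions based on my share names; Synology shares typically live under /volume1.

docker run -d --name minio --restart unless-stopped \
  -p 39000:9000 -p 39090:9090 \
  -e MINIO_CONFIG_ENV_FILE=/etc/config.env \
  -v /volume1/minio/minio:/etc/config.env:ro \
  -v /volume1/minio/data:/mnt/data \
  minio/minio:latest server --console-address ":9090"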

    What Just Happened?

The above steps should have created a Docker container running Minio on your Synology. Minio has two separate ports: one for the API, and one for the Console. Per Minio’s documentation, the --console-address parameter is now required in the container’s execution command, and it sets the container port for the Console. In our case, we set it to 9090. The API port defaults to 9000.

However, I wanted to expose non-standard ports, so I mapped local ports 39000 and 39090 to container ports 9000 and 9090, respectively. That means traffic coming in on 39000 and 39090 gets routed to my Minio container on ports 9000 and 9090.

    Securing traffic with Nginx

    I like the ability to have SSL communication whenever possible, even if it is just within my home network. Most systems today default to expecting SSL, and sometimes it can be hard to find that switch to let them work with insecure connections.

    I was hoping to get the console and the API behind the same domain, but with SSL, that just isn’t in the cards. So, I chose s3.mattsdatacenter.net as the domain for the API, and storage.mattsdatacenter.net as the domain for the Console. No, those aren’t the real domain names.

    With that, I added the following sites to my Nginx configuration:

    storage.mattsdatacenter.net
      map $http_upgrade $connection_upgrade {
          default Upgrade;
          ''      close;
      }
    
      server {
          server_name storage.mattsdatacenter.net;
          client_max_body_size 0;
          ignore_invalid_headers off;
          location / {
              proxy_pass http://10.0.0.23:39090;
              proxy_set_header Host $host;
              proxy_set_header X-Real-IP $remote_addr;
              proxy_set_header X-Forwarded-proto $scheme;
              proxy_set_header X-Forwarded-port $server_port;
              proxy_set_header X-Forwarded-for $proxy_add_x_forwarded_for;
    
              proxy_set_header Upgrade $http_upgrade;
              proxy_set_header Connection $connection_upgrade;
    
              proxy_http_version 1.1;
              proxy_read_timeout 900s;
              proxy_buffering off;
          }
    
        listen 443 ssl; # managed by Certbot
        allow 10.0.0.0/24;
        deny all;
    
        ssl_certificate /etc/letsencrypt/live/mattsdatacenter.net/fullchain.pem; # managed by Certbot
        ssl_certificate_key /etc/letsencrypt/live/mattsdatacenter.net/privkey.pem; # managed by Certbot
    }
    s3.mattsdatacenter.net
      map $http_upgrade $connection_upgrade {
          default Upgrade;
          ''      close;
      }
    
      server {
          server_name s3.mattsdatacenter.net;
          client_max_body_size 0;
          ignore_invalid_headers off;
          location / {
              proxy_pass http://10.0.0.23:39000;
              proxy_set_header Host $host;
              proxy_set_header X-Real-IP $remote_addr;
              proxy_set_header X-Forwarded-proto $scheme;
              proxy_set_header X-Forwarded-port $server_port;
              proxy_set_header X-Forwarded-for $proxy_add_x_forwarded_for;
    
              proxy_set_header Upgrade $http_upgrade;
              proxy_set_header Connection $connection_upgrade;
    
              proxy_http_version 1.1;
              proxy_read_timeout 900s;
              proxy_buffering off;
          }
    
        listen 443 ssl; # managed by Certbot
        allow 10.0.0.0/24;
        deny all;
    
        ssl_certificate /etc/letsencrypt/live/mattsdatacenter.net/fullchain.pem; # managed by Certbot
        ssl_certificate_key /etc/letsencrypt/live/mattsdatacenter.net/privkey.pem; # managed by Certbot
    }

This configuration allows me to access the API and Console via domains, with SSL terminated at the proxy. Configuring Minio is pretty easy: set MINIO_BROWSER_REDIRECT_URL to the public URL of your Console (in my case, the domain that proxies to port 39090), and MINIO_SERVER_URL to the public URL of your API (the domain that proxies to port 39000).

    This configuration allows me to address Minio for S3 in two ways:

    1. Use https://s3.mattsdatacenter.net for secure connectivity through the reverse proxy.
    2. Use http://<synology_ip_or_hostname>:39000 for insecure connectivity directly to the instance.

    I have not had the opportunity to test the performance difference between option 1 and option 2, but it is nice to have both available. For now, I will most likely lean towards the SSL path until I notice degradation in connection quality or speed.
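For a quick sanity check of the proxied endpoint, the MinIO client works well. A sketch, using the placeholder admin credentials from the config file above:

mc alias set homelab https://s3.mattsdatacenter.net myadmin myadminpassword
mc mb homelab/test-bucket     # create a bucket through the proxy
mc ls homelab                 # list buckets to confirm connectivity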

And, with that, my Minio instance is now running on my Diskstation, which means fewer VMs to manage and back up on my hypervisor.

  • Home Lab – No More iSCSI, Prep and Planning

    This post is part of a short series on migrating my home hypervisor off of iSCSI.

    • Home Lab – No More iSCSI: Prep and Planning (this post)
    • Home Lab – No More iSCSI: Shutdown and Provisioning (coming soon)
    • Home Lab – No More iSCSI: Backup plans (coming soon)

    I realized today that my home lab setup, by technology standards, is old. Sure, my overall setup has gotten some incremental upgrades, including an SSD cache for the Synology, a new Unifi Security Gateway, and some other new accessories. The base Hyper-V server, however, has remained untouched, outside of the requisite updates.

    Why no upgrades? Well, first, it is my home lab. I am a software engineer by trade, and the lab is meant for me to experiment not with operating systems or network configurations, but with application development and deployment procedures, tools, and techniques. And for that, it has worked extremely well over the last five years.

    That said, my initial setup had some flaws, and I am seeing some stability issues that I would like to correct now, before I wake up one morning with nothing working. With that in mind, I have come up with a plan.

    Setup a Temporary Server

    I am quite sure you’re thinking to yourself “What do you mean, temporary server?” Sure, I could shut everything down, copy it off the server onto the Synology, and then re-install the OS. And while this is a home lab and supposedly “throw away,” there are some things running that I consider production. For example:

    1. Unifi Controller – I do not yet have the luxury of running a Unifi Dream Machine Pro, but it is on my wish list. In the meantime, I run an instance of the Unifi controller in my “production” cluster.
    2. Home Assistant – While I am still rocking an ISY994i as an Insteon interface, I moved most of my home automation to a Home Assistant instance in the cluster.
3. Node-RED – I have been using Node-RED with the Home Assistant palette to script my automations.
4. Windows Domain Controller – I am still rocking a Windows Domain at home. It is the easiest way to manage the Hypervisor, as I am using the “headless” version of Windows Hyper-V Server 2019.
    5. Mattgerega.com – Yup, this site runs in my cluster.

    Thankfully, my colleague Justin happened to have an old server lying around that he has not powered on in a while, and has graciously allowed me to borrow it so that I can transfer my production assets over and keep things going.

    We’re gonna change the way we run…

    My initial setup put the bulk of my storage on the Synology via iSCSI, so much so that I had to put an SSD cache in the Synology just to keep up. At the beginning, that made sense. I was running mostly Windows VMs, and my vital data was stored on the VM itself. I did not have a suitable backup plan, so having all that data on the Synology meant I had at least some drive redundancy.

Times have changed. My primary mechanism for running applications is via Kubernetes clusters. Those nodes typically contain no data at all, as I use an NFS provisioner and storage class to create persistent storage volumes via NFS on the Synology. And while I still have a few VMs with data on them that will need to be backed up, I really want to get away from iSCSI.

    The server I have, an old HP Proliant DL380 Gen8, has 8 2.5″ drive bays. My original impression was that I needed to buy SAS drives for it, but Justin said he has had luck running SATA SSDs in his.

    Requirements

    Even with a home lab move, it is always good to have some clear requirements.

    1. Upgrade my Hyper-V Server to 2019.
    2. No more iSCSI disks on the server: Rely on NFS and proper backup procedures.
    3. Fix my networking: I had originally teamed 4 of the 6 NIC ports on the server together. While I may still do that, I need to clean up that implementation, as I have learned a lot in the last few years.
4. Keep it simple.

Could I explore VMware or Proxmox? I could, but, frankly, I want to learn more about Kubernetes and how I can use it in application architecture to improve delivery speed and reliability. I do not really care what the virtualization technology is, as long as I can run Kubernetes. Additionally, I have a LOT of automation around building Hyper-V machines, and I do not want to rebuild it.

Since this is my home lab, I do not have a lot of time to burn on it, hence the KISS method. Switching virtualization stacks means more time converting images. Going from Hyper-V to Hyper-V means that, for production VMs, I can set up replication and just move them to the temp server and back again.

    Prior Proper Planning

    With my requirements set, I created a plan:

    • Configure the temporary server and get production servers moved. This includes consolidating “production” databases into a single DB server, which is a matter of moving one or two DBs.
    • Shut down all other VMs and copy them over to a fileshare on the Synology.
    • Fresh installation of Windows Server 2019 Hyper-V.
    • Add 2 1TB SSDs into the hypervisor in a RAID 1 array.
    • Replicate the VMs from the temporary server to the new hypervisor.
    • Copy the rest of the VMs to the new server and start them up.
• Create some backup procedures for data stored on the hypervisor (i.e., if it is on a VM’s drive, it needs to be put on the Synology somewhere).
    • Delete my iSCSI LUN from the Synology.

    So, what’s done?

    I am, quite literally, still on step one. I got the temporary server up and running with replication, and I am starting to move production images. Once my temporary production environment is running, I will get started on the new server. I will post some highlights of that process in the days to come.

  • Managing Hyper-V VM Startup Times with .Net Minimal APIs

    In a previous post, I had a to-do list that included managing my Hyper-V VMs so that they did not all start at once. I realized today that I never explained what I was able to do or post the code for my solution. So today, you get both.

    And, for the impatient among you, my repository for this API is on Github.

    Managing VMs with Powershell

    My plan of attack was something like this:

    • Organize Virtual Machines into “Startup Groups” which can be used to set Automatic Start Delays
    • Using the group and an offset within the group, calculate a start delay and set that value on the VM.

Powershell’s Hyper-V Module is a powerful and pretty easy way to interact with the Hyper-V services on a particular machine. The module itself had all of the functionality I needed to implement my plan, including the ability to modify the Notes of a VM. I am storing JSON in the Notes field to denote the start group and offset within the group. Powershell has the built-in JSON conversion necessary to make quick work of retrieving this data from the VM’s Notes field and converting it into an object.

    Creating the API

For the API, this seemed an appropriate time to try out the Minimal APIs in ASP.NET Core 6. Minimal APIs are Microsoft’s approach to building APIs fast, without all the boilerplate code that sometimes comes with .Net projects. This project only has three endpoints (and maybe some test/debug ones) and a few services, so it seemed a good candidate.

    Without getting into the details, I was pleased with the approach, although scaling this type of approach requires implementing some standards that, in the end, would have you re-designing the notion of Controllers as it exists in a typical API project. So, while it is great for small, agile APIs, if you expect your API to grow, stick with the Controller-structured APIs.

    Hosting the API

The server I am using as a Hyper-V hypervisor is running a version of Windows Hyper-V Server, which means it has a limited feature set that does not include Internet Information Services (IIS). Even if it did, I want to keep the hypervisor focused on running VMs. However, in order to manage the VMs, the easiest path is to put the API on the hypervisor.

    With that in mind, I went about configuring this API to run within a Windows Service. That allowed me to ensure the API was running through standard service management (instead of as a console application) but still avoid the need for a heavy IIS install.

    I installed the service using one of the methods described in How to: Install and uninstall Windows services.  However, for proper access, the service needs to run as a user with Powershell access and rights to modify the VMs.  

    I created a new domain user and granted it the ability to perform a service log on via local security policy.  See Enable Service Logon for details.

    Prepping the VMs

    The API does not, at the moment, pre-populate the Notes field with JSON settings. So I went through my VM List and added the following JSON snippet:

    {
        "startGroup": 0,
        "delayOffset": 0
    }

    I chose a startGroup value based on the VM’s importance (Domain Controllers first, then data servers, then Kubernetes nodes, etc), and then used the delayOffset to further stagger the start times.

    All this for an API call

Once each VM has the initialization data, I made a call to /vm/refreshdelay and voilà! Each VM’s AutomaticStartDelay gets set based on its startGroup and delayOffset.
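In other words, something along these lines. The route is the API’s; the host name, port, and HTTP verb here are assumptions, since they depend on how you host the service:

curl -X POST http://hyperv-host:5000/vm/refreshdelay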

    There is more to do (see my to-do list in my previous post for other next steps), but since I do not typically provision many machines, this one usually falls to a lower spot on the priority list. So, well, I apologize in advance if you do not see more work on this for another six months.

  • Pulling metrics from Home Assistant into Prometheus

I have set up an instance of Home Assistant as the easiest front end for interacting with my home automation setup. While I am using the Universal Devices ISY994 as the primary communication hub for my Insteon devices, Home Assistant provides a much nicer interface for my family, including a great mobile app for them to use the system.

With my foray into monitoring, I started looking around to see if I could get some device metrics from Home Assistant into my Grafana Mimir instance. It turns out there is a Prometheus integration built right into Home Assistant.

    Read the Manual

    Most of my blog posts are “how to” style: I find a problem that maybe I could not find an exact solution for online, and walk you through the steps. In this case, though, it was as simple as reading the configuration instructions for the Prometheus integration.

    ServiceMonitor?

    Well, almost that easy. I have been using ServiceMonitor resources within my clusters, rather than setting up explicit scrape configs. Generally, this is easier to manage, since I just install the Prometheus operator, and then create ServiceMonitor instances when I want Prometheus to scrape an endpoint.

The Home Assistant Prometheus endpoint requires a token, however, and I did not have the desire to dig into configuring a ServiceMonitor with an appropriate secret. For now, it is a to-do on my ever-growing list.
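For my future self, here is roughly what that would look like: a Secret holding a Home Assistant long-lived access token, and a ServiceMonitor that references it. The namespace, Service labels, and port name are assumptions about my deployment; only the /api/prometheus path and the bearerTokenSecret mechanism come from the Home Assistant and Prometheus Operator documentation. Depending on how your Prometheus selects ServiceMonitors, you may also need to add a matching label.

kubectl -n home-automation create secret generic hass-prometheus-token \
  --from-literal=token='<long-lived access token>'

cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: home-assistant
  namespace: home-automation
spec:
  selector:
    matchLabels:
      app: home-assistant      # assumes the Home Assistant Service carries this label
  endpoints:
    - port: http               # assumes the Service names its port "http"
      path: /api/prometheus
      bearerTokenSecret:
        name: hass-prometheus-token
        key: token
EOF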

    What can I do now?

    This integration has opened up a LOT of new alerts on my end. Home Assistant talks to many of the devices in my home, including lights and garage doors. This means I can write alerts for when lights go on or off, when the garage door goes up or down, and, probably the best, when devices are reporting low battery.

The first alert I wrote notifies me when my Ring Doorbell battery drops below 30%. Couple that with my Prometheus Alerts module for Magic Mirror, and I now get a display when the battery needs to be changed.
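The rule itself is a plain Prometheus-style alerting rule (the same format the Mimir ruler accepts). This is a sketch: the metric and entity names are placeholders, since the real series names depend on what the integration exposes at /api/prometheus, and how you load the file depends on your ruler setup.

cat <<'EOF' > home-assistant-alerts.yml
groups:
  - name: home-assistant
    rules:
      - alert: DoorbellBatteryLow
        expr: homeassistant_sensor_battery_percent{entity="sensor.front_doorbell_battery"} < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: Doorbell battery is below 30%
EOF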

    What’s Next?

    I am giving back to the community. The Prometheus integration for Home Assistant does not currently report cover statuses. Covers are things like shades or, in my case, garage doors. Since I would like to be able to alert when the garage door is open, I am working on a pull request to add cover support to the Prometheus integration.

    It also means I would LOVE to get my hands on some automated shades/blinds… but that sounds really expensive.

  • Lessons in Managing my Kubernetes Cluster: Man Down!

    I had a bit of a panic this week as routine tasks took me down a rabbit hole in Kubernetes. The more I manage my home lab clusters, the more I realize I do not want to be responsible for bare metal clusters at work.

    It was a typical upgrade…

With ArgoCD in place, the contents of my clusters are very neatly defined in my various infrastructure repositories. I even have a small Powershell script that checks for the latest versions of the tools I have installed and updates my repositories with the latest and greatest.

I ran that script today and noticed a few minor updates to some of my charts, so I figured I would apply those at lunch. Typically, it is a very smooth process, and the updates are applied in a few minutes.

However, after having ArgoCD sync, I realized that the Helm charts I was upgrading were stuck in a “Progressing” state. I checked the pods, and they were in a perpetual “Pending” state.

My typical debugging found nothing: there were no events explaining why the pods could not be scheduled, and there were no logs, since the pods were not even being created.

    My first Google searches suggested a problem with persistent volumes/persistent volume claims. So I poked around in those, going so far as deleting them (after backing up the folders in my NFS target), but, well, no luck.

    And there it was…

Since the pods were “unschedulable,” I started trying to find the logs for the scheduler. I could not find them. So I logged in to my control plane node to see if the scheduler was even running… It was not.

    I checked the logs for that container, and nothing stood out. It just kind of “died.” I tried restarting the Docker container…. Nope. I even tried re-running the rke up command for that cluster, to see if Rancher Kubernetes Engine would restart it for me…. Nope.

    So, running out of options, I changed my cluster.yml configuration file to add a control plane role to another node in the cluster and re-ran rke up. And that, finally, worked. At that point, I removed the control plane role from the old control plane node and modified the DNS entries to point my hostnames to the new control plane node. With that, everything came back up.
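For context, the cluster.yml change amounts to giving a second node the controlplane role and reconciling. This is a rough sketch with placeholder addresses, usernames, and role layout, not my actual file:

# nodes:
#   - address: 10.0.0.51
#     user: ubuntu
#     role: [controlplane, etcd, worker]   # original node with the dead scheduler
#   - address: 10.0.0.52
#     user: ubuntu
#     role: [controlplane, etcd, worker]   # controlplane added to this node's existing roles
rke up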

    Monitoring?

    I wanted to write an alert in Mimir to catch this so I would know about it before I dug around. It was at this point I realized that I am not collecting any metrics from the Kubernetes components themselves. And, frankly, I am not sure how. RKE installs metrics-server by default, but I have not found a way to scrape metrics from Kubernetes components.

    My Google searches have been fruitless, and it has been a long work week, so this problem will most likely have to wait for a bit. If I come up with something I will update this post.

  • Hitting for the cycle…

    Well, I may not be hitting for the cycle, but I am certainly cycling Kubernetes nodes like it is my job. The recent OpenSSL security patches got me thinking that I need to cycle my cluster nodes.

    A Quick Primer

    In Kubernetes, a “node” is, well, a machine performing some bit of work. It could be a VM or a physical machine, but, at its core, it is a machine with a container runtime of some kind that runs Kubernetes workloads.

Using the Rancher Kubernetes Engine (RKE), each node can have one or more of the following roles:

    • worker – The node can host user workloads
    • etcd – The node is a member of the etcd storage cluster
    • controlplane – The node is a member of the control plane

    Cycle? What and why

    When I say “cycling” the nodes, I am actually referring to the process of provisioning a new node, adding the new node to the cluster, and removing an old node from the cluster. I use the term “cycling” because, when the process is complete, my cluster is “back where it started” in terms of resourcing, but with a fresh and updated node.

But, why cycle? Why not just run the necessary security patches on my existing nodes? In my view, even at the home lab level, there are a few reasons for this method.

    A Clean and Consistent Node

    As nodes get older, they incur the costs of age. They may have some old container images from previous workloads, or some cached copies of various system packages. In general, they collect stuff, and a fresh node has none of that cost. By provisioning new nodes, we can ensure that the latest provisioning is run and all the necessary updates are installed.

Using newly provisioned nodes each time also keeps me from applying special one-off configuration to nodes. If I need a particular configuration on a node, well, it has to be done in the provisioning scripts. All my cluster nodes are provisioned the same way, which makes them much more like cattle than pets.

    No Downtime or Capacity Loss

    Running node maintenance can potentially require a system reboot or service restarts, which can trigger downtime. In order to prevent downtime, it is recommended that nodes be “cordoned” (prevent new workload from being scheduled on that node) and “drained” (remove workloads from the node).

Kubernetes is very good at scheduling and moving workloads between nodes, and there is built-in functionality for cordoning and draining nodes. However, to prevent downtime, I still have to pull a node out for maintenance, which means I lose some cluster capacity while it is being worked on.
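For reference, cordoning and draining are one-liners with kubectl; the node name below is a placeholder:

kubectl cordon k8s-worker-03                                              # stop new pods from landing here
kubectl drain k8s-worker-03 --ignore-daemonsets --delete-emptydir-data   # evict existing workloads
# ...do the maintenance, then let the node take work again:
kubectl uncordon k8s-worker-03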

    When you think about it, to do a zero-downtime update, your choices are:

    1. Take a node out of the cluster, upgrade it, then add it back
    2. Add a new node to the cluster (already upgraded) then remove the old node.

    So, applying the “cattle” mentality to our infrastructure, it is preferred to have disposable assets rather than precisely manicured assets.

RKE handles the removal side of this process for you: running rke up will remove any nodes that are no longer listed in your cluster.yml.

    To Do: Can I automate this?

    As of right now, this process is pretty manual, and goes something like this:

1. Provision a new node – This part is automated with Packer, but I really only like doing one at a time. I am already running 20+ VMs on my hypervisor, and I do not like the thought of spiking to 25+ just to cycle a node.
2. Remove nodes, one role at a time – I have found that RKE is most stable when you only remove one role at a time. What does that mean? If I have a node that is running all three roles, I need to remove the control plane role and run rke up, then remove the etcd role and run rke up again, then remove the node completely (see the sketch after this list). I have not had good luck simply removing a node with all three roles at once.
    3. Ingress Changes – I need to change the IPs on my cluster in two places:
      • In my external load balancer, which is a Raspberry Pi running nginx.
      • In my nginx-ingress installation on the cluster. This is done via my GitOps repository.
4. DNS Changes – I have DNS aliases set up for the control plane nodes so that I can swap them in and out easily without changing other configurations. When I cycle a control plane node, I need to update the DNS.
    5. Delete Old Node – I have a small Powershell script for this, but it is another step.
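For step 2, the sequence looks roughly like this; the cluster.yml edits are done by hand between each run:

# 1. Remove "controlplane" from the outgoing node's role list in cluster.yml, then:
rke up
# 2. Remove "etcd" from the same node's role list, then:
rke up
# 3. Delete the node's entry from cluster.yml entirely, then:
rke up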

    It would be wonderful if I could automate this into an Azure DevOps pipeline, but there are some problems with that approach.

1. Packer’s Hyper-V builder has to run on the host machine, which means I need to be able to execute the Packer build commands on my server. I am not about to put the Azure DevOps agent directly on my server.
    2. I have not found a good way to automate DNS changes, outside of using the Powershell module.
3. I need to automate the IP changes in the external proxy and the cluster ingress, both of which are possible but would require some research on my end.
    4. I would need to automate the RKE actions, specifically, adding new nodes and deleting roles/nodes from the cluster, and then running rke up as needed.

    None of the above is impossible, however, it would require some effort on my part to research some techniques and roll them up into a proper pipeline. For the time being, though, I have set my “age limit” for nodes at 60 days, and will continue the manual process. Perhaps, after a round or two, I will get frustrated enough to start the automation process.

  • What’s in a (Release) Name?

    With Rancher gone, one of my clusters was dedicated to running Argo and my standard cluster tools. Another cluster has now become home for a majority of the monitoring tools, including the Grafana/Loki/Mimir/Tempo stack. That second cluster was running a little hot in terms of memory and CPU. I had 6 machines running what 4-5 machines could be doing, so a consolidation effort was in order. The process went fairly smoothly, but a small hiccup threw me for a frustrating loop.

    Migrating Argo

    Having configured Argo to manage itself, I was hoping that a move to a new cluster would be relatively easy. I was not disappointed.

    For reference:

    • The ops cluster is the cluster that was running Rancher and is now just running Argo and some standard cluster tools.
• The internal cluster is the cluster that is running my monitoring stack and is the target for Argo in this migration.

    Migration Process

    • I provisioned two new nodes for my internal cluster, and added them as workers to the cluster using the rke command line tool.
    • To prevent problems, I updated Argo CD to the latest version and disabled auto-sync on all apps.
• Within my ops-argo repository, I changed the target cluster of my internal applications to https://kubernetes.default.svc. Why? Argo treats the cluster it is installed on a bit differently than external clusters. Since I was moving Argo onto my internal cluster, the references to my internal cluster needed to change.
• I exported my ArgoCD resources using the Argo CD CLI. I was not planning on using the import, however, because I wanted to see how this process would work without it.
    • I modified my external Nginx proxy to point my Argo address from the ops cluster to the internal cluster. This essentially locked everything out of the old Argo instance, but let it run on the cluster should I need to fall back.
    • Locally, I ran helm install argocd . -n argo --create-namespace from my ops-argo repository. This installed Argo on my internal cluster in the argo namespace. I grabbed the newly generated admin password and saved it in my password store.
    • Locally, I ran helm install argocd-apps . -n argo from my ops-argo repository. This installs the Argo Application and Project resources which serve as the “app of apps” to bootstrap the rest of the applications.
    • I re-added my nonproduction and production clusters to be managed by Argo using the argocd cluster add command. As the URLs for the cluster were the same as they were in the old cluster, the apps matched up nicely.
• I re-added the necessary labels to each cluster’s ArgoCD secret (a sketch follows this list). This allows the cluster generator to create applications for my external cluster tools. I detailed some of this in a previous article, and the ops-argo repository has the ApplicationSet definitions to help.
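Those last two steps boil down to something like the following. The kubectl context name and the label key/value are placeholders for whatever your ApplicationSet cluster generator matches on; the argocd.argoproj.io/secret-type=cluster label is how Argo CD marks its cluster secrets.

argocd cluster add prod-cluster-context --name production

# Find the secret Argo CD created for the cluster, then re-apply the generator labels
kubectl -n argo get secrets -l argocd.argoproj.io/secret-type=cluster
kubectl -n argo label secret <cluster-secret-name> environment=production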

    And that was it… Argo found the existing applications on the other clusters, and after some re-sync to fix some labels and annotations, I was ready to go. Well, almost…

    What happened to my metrics?

    Some of my metrics went missing. Specifically, many of those that were coming from the various exporters in my Prometheus install. When I looked at Mimir, the metrics just stopped right around the time of my Argo upgrade and move. I checked the local Prometheus install, and noticed that those targets were not part of the service discovery page.

I did not expect Argo’s upgrade to change much, so I did not take notice of the changes. I was left digging around in my Prometheus and ServiceMonitor instances to figure out why those targets were not showing up.

    Sadly, this took way longer than I anticipated. Why? There were a few reasons:

    1. I have a home improvement project running in parallel, which means I have almost no time to devote to researching these problems.
    2. I did not have Rancher! One of the things I did use Rancher for was to easily view the different resources in the cluster and compare. I finally remembered that I could use Lens for a GUI into my clusters, but sadly, it was a few days into the ordeal.
3. Everything else was working; I was just missing a few metrics. On my own personal severity scale, this was low. On the other hand, it drove my OCD crazy.

    Names are important

After a few days of hunting around, I realized that the ServiceMonitor’s matchLabels did not match the labels on the Service objects themselves. That was odd, because I had not changed anything in those charts; they are all part of the Bitnami kube-prometheus Helm chart that I am using.
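A quick way to compare the two sides, for anyone hunting a similar mismatch (namespace and resource names are placeholders):

kubectl -n monitoring get servicemonitor <name> -o jsonpath='{.spec.selector.matchLabels}{"\n"}'
kubectl -n monitoring get services --show-labels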

    As I poked around, I realized that I was using the releaseName property on the Argo Application spec. After some searching, I found the disclaimer on Argo’s website about using releaseName. As it turns out, this disclaimer describes exactly the issue that I was experiencing.

I spent about a minute seeing if I could fix it while keeping the releaseName properties, but quickly realized that the easiest path was to remove the releaseName property from the cluster tools that used it. That follows Argo’s own guidance and keeps my configuration files much cleaner.

    After removing that override, the ServiceMonitor resources were able to find their associated services, and showed up in Prometheus’ service discovery. With that, now my OCD has to deal with a gap in metrics…

    Missing Node Exporter Metrics