With all this talk of home lab cluster provisioning, you might be wondering if I am actually doing any software development at home. As a matter of fact, I am. Just because it is in support of my home lab provisioning does not mean it is not software development!
Keeping the Lab Tidy
One of the things that has bothered me about my home lab is DNS management. As I provision and remove Linux VMs, having appropriate DNS records makes them easy to find. It also makes for a tidier environment, since I have a list of my machines and their IPs in one place. I already have a small PowerShell module that uses the DnsServer module in Windows, but what I really wanted was an API that would allow me to manage my DNS.
Now, taking a cue from my Hyper-V wrapper, I created a small API that uses the DnsServer module to manage DNS entries. It was fairly easy, and it works quite well on my own machine, which has the DnsServer module available because I have the Remote Server Administration Tools (RSAT) installed.
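Under the hood, the API is mostly orchestrating DnsServer cmdlets. A minimal sketch of the core operations (not the actual service code; the server, zone, and record names are placeholders):

```powershell
# Minimal sketch of the DnsServer cmdlets the API wraps; all names are placeholders
Import-Module DnsServer

$dnsServer = "dc01.home.lab"   # DNS server (my domain controller)
$zone      = "home.lab"

# Create an A record for a freshly provisioned VM
Add-DnsServerResourceRecordA -ComputerName $dnsServer -ZoneName $zone `
    -Name "k8s-node-01" -IPv4Address "192.168.1.50" -CreatePtr

# List the A records in the zone - my machine inventory in one place
Get-DnsServerResourceRecord -ComputerName $dnsServer -ZoneName $zone -RRType A

# Remove the record when the VM is decommissioned
Remove-DnsServerResourceRecord -ComputerName $dnsServer -ZoneName $zone `
    -RRType A -Name "k8s-node-01" -Force
```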
Location, Location, Location
When I started looking at where I could host this service, I realized that I could not host it on my hypervisor as I did with the Hyper-V service. My server is running Windows Server 2019 Hyper-V edition, a stripped-down version of Windows Server meant solely for running hypervisors, which means I am unable to install the DNS Server role on it. Admittedly, I did not try installing RSAT on it, but I have a tendency to believe that would not work.
Since the DnsServer module is installed by default on my domain controller, I made the decision to host the DNS API on that server. I created an appropriate service account and installed the API as a Windows service. Just like the Hyper-V API, the Windows DNS API is available on GitHub.
Return to API Management
At this point, I have APIs hosted on a few different machines, plus the APIs hosted in my home lab clusters. This has forced me to revisit installing an API Management solution at home. Sure, no one else uses my lab, but that is not the point. Right now, I have a “service discovery” problem: where are my APIs, how do I call them, what is their authentication mechanism, and so on. This is part of what API Management can solve: a single place to locate and call my APIs. Over the next few weeks I may delve back into Gravitee.io in an effort to re-establish a proper API Management service.
Going Public, Going Github
While it may seem like I am “burying the lede,” I am going to start making an effort to go public with more of my code. Why? Well, I have a number of repositories that might be of use to some folks, even just as reference. Plus, it keeps me honest: going public with my code means I have to be good about my own security practices. Look for posts on migration updates as I get to them.
Going public will most likely mean going Github. Yes, I have some public repositories out in Bitbucket, but Github provides a bit more community and visibility for my work. I am sure I will still keep some repositories in Bitbucket, but for the projects that I want public feedback on, I will shift to Github.
I spent the better part of the weekend recovering from crashing my RKE clusters last Friday. This put me on a path towards researching new Kubernetes clusters and determining the best path forward for my home lab.
Intentionally Myopic
Let me be clear: this is a home lab, and its purpose is not to help me build bulletproof, corporate production-ready clusters. I also do not want to run Minikube on a box somewhere. So, when I approached my “research” (you will see later why I put that term in quotes), I wanted to make sure I did not get bogged down in the minutiae of different Kubernetes installs. I stuck with Rancher Kubernetes Engine (RKE1) for a long time because it was quick to stand up, relatively stable, and easy to manage.
So, when I started looking for alternatives, my first research was into whether Rancher had any updated offerings. And, with that, I found RKE2.
RKE2, aka RKE Government
I already feel safer knowing that RKE2’s alter ego is RKE Government. All joking aside, as I dug into RKE2, it seemed a good mix of RKE1, which I am used to, and K3s, a lightweight implementation of Kubernetes. The RKE2 documentation was, frankly, much more intuitive and easier to navigate than the RKE1 documentation. I am not sure if that is because the documentation is that much better or because RKE2 is that much easier to configure.
I could spend pages upon pages explaining the experiments I ran over the last few evenings, but the proof is in the pudding, as they say. My provisioning-projects repository has a new Powershell script (Create-Rke2Cluster.ps1) that outlines the steps needed to get a cluster configured. My work, then, came down to how I wanted to configure the cluster.
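I will not reproduce the whole script here, but the gist of what it automates is the standard RKE2 install flow on each node (a simplified sketch, not the script itself; hostnames and the token value are placeholders):

```bash
# --- on the first server node (as root) ---
curl -sfL https://get.rke2.io | sh -
systemctl enable rke2-server --now
# grab the join token the agents will need
cat /var/lib/rancher/rke2/server/node-token

# --- on each agent node (as root) ---
mkdir -p /etc/rancher/rke2
cat <<'EOF' > /etc/rancher/rke2/config.yaml
server: https://rke2-server-01.home.lab:9345
token: <node-token copied from the server>
EOF
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -
systemctl enable rke2-agent --now
```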
RKE1 Roles vs RKE2 Server/Agent
RKE1 had a notion of node roles which were divided into three categories:
controlplane – Nodes with this role host the Kubernetes APIs.
etcd – Nodes with this role host the etcd storage containers. There should be an odd number of them; three is a good minimum.
worker – Nodes with this role can run workloads within the cluster.
My RKE1 clusters typically have the following setup:
One node with controlplane, etcd, and worker roles.
Two nodes with etcd and worker roles.
If needed, additional nodes with just the worker role.
This seemed to work well: I had proper redundancy with etcd and enough workers to host all of my workloads. Sure, I only had one control plane, so if that node went down, well, the cluster would be in trouble. However, I usually did not have much problem with keeping the nodes running so I left it as it stood.
With RKE2, there is simply a notion of server and agent. Server nodes run etcd and the control plane components, while agents run only user-defined workloads. So, when I started planning my RKE2 clusters, I figured I would run one server and two agents. The lack of etcd redundancy would not have me losing sleep at night, but I really did not want to run three servers and then more agents for my workloads.
As I started down this road, I wondered how I would be able to cycle nodes. I asked the #rke2 channel on rancher-users.slack.com, and got an answer from Brad Davidson: I should always have at least 2 available servers, even when cycling. However, he did mention something that was not immediately apparent: the server can and will run user-defined workloads unless the appropriate taints have been applied. So, in that sense, an RKE2 server acts similarly to my “all roles” node, where it functions as a control plane, etcd, and worker node.
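For reference, keeping user workloads off the servers is just a config entry: per the RKE2 documentation, you apply the CriticalAddonsOnly taint in the server’s config.yaml.

```yaml
# /etc/rancher/rke2/config.yaml on a server node, if you do not want it running user workloads
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"
```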
The Verdict?
Once I saw a path forward with RKE2, I really have not looked back. I have put considerable time into my provisioning projects scripts, as well as creating a new API wrapper for Windows DNS management (post to follow).
“But Matt, you haven’t considered Kubernetes X or Y?”
I know. There are a number of flavors of Kubernetes that can run on your bare metal servers. I spent a lot of time and energy learning RKE1, and I have gotten very good at managing those clusters. RKE2 is familiar, with improvements in all the right places. I can see automating not only machine provisioning, but the entire process of node replacement. I would love nothing more than to come downstairs on a Monday morning and see newly provisioned cluster nodes humming away after my automated process ran.
So, yes, maybe I skipped a good portion of that “research” step, but I am ok with it. After all, it is my home lab: I am more interested in re-visiting Gravitee.io for API management and starting to put some real code services out in the world.
Home Lab – No More iSCSI: Backup plans (this post)
It is worth noting (and quite ironic) that I went through a fire drill last week when I crashed my RKE clusters. That event gave me a fresh look at which data is actually important to me.
How much redundancy do I need?
I have been relying primarily on the redundancy of the Synology for a bit too long. The volume can survive losing a disk and the Synology has been very stable, but that does not mean I should leave things as they are.
There are many layers of redundancy, and for a home lab, it is about making decisions as to how much you are willing to pay and what you are willing to lose.
No Copy, Onsite Copy, Offsite Copy
I prefer not to spend a ton of time thinking about all of this, so I created three “buckets” for data priority:
No Backup: Synology redundancy is sufficient. If I lose it, I lose it.
Onsite Copy: Create another copy of the data somewhere at home. For this, I am going to attach a USB enclosure with a 2TB disk to my Synology and set up USB Copy tasks in DiskStation Manager (DSM).
Offsite Copy: Ship the data offsite for safety. I have been using Backblaze B2 buckets and the DSM’s Cloud Sync for personal documents for years, but the time has come to scale up a bit.
It is worth noting that some things may be bucketed into both Onsite and Offsite, depending on how critical the data is. With the inventory I took over the last few weeks, I had some decisions to make.
Domain Controllers -> Onsite copy for sure. I am not yet sure if I want to add an Offsite copy, though: there is not enough in the domain that it could not be rebuilt quickly, and there are really only a handful of machines on it. It just makes managing Windows Servers much easier.
Kubernetes NFS Data -> I use nfs-subdir-external-provisioner to provide persistent storage for my Kubernetes clusters (see the sketch after this list). I will certainly do Onsite copies of this data, but for the most important pieces (such as this blog), I will also set up an offsite transfer.
SQL Server Data -> The SQL Server data is stored on an iSCSI LUN, but I configured regular backups to go to a file share on the Synology. From there, Onsite backups should be sufficient.
Personal Stuff -> I have a lot of personal data (photos, financial data, etc.) stored on the Synology. That data is already encrypted and sent to Backblaze, but I may add another layer of redundancy and do an Onsite copy of it as well.
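As promised above, the NFS provisioner itself is a single Helm install pointed at the Synology (the server name and export path below are placeholders):

```bash
helm repo add nfs-subdir-external-provisioner \
    https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner \
    nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
    --set nfs.server=synology.home.lab \
    --set nfs.path=/volume1/k8s
```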
Solutioning
Honestly, I thought this would be harder, but Synology’s DSM and available packages really made it easy.
VM Backups with Active Backup for Business: I installed Active Backup for Business, set up a connection to my Hyper-V server, and picked the machines I wanted to back up… It really was that simple. I should still test a recovery, but on a test VM.
Onsite Copies with USB Copy: I plugged an external HD into the Synology, which was immediately recognized, and a file share was created. I installed the USB Copy package and started configuring tasks. Basically, I can set up copy tasks to move data from the Synology to the USB drive as desired, with various settings such as incremental or versioned backups, triggers, and file filters.
SQL Backups: I had to refresh my memory on scheduling SQL backups in SQL Server. Once I had that done, I just made sure to write them to a share on the Synology (a sketch of the backup command follows this list). From there, USB Copy took care of the rest.
Offsite: As I mentioned, I have had Cloud Sync running to Backblaze B2 buckets for a while. All I did was expand my copying. Cloud Sync offers some of the same flexibility as USB Copy, but having well-structured file shares for your data makes it easier to select and push data as you want it.
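As for the backup command itself, whether you schedule it through SQL Server Agent or a scheduled task, it boils down to something like this (a sketch using the SqlServer PowerShell module; instance, database, and share names are placeholders):

```powershell
# Back up a database straight to a share on the Synology; all names are placeholders
Import-Module SqlServer

Backup-SqlDatabase -ServerInstance "sql01.home.lab" `
    -Database "BlogDb" `
    -BackupFile "\\synology\sqlbackups\BlogDb_$(Get-Date -Format yyyyMMdd).bak" `
    -CompressionOption On
```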
Results and What’s next
My home lab refresh took me about two weeks, albeit spread across a few evenings in that time span. What I am left with is a much more performant server. While I still store data on the Synology via NFS and iSCSI, it is only the smaller pieces that are less reliant on fast access. The VM disks live on an SSD RAID array on the server, which gives me added stability and less thrashing of the Synology and its SSD cache. Nowhere is this more evident than in my average daily SSD temperature, which has dropped 12°F over the last two weeks.
What’s next? I will be taking a look at alternatives to Rancher Kubernetes Engine. I am hoping to find something a bit more stable and secure to manage.
Yes… I crashed my RKE clusters in a big way yesterday evening, and I spent a lot of time getting them back. I learned a few things in the process, and may have gotten the kickstart I need to investigate new Kubernetes flavors.
It all started with an upgrade…
All I wanted to do was go from Kubernetes 1.24.8 to 1.24.9. It seemed a simple ask. I downloaded the new RKE command line tool (version 1.4.2), updated my cluster.yaml file, and ran rke up. The cluster upgraded without errors… but all the pods were in an error state. I detailed my findings in a GitHub issue, so I will not repeat them here. Thankfully, I was able to downgrade, and things started working.
Sometimes, when I face these types of situations, I’ll stand up a new cluster to test the upgrade/downgrade process. I figured that would be a good idea, so I kicked off a new cluster provisioning script.
Now, in recent upgrades, an upgrade of the node itself has sometimes been required to make the Kubernetes upgrade run smoothly. So, on my internal cluster, I attempted the upgrade to 1.24.9 again, and then upgraded all of my nodes with an apt update && apt upgrade -y. That seemed to work: the pods came back online, so I figured I would try production… This is where things went sideways.
First, I “flipped” the order, and I upgraded the nodes first. Not only did this put all of the pods in an error state, but the upgrade took me to Docker version 23, which RKE doesn’t support. So there was no way to run rke up, even to downgrade to another version. I was, well, up a creek, as they say.
I lucked out
Luckily, earlier in the day I had provisioned three machines and created a small non-production cluster to test the issue I was seeing in RKE. So I had an empty Kubernetes 1.24.9 cluster running. With Argo CD, I was able to “transfer” the workloads from production to non-production simply by changing the ApplicationSet/Application target. The only caveat was that I had to copy files around on my NFS share to get them in the correct place. I managed to get all of this done and register only 1 hour and 54 minutes of downtime, which, well, is not bad.
Cleaning Up
Now, the nodes for my new “production” cluster were named nonprod, and my OCD would never let that stand. So I provisioned three new nodes, created a new production cluster, and transferred workloads to it. Since I do not have auto-prune set, when I changed the ApplicationSet/Application cluster to the new one, the old applications stayed running. This allowed me to get things set up on the new cluster and then cut over on the reverse proxy with no downtime.
There was still the issue of the internal cluster. Sure, the pods were running, but on nodes with Docker 23, which is not supported. I had HOPED that I could provision a new set of nodes, add them to the cluster, and remove the old ones. I had no such luck.
The RKE command line tool will not work on nodes with Docker 23. So, using the nodes I had provisioned, I created yet another new cluster and went about the process of transferring my internal tools workloads to it.
This was marginally more difficult, because I had to manually install Nginx Ingress and Argo CD using Helm before I could cut over to the new Argo CD and let it manage the rest of the conversion. However, as all of my resources are declaratively defined in Git repositories, the move was much easier than reinstalling everything from scratch.
Lessons Learned
For me, RKE upgrades have been flaky the last few times. The best way to ensure success is to cycle new, fully upgraded nodes with Docker 20.10 into the cluster, remove the old ones, and then upgrade. Any other method and I have run into issues.
Also, I will NEVER EVER run apt upgrade on my nodes again. I clearly do not have my application packages pinned correctly, which means I run the risk of getting an unsupported version of Docker.
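For future me (and anyone else in the same boat), holding the Docker packages is a one-liner. A sketch, assuming the nodes installed docker-ce from Docker’s repository (adjust the package names if they use Ubuntu’s docker.io instead):

```bash
# Prevent apt upgrade from pulling in an unsupported Docker version
apt-mark hold docker-ce docker-ce-cli containerd.io
# verify the holds are in place
apt-mark showhold
```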
I am going to start investigating other Kubernetes flavors. I like the simplicity that RKE1 provides, but the response from the community is slow, if it comes at all. I may stand up a few small clusters just to see which ones make the most sense for the lab. I need something that is easy to keep updated, and RKE1 is not fitting that bill anymore.
Home Lab – No More iSCSI: Transfer, Shutdown, and Rebuild (this post)
Home Lab – No More iSCSI: Backup plans (coming soon)
Observations – Migrating Servers
The focus of my hobby time over the last few days has been moving production assets to the temporary server. Most of it is fairly vanilla, but I have a few observations worth noting.
I forgot how easy it is to replicate and fail over VMs with Hyper-V. Sure, I could have tried a live migration, but creating a replica, shutting down the machine, and failing over was painless (a sketch of the commands is below).
Do not forget to provision an external virtual switch on your Hyper-V servers. Yes, it sounds stupid, but I dove right into setting the temporary server up as a replication server, and upon trying to fail over, realized that the machine on the new server did not have a network connection.
I moved my Minio instance to the Synology: I originally had my Minio server running on an Ubuntu VM on my hypervisor, but decided moving the storage application closer to the storage medium was generally a good idea.
For my Kubernetes nodes, it was easier to provision new nodes on the temp server than it was to do a live migration or planned failover. I followed my normal process for provisioning new nodes and decommissioning old ones, and voilà, my production cluster is on the temporary server. I will simply reverse the process for the transfer back.
I am getting noticeably better performance on the temporary server, which has far less compute and RAM, but the VMs are on local disks. While the Synology has been rock solid, I think I have been throwing too much at it, and it can slow down from time to time.
Let me be clear: My network storage is by no means bad, and it will be utilized. But storing the primary vhdx files for my VMs on the hypervisor provides much better performance.
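As promised above, here is roughly what the replica/failover dance looks like in PowerShell (a sketch with placeholder names, not my exact commands):

```powershell
# On the original hypervisor: create the replica and kick off the initial copy
Enable-VMReplication -VMName "dc01" -ReplicaServerName "temp-hv01.home.lab" `
    -ReplicaServerPort 80 -AuthenticationType Kerberos
Start-VMInitialReplication -VMName "dc01"

# Planned failover: stop the VM on the original host and prepare it for failover
Stop-VM -VMName "dc01"
Start-VMFailover -VMName "dc01" -Prepared

# Then finish the failover on the replica (temporary) server and start the VM there
Start-VMFailover    -VMName "dc01" -ComputerName "temp-hv01.home.lab"
Complete-VMFailover -VMName "dc01" -ComputerName "temp-hv01.home.lab"
Start-VM            -VMName "dc01" -ComputerName "temp-hv01.home.lab"
```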
Shut It Down!
After successfully moving my production assets over to the temporary server, it was time to shut it down. I shut down the VMs that remained on the original hypervisor and attempted to copy the VMs to a network drive on the Synology. That was a giant mistake.
Those VM files already live on the Synology as part of an iSCSI volume. By trying to pull those files off of the iSCSI drive and copy them back to the Synology, I was basically doing a huge file copy (like, 600+ GB huge) without the systems really knowing it was a copy. As you can imagine, the performance was terrible.
I found a 600 GB SAS drive that I was able to plug into the old hypervisor, and I used that as a temporary location for the copy. Even with that change, the copy took a while (I think about 3 hours).
Upgrade and Install
I mounted my new SSDs (Samsung EVO 1TB) in some drive trays and plugged them into the server. A quick boot into the Smart Storage Administrator let me set up a new drive array. While I thought about just using RAID 0 and giving myself 2 TB of space, I went with the safe option and used RAID 1.
Having already configured the temporary server with Windows Server Hyper-V 2019, the process of doing it again was, well, pretty standard. I booted to the USB stick I created earlier for Hyper-V 2019 and went through the paces. My domain controller was still live (thanks, temporary server!), so I was able to add the machine to the domain and then perform all of the management via the Server Manager tool on my laptop.
Moving back in
I have the server back up with a nice new 1TB drive for my VMs. That’s a far cry from the 4 TB of storage I had allocated on the SAN target on the Synology, so I have to be more careful with my storage.
Now, if I set a Hyper-V disk to, say, 100 GB, Hyper-V does not actually provision a file that is 100 GB: the vhdx file grows over time. But that does not mean I should just mindlessly provision disk space on my VMs.
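In Hyper-V terms, that is a dynamically expanding disk; for example, something like this creates a 100 GB disk whose vhdx file starts out at only a few megabytes (the path is a placeholder):

```powershell
# Dynamically expanding disk: the file only grows toward 100 GB as the guest writes data
New-VHD -Path "D:\VMs\k8s-node-01.vhdx" -SizeBytes 100GB -Dynamic
```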
For my Kubernetes nodes, looking at my usage, 50GB is more than enough for those disks. All persistent storage for those workloads is handled by an NFS provisioner which configures shares on the Synology. As for the domain controllers, I am able to run with minimal storage because, well, it is a tiny domain.
The problem children are Minio and my SQL Server Databases. Minio I covered above, moving it to the Synology directly. SQL Server, however, is a different animal.
Why be you, when you can be new!
I already had my production SQL instance running on another server. Rather than move it around and then mess with storage, I felt the safer solution was to provision a new SQL Server instance and migrate my databases. I only have 4 databases on that server, so moving databases is not a monumental task.
A new server affords me two things:
Latest and greatest version of Windows and SQL server.
Minimal storage on the hypervisor disk itself. I provisioned only about 80 GB for the main virtual disk. This worked fine, except that I ran into a storage compatibility issue that needed a small workaround.
SMB 3.0, but only certain ones
My original intent was to create a virtual disk on a network share on the Synology, and mount that disk to the new SQL Server VM. That way, to the SQL Server, the storage is local, but the SQL data would be on the Synology.
Hyper-V did not like this. I was able to create a vhdx file on a share just fine, but when I tried to add it to a VM using Add-VMHardDiskDrive, I got the following error:
Remote SMB share does not support resiliency.
A quick Google search turned up this Spiceworks question, where the only answer suggests that the Synology SMB 3.0 implementation is Linux-based, whereas Hyper-V is looking for the Windows-based implementation, and some of the features it needs are missing on the Linux side.
While I am usually not one to take a single answer and call it fact, I also did not want to spend too much time getting into the nitty gritty. I knew it was a possibility that this was not going to work, and, in the interest of time, I went back to my old pal iSCSI. I provisioned a small iSCSI LUN (300 GB) and mounted it directly in the virtual machine. So now my SQL Server has a data drive that uses the Synology for storage.
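Mounting that LUN inside the guest is straightforward with the iSCSI cmdlets. A rough sketch (the portal address and drive letter are placeholders):

```powershell
# Inside the SQL Server VM: connect to the Synology iSCSI target and bring the disk online
Start-Service MSiSCSI
Set-Service MSiSCSI -StartupType Automatic

New-IscsiTargetPortal -TargetPortalAddress "synology.home.lab"
$target = Get-IscsiTarget | Select-Object -First 1
Connect-IscsiTarget -NodeAddress $target.NodeAddress -IsPersistent $true

# Initialize and format the new disk as the SQL data drive
Get-Disk | Where-Object PartitionStyle -eq 'RAW' |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -DriveLetter D -UseMaximumSize |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel "SQLData"
```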
And we’re back!
Moves like this provide an opportunity for consolidation, updates, and improvements, and I seized some of those opportunities:
I provisioned new Active Directory Domain controllers on updated operating systems, switched over, and deleted the old one.
I moved Minio to my Synology, and moved Hashicorp Vault to my Kubernetes cluster (using Minio as a storage backend). This removed 2 virtual machines from the hypervisor.
I provisioned a new SQL Server and migrated my production databases to it.
Compared to the rat’s nest of network configuration I had, the networking on the hypervisor is much simpler:
1 standard NIC with a static IP so that I can get in and out of the hypervisor itself.
1 teamed NIC with a static IP attached to the Hyper-V Virtual Switch.
For the moment, I did not bring back my “non-production” cluster. It was only running test/stage environments of some of my home projects. For the time being, I will most likely move these workloads to my internal cluster.
I was able to shut down the temporary server, meaning, at least in my mind, I am back to where I was. However, now that I have things on the hypervisor itself, my next step is to ensure I am appropriately backing things up. I will finish this series with a post on my backup configuration.
In an effort to get rid of a virtual machine on my hypervisor, I wanted to move my Minio instance to my Synology. Keeping the storage interface close to the storage container helps with latency and is, well, one less thing I have to worry about in my home lab.
There are a few guides out there for installing Minio on a Synology. Jaroensak Yodkantha walks you through the full process of setting up the Synology and Minio using a docker command line. The folks over at BackupAssist show you how to configure Minio through the Diskstation Manager web portal. I used the BackupAssist article to get myself started, but found myself tweaking the setup because I want to have SSL communication available through my Nginx reverse proxy.
The Basics
Prep Work
I went into the Shared Folder section of the DSM control panel and created a new shared folder called minio. The settings on this share are pretty much up to you, but I did this so that all of my Minio data would be in a known location.
Within the minio folder, I created a data folder and a blank text file called minio. Inside the minio file, I set up my Minio configuration:
# MINIO_ROOT_USER and MINIO_ROOT_PASSWORD sets the root account for the MinIO server.
# This user has unrestricted permissions to perform S3 and administrative API operations on any resource in the deployment.
# Omit to use the default values 'minioadmin:minioadmin'.
# MinIO recommends setting non-default values as a best practice, regardless of environment
MINIO_ROOT_USER=myadmin
MINIO_ROOT_PASSWORD=myadminpassword
# MINIO_VOLUMES sets the storage volume or path to use for the MinIO server.
MINIO_VOLUMES="/mnt/data"
# MINIO_SERVER_URL sets the hostname of the local machine for use with the MinIO Server
# MinIO assumes your network control plane can correctly resolve this hostname to the local machine
# Uncomment the following line and replace the value with the correct hostname for the local machine.
MINIO_SERVER_URL="https://s3.mattsdatacenter.net"
MINIO_BROWSER_REDIRECT_URL="https://storage.mattsdatacenter.net"
It is worth noting the URLs: I want to put this system behind my Nginx reverse proxy and let it do SSL termination, and in order to do that, I found it easiest to use two domains: one for the API and one for the Console. I will get into more details on that later.
Also, as always, change your admin username and password!
Setup the Container
Following the BackupAssist article, I installed the Docker package on to my Synology and opened it up. From the Registry menu, I searched for minio and found the minio/minio image:
Click on the row to highlight it, and click on the Download button. You will be prompted for the tag to download; I chose latest. Once the image is downloaded (you can check the Image tab for progress), go to the Container tab and click Create. This will open the Create Wizard and get you started.
On the Image screen, select the minio/minio:latest image.
On the Network screen, select the bridge network that is defaulted. If you have a custom network configuration, you may have some work here.
On the General Settings screen, you can name the container whatever you like. I enabled the auto-restart option to keep it running. On this screen, click on the Advanced Settings button.
In the Environment tab, change MINIO_CONFIG_ENV_FILE to /etc/config.env
In the Execution Command tab, change the execution command to minio server --console-address :9090
Click Save to close Advanced Settings
On the Port Settings screen, add the following mappings:
Local Port 39000 -> Container Port 9000 – Type TCP
Local Port 39090 -> Container Port 9090 – Type TCP
On the Volume Settings Screen, add the following mappings:
Click Add File, select the minio file created above, and set the mount path to /etc/config.env
Click Add Folder, select the data folder created above, and set the mount path to /mnt/data
At that point, you can view the Summary and then create the container. Once the container starts, you can access your Minio instance at http://<synology_ip_or_hostname>:39090 and log in with the credentials from your config file.
What Just Happened?
The above steps should have created a Docker container running Minio on your Synology. Minio uses two separate ports: one for the API, and one for the Console. Per Minio’s documentation, the --console-address parameter is now required in the container’s execution command, and it sets the container port for the console. In our case, we set it to 9090. The API port defaults to 9000.
However, I wanted to run on non-standard ports, so I mapped ports 39090 and 39000 to ports 9090 and 9000, respectively. That means traffic coming in on 39090 and 39000 gets routed to my Minio container on ports 9090 and 9000.
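If you prefer the command line over the DSM wizard, the container definition above boils down to roughly this docker run (the /volume1 paths are assumptions based on where the minio share was created):

```bash
docker run -d --name minio --restart always \
  -p 39000:9000 -p 39090:9090 \
  -v /volume1/minio/data:/mnt/data \
  -v /volume1/minio/minio:/etc/config.env \
  -e MINIO_CONFIG_ENV_FILE=/etc/config.env \
  minio/minio:latest server --console-address :9090
```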
Securing traffic with Nginx
I like the ability to have SSL communication whenever possible, even if it is just within my home network. Most systems today default to expecting SSL, and sometimes it can be hard to find that switch to let them work with insecure connections.
I was hoping to get the console and the API behind the same domain, but with SSL, that just isn’t in the cards. So, I chose s3.mattsdatacenter.net as the domain for the API, and storage.mattsdatacenter.net as the domain for the Console. No, those aren’t the real domain names.
With that, I added the following sites to my Nginx configuration:
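(A simplified version of those sites is below; the certificate paths and the Synology hostname are placeholders, and the console site needs the websocket upgrade headers for the Minio console to work.)

```nginx
# API: s3.mattsdatacenter.net -> Synology port 39000
server {
    listen 443 ssl;
    server_name s3.mattsdatacenter.net;

    ssl_certificate     /etc/nginx/certs/mattsdatacenter.net.crt;
    ssl_certificate_key /etc/nginx/certs/mattsdatacenter.net.key;

    location / {
        proxy_pass http://synology.home.lab:39000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        client_max_body_size 0;   # allow large object uploads
    }
}

# Console: storage.mattsdatacenter.net -> Synology port 39090
server {
    listen 443 ssl;
    server_name storage.mattsdatacenter.net;

    ssl_certificate     /etc/nginx/certs/mattsdatacenter.net.crt;
    ssl_certificate_key /etc/nginx/certs/mattsdatacenter.net.key;

    location / {
        proxy_pass http://synology.home.lab:39090;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```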
This configuration allows me to access the API and Console via those domains, with SSL terminated on the proxy. Configuring Minio for it is pretty easy: set MINIO_BROWSER_REDIRECT_URL to the URL of your console (in my case, the domain that proxies to port 39090), and MINIO_SERVER_URL to the URL of your API (the domain that proxies to port 39000).
This configuration allows me to address Minio for S3 in two ways:
Use https://s3.mattsdatacenter.net for secure connectivity through the reverse proxy.
Use http://<synology_ip_or_hostname>:39000 for insecure connectivity directly to the instance.
I have not had the opportunity to test the performance difference between option 1 and option 2, but it is nice to have both available. For now, I will most likely lean towards the SSL path until I notice degradation in connection quality or speed.
And, with that, my Minio instance is now running on my DiskStation, which means fewer VMs to manage and back up on my hypervisor.
This post is part of a short series on migrating my home hypervisor off of iSCSI.
Home Lab – No More iSCSI: Prep and Planning (this post)
Home Lab – No More iSCSI: Shutdown and Provisioning (coming soon)
Home Lab – No More iSCSI: Backup plans (coming soon)
I realized today that my home lab setup, by technology standards, is old. Sure, my overall setup has gotten some incremental upgrades, including an SSD cache for the Synology, a new Unifi Security Gateway, and some other new accessories. The base Hyper-V server, however, has remained untouched, outside of the requisite updates.
Why no upgrades? Well, first, it is my home lab. I am a software engineer by trade, and the lab is meant for me to experiment not with operating systems or network configurations, but with application development and deployment procedures, tools, and techniques. And for that, it has worked extremely well over the last five years.
That said, my initial setup had some flaws, and I am seeing some stability issues that I would like to correct now, before I wake up one morning with nothing working. With that in mind, I have come up with a plan.
Setup a Temporary Server
I am quite sure you’re thinking to yourself “What do you mean, temporary server?” Sure, I could shut everything down, copy it off the server onto the Synology, and then re-install the OS. And while this is a home lab and supposedly “throw away,” there are some things running that I consider production. For example:
Unifi Controller – I do not yet have the luxury of running a Unifi Dream Machine Pro, but it is on my wish list. In the meantime, I run an instance of the Unifi controller in my “production” cluster.
Home Assistant – While I am still rocking an ISY994i as an Insteon interface, I moved most of my home automation to a Home Assistant instance in the cluster.
Node-RED – I have been using Node-RED with the Home Assistant palette to script my automations.
Windows Domain Controller – I am still rocking a Windows Domain at home. It is the easiest way to manage the hypervisor, as I am using the “headless” version of Windows Hyper-V Server 2019.
Mattgerega.com – Yup, this site runs in my cluster.
Thankfully, my colleague Justin happened to have an old server lying around that he has not powered on in a while, and has graciously allowed me to borrow it so that I can transfer my production assets over and keep things going.
We’re gonna change the way we run…
My initial setup put the bulk of my storage on the Synology via iSCSI, so much so that I had to put an SSD cache in the Synology just to keep up. At the beginning, that made sense. I was running mostly Windows VMs, and my vital data was stored on the VM itself. I did not have a suitable backup plan, so having all that data on the Synology meant I had at least some drive redundancy.
Times have changed. My primary mechanism for running applications is now Kubernetes clusters. Those nodes typically contain no data at all, as I use an NFS provisioner and storage class to create persistent volumes via NFS on the Synology. And while I still have a few VMs with data on them that will need to be backed up, I really want to get away from iSCSI.
The server I have, an old HP Proliant DL380 Gen8, has 8 2.5″ drive bays. My original impression was that I needed to buy SAS drives for it, but Justin said he has had luck running SATA SSDs in his.
Requirements
Even with a home lab move, it is always good to have some clear requirements.
Upgrade my Hyper-V Server to 2019.
No more iSCSI disks on the server: Rely on NFS and proper backup procedures.
Fix my networking: I had originally teamed 4 of the 6 NIC ports on the server together. While I may still do that, I need to clean up that implementation, as I have learned a lot in the last few years.
Keep it simple.
Could I explore VMware or Proxmox? I could, but, frankly, I want to learn more about Kubernetes and how I can use it in application architecture to speed up delivery and reliability. I do not really care what the virtualization technology is, as long as I can run Kubernetes. Additionally, I have a LOT of automation around building Hyper-V machines, and I do not want to rebuild it.
Since this is my home lab, I do not have a lot of time to burn on it, hence the KISS method. Switching virtualization stacks means more time converting images. Going from Hyper-V to Hyper-V means that, for production VMs, I can set up replication and just move them to the temp server and back again.
Prior Proper Planning
With my requirements set, I created a plan:
Configure the temporary server and get production servers moved. This includes consolidating “production” databases into a single DB server, which is a matter of moving one or two DBs.
Shut down all other VMs and copy them over to a fileshare on the Synology.
Fresh installation of Windows Server 2019 Hyper-V.
Add 2 1TB SSDs into the hypervisor in a RAID 1 array.
Replicate the VMs from the temporary server to the new hypervisor.
Copy the rest of the VMs to the new server and start them up.
Create some backup procedures for data stored on the hypervisor (i.e., if it is on a VM’s drive, it needs to be copied to the Synology somewhere).
Delete my iSCSI LUN from the Synology.
So, what’s done?
I am, quite literally, still on step one. I got the temporary server up and running with replication, and I am starting to move production images. Once my temporary production environment is running, I will get started on the new server. I will post some highlights of that process in the days to come.
As I approach the five year anniversary of this blog, I got to wondering just what my post frequency is and how it might affect my overall readership. In my examination of that data, I learned an important lesson about content and traffic: if you build it, they will come.
My Posting Habits
Posts per Month, 2018-2022
As you can tell from the above graph, for the first, oh, three years, my posting was sporadic at best. In the first 36 months of existence, I posted a total of 16 times, averaging about 0.4 posts per month. Even worse, 9 of those posts occurred in just 3 months. That means a little over half of my posts were generated in what amounts to 8% of the total time period.
With some inspiration from my former CTO, Raghu Chakravarthi, I committed in June of 2021 to both steadying and increasing my writing. And the numbers show that initiative: in the 19 months from June 2021 to December 2022, I posted 54 times, bringing my average up to 2.8 posts per month. In addition to more posts, I started creating LinkedIn posts to go along with my blog posts, in the hopes of driving some traffic to the site.
People!!
I have the free version of Google Analytics hooked up to get an idea of traffic month over month. While I do not get a lot of history, I can say that, in recent months, my site traffic continues to grow, even if it is only by 40-50 unique users per month.
Most of this traffic, though, has been generated not through my LinkedIn posts, but through Google searches. My top channels in the last two months are, far and away, organic search:
Top Channels for mattgerega.com
This tells me that most people find my site through a Google search, not my LinkedIn posts.
Increasing Visibility
My experience with my home blog has inspired a change in how I approach the visibility of my architects at work. I manage a handful of architects, and there have been several asks from the team to create conduits for informing people about our current research and design work. In the past, we have tossed around ideas such as blog posts or a newsletter to get people interested. However, we never focused on the content.
If my home blogging has taught me nothing else, it is that, sometimes, content creation is a matter of quantity, not quality. Not everything I do is a Pulitzer prize-winning piece of journalism… Ok, NOTHING I write is of that level, but you get the idea: sometimes it is simply about putting the content out there and then identifying what interests people. When I switched my blogging over to focus less on perfection and more on information sharing, I was able to increase the amount of content I create. This, subsequently, allowed me to reach different people.
So, if it worked at home, why not try it at work? My team is in the middle of creating content based on their work. It does not have to be perfect, but we all have a goal to publish a blog post in our Confluence instance at least once every two weeks. Hopefully, through this content generation, and some promotion by yours truly, we can start to increase the visibility of the work our team does to those outside of the team.
My last few posts have centered around adding some code linting and analysis to C# projects. Most of this has been to identify some standards and best practices for my current position.
During this research, I came across SonarCloud, which is SonarQube’s hosted offering. SonarCloud is free for open source projects, and given the breadth of languages it supports, I have decided to start adding my open source projects to SonarCloud. This will allow some extra visibility into my open source code and provide me with a great sandbox for evaluating SonarQube for corporate use.
I added Sonar Analysis to a GitHub actions pipeline for my Hyper-V Info API. You can see the Sonar analysis on SonarCloud.io.
The great part?? All the code is public, including the GitHub Actions pipeline. So, feel free to poke around and see how I made it work!
In a previous post, I had a to-do list that included managing my Hyper-V VMs so that they did not all start at once. I realized today that I never explained what I was able to do or post the code for my solution. So today, you get both.
And, for the impatient among you, my repository for this API is on GitHub.
Managing VMs with Powershell
My plan of attack was something like this:
Organize Virtual Machines into “Startup Groups” which can be used to set Automatic Start Delays
Using the group and an offset within the group, calculate a start delay and set that value on the VM.
PowerShell’s Hyper-V module is a powerful and pretty easy way to interact with the Hyper-V services on a particular machine. The module itself had all of the functionality I needed to implement my plan, including the ability to modify the Notes of a VM. I am storing JSON in the Notes field to denote the start group and offset within the group. PowerShell has the built-in JSON conversion necessary to make quick work of retrieving this data from the VM’s Notes field and converting it into an object.
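Stripped of the API plumbing, the core of it looks something like this (the delay math here is illustrative, not the exact formula the service uses):

```powershell
# Read the JSON stashed in each VM's Notes field and compute a staggered start delay
Get-VM | Where-Object { $_.Notes } | ForEach-Object {
    $settings = $_.Notes | ConvertFrom-Json

    # Illustrative: two minutes between groups, 15 seconds between VMs within a group
    $delaySeconds = ($settings.startGroup * 120) + ($settings.delayOffset * 15)

    Set-VM -VM $_ -AutomaticStartDelay $delaySeconds
}
```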
Creating the API
For the API, this seemed an appropriate time to try out the Minimal APIs in ASP.NET Core 6. Minimal APIs are Microsoft’s approach to building APIs fast, without all the boilerplate code that sometimes comes with .NET projects. This project only has three endpoints (and maybe some test/debug ones) and a few services, so it seemed a good candidate.
Without getting into the details, I was pleased with the approach, although scaling this type of approach requires implementing some standards that, in the end, would have you re-designing the notion of Controllers as it exists in a typical API project. So, while it is great for small, agile APIs, if you expect your API to grow, stick with the Controller-structured APIs.
Hosting the API
The server I am using as a Hyper-V hypervisor is running a version of Windows Hyper-V Server, which means it has a limited feature set that does not include Internet Information Services (IIS). Even if it did, I want to keep the hypervisor focused on running VMs. However, in order to manage the VMs, the easiest path is to put the API on the hypervisor.
With that in mind, I went about configuring this API to run within a Windows Service. That allowed me to ensure the API was running through standard service management (instead of as a console application) but still avoid the need for a heavy IIS install.
I installed the service using one of the methods described in How to: Install and uninstall Windows services. However, for proper access, the service needs to run as a user with Powershell access and rights to modify the VMs.
I created a new domain user and granted it the ability to log on as a service via local security policy. See Enable Service Logon for details.
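For reference, registering the service itself is only a couple of commands (the service name, path, and account are placeholders):

```powershell
# Register the published API executable as a Windows service running under the dedicated account
$credential = Get-Credential -UserName "HOMELAB\svc-hyperv-api" -Message "Service account password"

New-Service -Name "HyperVInfoApi" `
    -BinaryPathName "C:\Services\HyperVInfoApi\HyperVInfoApi.exe" `
    -DisplayName "Hyper-V Info API" `
    -StartupType Automatic `
    -Credential $credential

Start-Service -Name "HyperVInfoApi"
```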
Prepping the VMs
The API does not, at the moment, pre-populate the Notes field with JSON settings. So I went through my VM List and added the following JSON snippet:
{"startGroup": 0,"delayOffset": 0}
I chose a startGroup value based on the VM’s importance (Domain Controllers first, then data servers, then Kubernetes nodes, etc), and then used the delayOffset to further stagger the start times.
All this for an API call
Once each VM has the initialization data, I made a call to /vm/refreshdelay and voilà! Each VM’s AutomaticStartDelay gets set based on its startGroup and delayOffset.
There is more to do (see my to-do list in my previous post for other next steps), but since I do not typically provision many machines, this one usually falls to a lower spot on the priority list. So, well, I apologize in advance if you do not see more work on this for another six months.