Category: Home Lab

  • Tech Tip – Interacting with ETCD in Rancher Kubernetes Engine 2

    Since cycling my cluster nodes is a “fire script and wait” operation, I kicked one off today. I ended up running into an issue that required me to dig a bit into ETCD in RKE2, and could not find direct help, so this is as much my own reference as it is a guide for others.

    I broke it…

    When provisioning new machines, I still have some odd behaviors when it comes to IP address assignment. I do not set the IP address manually: I use a static MAC address on the VM and then create a fixed IP reservation for that MAC address. About 90% of the time, that works great. Every so often, though, during provisioning, the VM picks up an IP address from the DHCP pool instead of the fixed IP, and that wrecks stuff, especially around ETCD.

    This happened today: In standing up a replacement, the new machine picked up a DHCP IP. Unfortunately, I didn’t remove the machine properly, which caused my ETCD cluster to still see the node as a member. When I deleted the node and tried to re-provision, I got ETCD errors because I was trying to add a node name that already exists.

    Getting into ETCD

    RKE2’s docs are a little quiet on actually viewing what’s in ETCD. Through some googling, I figured out that I could use etcdctl to show and manipulate members, but I couldn’t figure out how to actually run the command.

    As it turns out, the easiest way is to run it on one of the ETCD pods itself. I came across this bug report in RKE2 that indirectly showed me how to run etcdctl commands from my machine through the ETCD pods. The member list command is

    kubectl -n kube-system exec <etcd_pod_name> -- sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member list"

    Note all the credential setting via environment variables. In theory, I could “jump in” to the etcd pod using a simple sh command and run a session, but keeping it like this forces me to be judicious in my execution of etcdctl commands.

    I found the offending entry and removed it from the list, and was able to run my cycle script again and complete my updates.
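
    The removal itself is the same one-liner pattern: swap member list for member remove and pass the hex ID that member list prints for the stale node (placeholder shown, same certificate environment variables as above):

    kubectl -n kube-system exec <etcd_pod_name> -- sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member remove <member_id>"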

  • What’s in a home lab?

    A colleague asked today about my home lab configuration, and I came to the realization that I have never published a good inventory of the different software and hardware that I run as part of my home lab / home automation setup. While I have documented bits and pieces, I never pushed a full update. I will do my best to hit the highlights without boring everyone.

    Hardware

    I have a small cabinet in my basement mechanical room which contains the majority of my hardware, with some other devices sprinkled around.

    This is all a good mix of new and used stuff: eBay was a big help. Most of it was procured over several years, including a number of partial updates to the NAS disks.

    • NAS – Synology Diskstation 1517+. This is the 5-bay model. I added the M2D18 expansion card, and I currently have 5 x 4TB WD Red Drives and 2 x 1GB WD SSDs for cache. Total storage in my configuration is 14TB.
    • Server – HP ProLiant DL380p Gen8. Two Xeon E5-2660 processors, 288 GB of RAM, and two separate RAID arrays. The system array is 136GB, while the storage array is 1TB.
    • Network
      • HP ProCurve Switch 2810-24G – A 24-port gigabit switch that serves most of my switching needs.
      • Unifi Security Gateway – Handles all of my incoming/outgoing traffic through the modem and provides most of my high-level network capabilities.
      • Unifi Access Points – Three in total: two UAP-AC-LR models and one UAP-AC-M outdoor model.
      • Motorola Modem – I did not need the features of the Comcast/Xfinity modem, nor did I want to lease it, so I bought a compatible modem.
    • Miscellaneous Items
      • BananaPi M5 – Runs Nginx as a reverse proxy into my network.
      • RaspberryPi 4B+ – Runs Home Assistant. This was a recent move, documented pretty heavily in a series of posts that starts here.
      • RaspberryPi Model B – That’s right, an O.G. Pi that runs my monitoring scripts to check for system status and reports to statuspage.io.
      • RaspberryPi 4B+ – Mounted behind the television in my office, this one runs a copy of MagicMirror to give me some important information at a glance.
      • RaspberryPi 3B+ – Currently dormant.

    Software

    This one is lengthy, so I broke it down into what I hope are logical and manageable categories.

    The server is running Windows Hyper-V Server 2019. Everything else, unless noted, is running on a VM on that server.

    Server VMs

    • Domain Controllers – Two Windows domain controllers (primary and secondary).
    • SQL Servers – Two SQL servers (non-production and production). It’s a home lab, so the express editions suffice.

    Kubernetes

    My activities around Kubernetes are probably the most well-documented of the bunch, but, to be complete: Three RKE2 Kubernetes clusters. Two three-node clusters and one four-node cluster to run internal, non-production, and production workloads. The nodes are Ubuntu 22.04 images with RKE2 installed.

    Management and Monitoring Tools

    For some management and observability into this system, I have a few different software suites running.

    • Unifi Controller – This makes management of the USG and Access points much easier. It is currently running in the production cluster using the jacobalberty image.
    • ArgoCD – Argo is my current GitOps operator and is used to make sure what I want deployed on my clusters is out there.
    • LGTM Stack – I have instances of Loki, Grafana, Tempo, and Mimir running in my internal cluster, acting as the central target for metrics, logs, and traces.
    • Grafana Agent – For my VMs and other hardware that supports it, I installed the Grafana Agent and configured it to report metrics and logs to Mimir and Loki.
    • Hashicorp Vault – I am running an instance of Hashicorp Vault in my clusters to provide secret management, using the External Secrets operator to provide cached secret management in Kubernetes.
    • Minio – In order to provide local storage with an S3-compatible API, I’m running Minio as a Docker container directly on the Synology.

    Cluster Tools

    Using Application Sets and the Cluster generator, I configured a number of “cluster tools” which allow me to install different tools to clusters using labels and annotations on the Argo cluster Secret resource.
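
    As a rough sketch of how the opt-in works, enabling a tool for a cluster is just a label on that cluster’s Secret in Argo’s namespace. The secret name, namespace, and label key below are made up for illustration; the real keys depend on how the Cluster generator’s selectors are written:

    kubectl -n argocd label secret cluster-internal-01 cluster-tools/kube-prometheus=enabled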

    This allows me to install multiple tools using the same configuration, which improves consistency. The following are configured for each cluster.

    • kube-prometheus – I use Bitnami’s kube-prometheus Helm chart to install an instance of Prometheus on each cluster. They are configured to remote-write to Mimir.
    • promtail – I use the promtail Helm chart to install an instance of Promtail on each cluster. They are configured to push logs to Loki.
    • External Secrets – The External Secrets operator helps bootstrap connection to a variety of external vaults and creates Kubernetes Secret resources from the ExternalSecret / ExternalClusterSecret custom resources.
    • nfs-subdir-external-provisioner – For PersistentVolumes, I use the nfs-subdir-external-provisioner and configure it to point to dedicated NFS shares on the Synology NAS. Each cluster has its own folder, making it easy to back up through the various NAS tools.
    • cert-manager – While I currently have cert-manager installed as a cluster tool, if I remember correctly, this was for my testing of Linkerd, which I’ve since removed. Right now, my SSL traffic is offloaded at the reverse proxy. This has multiple benefits, not the least of which is that I was able to automate my certificate renewals in one place. Still, cert-manager is available but no certificate stores are currently configured.

    Development Tools

    It is a lab, after all.

    • ProGet – I am running the free version of ProGet for private NuGet and container image feeds. As I move to open source my projects, I may migrate to GitHub artifact storage, but for now, it is stored locally.
    • SonarQube Community – I am running an instance of SonarQube Community for quality control. However, as with ProGet, I have begun moving some of my open source projects to SonarCloud.io, so this instance may fall away.

    Custom Code

    I have a few projects, mostly small APIs that allow me to automate some of my tasks. My largest “project” is my instance of Identity Server, which I use primarily to lock down my other APIs.

    And of course…

    WordPress. This site runs in my production cluster, using the Bitnami chart, which includes the database.

    And there you go…

    So that is what makes up my home lab these days. As with most good labs, things are constantly changing, but hopefully this snapshot presents a high level picture into my lab.

  • Tech Tip – You should probably lock that up…

    I have been running into some odd issues with ArgoCD not updating some of my charts, despite the Git repository having an updated chart version. As it turns out, my configuration and my lack of Chart.lock files seem to have been contributing to this inconsistency.

    My GitOps Setup

    I have a few repositories that I use as source repositories for Argo. They contain a mix of my own resource definition files, which are raw manifest files, and external Helm charts. The external Helm charts are wrapped in an umbrella chart so that I can add supporting resources (like secrets). My Grafana chart is a great example of this.

    Prior to this, I was not including the Chart.lock file in the repository. This made it easier to update the version in the Chart.yaml file without having to run a helm dependency update to refresh the lock file. I have been running this setup for at least a year, and I never really noticed much of a problem until recently. There were a few times where things would not update, but nothing systemic.

    And then it got worse

    More recently, however, I noticed that the updates weren’t taking. I saw the issue with both the Loki and Grafana charts: The version was updated, but Argo was looking at the old version.

    I tried hard refreshes on the Applications in Argo, but nothing seemed to clear that cache. I poked around in the logs and noticed that Argo runs helm dependency build, not helm dependency update. That got me thinking “What’s the difference?”

    As it turns out, build operates using the Chart.lock file if it exists; otherwise it acts like update. update uses the Chart.yaml file to resolve and download the latest versions that satisfy its constraints (and rewrites Chart.lock in the process).

    Since I was not committing my Chart.lock file, it stands to reason that somewhere in Argo there is a cached copy of a Chart.lock file that was generated by helm dependency build. Even though my Chart.yaml was updated, Argo was using the old lock file.
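
    To make the distinction concrete, this is roughly how the two commands behave against one of my umbrella charts (the chart path here is illustrative):

    # Re-resolves versions from Chart.yaml, downloads them, and rewrites Chart.lock
    helm dependency update ./grafana

    # Reproduces exactly what Chart.lock pins; only falls back to update-style resolution if no lock file exists
    helm dependency build ./grafana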

    Testing my hypothesis

    I committed a lock file 😂! Seriously, I ran helm dependency update locally to generate a new lock file for my Loki installation and committed it to the repository. And, even though that’s the only file that changed, like magic, Loki determined it needed an update.
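
    For reference, the whole “fix” was just a couple of commands (the path and commit message are illustrative):

    helm dependency update ./charts/loki
    git add ./charts/loki/Chart.lock
    git commit -m "Pin Loki chart dependencies via Chart.lock"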

    So I need to lock it up. But why? Well, the lock file exists to ensure that subsequent builds use the exact versions you resolved, similar to npm and yarn. Just like those tools, Helm requires an explicit command to be run to update libraries or dependencies.

    By not committing my lock file, the possibility exists that I could get a different version than I intended or, even worse, get a spoofed version of my package. The lock file maintains a level of supply chain security.

    Now what?

    Step 1 is to commit the missing lock files.

    At both work and home I have Powershell scripts and pipelines that look for potential updates to external packages and create pull requests to get those updates applied. So step 2 is to alter those scripts to run helm dependency update when the Chart.yaml is updated, which will update the Chart.lock and alleviate the issue.

    I am also going to dig into ArgoCD a little bit to see where these generated Chart.lock values could be cached. In testing, the only way around it was to delete the entire ApplicationSet, so I’m thinking that the ApplicationSet controller may be hiding some data.

  • Rollback saved my blog!

    As I was upgrading WordPress from 6.2.2 to 6.3.0, I ran into a spot of trouble. Thankfully, ArgoCD rollback was there to save me.

    It’s a minor upgrade…

    I use the Bitnami WordPress chart as the template source for Argo to deploy my blog to one of my Kubernetes clusters. Usually, an upgrade is literally 1, 2, 3:

    1. Get the latest chart version for the WordPress Bitnami chart. I have a Powershell script for that.
    2. Commit the change to my ops repo.
    3. Go into ArgoCD and hit Sync

    That last one caused some problems. Everything seemed to synchronize, but the WordPress pod stopped at the “connect to database” step. I tried restarting the pod, but nothing.

    Now, the old pod was still running. So, rather than mess with it, I used Argo’s rollback functionality to roll the WordPress application back to its previous commit.

    What happened?

    I’m not sure. You are able to upgrade WordPress from the admin panel, but that comes at a potential cost: if you upgrade the database as part of the WordPress upgrade and then “lose” the pod, you lose the application upgrade but not the database upgrade, and you are left in a weird state.

    So, first, I took a backup. Then, I started poking around in trying to run an upgrade. That’s when I ran into this error:

    Unknown command "FLUSHDB"

    I use the WordPress Redis Object Cache to get that little “spring” in my step. It seemed to be failing on the FLUSHDB command. At that point, I was stuck in a state where the application code was upgraded but the database was not. So I restarted the deployment and got back to 6.2.2 for both application code and database.

    Disabling the Redis Cache

    I tried to disable the Redis plugin, and got the same FLUSHDB error. As it turns out, the default Bitnami Redis chart disables these commands, but it would seem that the WordPress plugin still wants them.

    So, I enabled the commands in my Redis instance (a quick change in the values files) and then disabled the Redis Cache plugin. After that, I was able to upgrade to WordPress 6.3 through the UI.
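
    For anyone hitting the same wall, that values change amounts to clearing the chart’s list of disabled commands. Imperatively, it would look something like this (the release name is made up; master.disableCommands is the values key as I understand the Bitnami chart, with a matching replica key; --set-json needs a reasonably recent Helm):

    helm upgrade redis bitnami/redis --reuse-values --set-json 'master.disableCommands=[]'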

    From THERE, I clicked Sync in ArgoCD, which brought my application pods up to 6.3 to match my database. Then I re-enabled the Redis Plugin.

    Some research ahead

    I am going to check with the maintainers of the Redis Object Cache plugin. If the plugin relies on commands that the Bitnami chart disables by default, that most likely caused the issues in my WordPress upgrade.

    For now, however, I can sleep under the warm blanket of Argo rollbacks!

  • When browsers have a mind of their own…

    I recently implemented Cloudflare on my domains to give them some protection. There were some unintended side effects, and it seems to have to do with my browser’s DNS.

    Internal Site Management

    All of my mattgerega.com and mattgerega.org sites are available outside of my network, so putting Cloudflare in front as a WAF was not a problem. However, I have a few sites hosted on mattgerega.net which, through security in the reverse proxy, are not accessible outside of my home network.

    My internal DNS has an entry that points the mattgerega.net sites directly at my reverse proxy, which should bypass the WAF. Non-browser traffic, like a simple curl command, works just fine. However, in the browser, I was getting 403 errors most of the time. Oddly, incognito/private modes sometimes fixed it.

    My Browser has a DNS Cache

    This is where I learned something new: modern browsers have their own DNS cache. Want to see yours? In Chrome, navigate to chrome://net-internals/#dns. There is a lookup function as well as an ability to clear the DNS cache.

    Using the lookup function in the browser, mattgerega.net resolved to some Cloudflare IPs. Even though my machine is using my internal DNS, the browser has other ideas.
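
    A quick way to see the split personality from a terminal is to compare the system resolver with a public one (which returns the proxied Cloudflare records):

    nslookup mattgerega.net
    nslookup mattgerega.net 1.1.1.1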

    No Proxy For You!

    This was all, well, rather annoying. Because mattgerega.net was being proxied, the browser DNS seemed to prefer the Cloudflare DNS over my local one, perhaps because it is secure, even though I had turned off the requirement for secure DNS in the browser.

    The solution was to stop proxying mattgerega.net in Cloudflare. This allowed my local DNS entries to take over, and my access was fixed without me having to open up to other IP addresses.

    Now I just have to do better about making the mattgerega.net sites internal-only.

  • Literally every time I go away…

    It really truly seems like every time I go away, something funny happens in the lab and stuff goes down. That pattern continued last week.

    Loads of Travel

    A previously planned Jamaican getaway butted up against a short-notice business meeting in Austin. Since I could not get a direct flight to either location, I ended up on 8 flights in 8 days.

    The first four were, for lack of a more descriptive word, amazing! My wife and I took a short trip to Couples Swept Away in Negril for a short holiday. While I knew it was an all-inclusive, I did not realize that included scuba diving. I took advantage of that and dove five times in three days. Thankfully, I was always back by noon for a few much needed beach naps.

    The Saturday that we were gone, I got a few text messages from myself indicating that my websites were down. Since I got the messages, well, that meant I had power to the Raspberry Pi that runs the monitoring software and my internet was up. I tried logging in, but nothing that was attached to the NAS was coming up… So the NAS was down.

    A quick troubleshooting session

    Since there was not much I could do from the beach, I put it away until I got home Tuesday night. Now, mind you, I had a flight out early Wednesday morning for Austin, but I really could not sleep until I figured out what was wrong.

    As it turns out, we must have had a power brownout/blackout. The NAS went down and did not reboot. I ended up having to unplug it to force a restart.

    That was, well, the extent of my troubleshooting: once the NAS came back up, Kubernetes took over and made sure everything was running again.

    Move to the cloud?

    My home lab is just a lab… When stuff goes down, well, no one should care but me. And that is true for everything except this site. I am torn between spending some money to host this site externally, or fortifying my current setup with some additional redundant systems.

    What kinds of fortifications, you might ask… Well, a proper UPS would be nice, avoiding the hard shutdowns that could wreck my systems. But, there’s no way I am going to pay for redundant internet to my house, so I will always be tied to that.

    For now, I am ok with hosting as I am today and having to live with some downtime. March was a rough month with a few days of outage, but I generally hit 98-99% uptime, based on the data from my StatusPage site. For a home lab, I would say that is pretty good.

  • Speed. I.. am.. Speed.

    I have heard the opening to the Cars movie more times than I can count, and Owen Wilson’s little monologue always sticks in my head. Tangentially, well, I recently logged in to Xfinity to check my data usage, which sent me down a path towards tracking data usage inside of my network. I learned a lot about what to do, and what not to do.

    We are using how much data?

    Our internet went out on Sunday. This is not a normal occurrence, so I turned off the wifi on my phone and logged in to Xfinity’s website to report the problem. Out of sheer curiosity, after reporting the downtime, I clicked on the link to see our usage.

    22 TB… That’s right, 22 terabytes of data in February, and approaching 30TB for March. And there were still 12 days left in March! Clearly, something was going on.

    Asking the Unifi Controller

    I logged in to the Unifi controller software in the hopes of identifying the source of this traffic. I jumped into the security insights and traffic identification screens and looked at the data for the last month. Not one client showed more than 25 GB of traffic in the last month. That does not match what Xfinity is showing at all.

    A quick Google search led me to a few posts suggesting that Unifi’s automated speed test can inflate your data usage without showing up in the Unifi stats. Mind you, these posts were 4+ years old, but I figured it was worth a shot. So I disabled the speed test in the Unifi controller, but I would have to wait a day to see if the Xfinity numbers changed.

    Fast forward a day – No change. According to Xfinity I was using something like 500GB of data per day, which is nonsense. My previous months were never higher than 2TB, so using that much data in 4 days means something is wrong.

    Am I being hacked?

    Thanks to “security first” being beaten into me by some of my previous security-focused peers, the first thought in my head was “Am I being hacked?” I looked through logs on the various machines and clusters, trying to find where this data was coming from and why. But nothing looked odd or out of the ordinary: no extra pods running calculations, no servers consuming huge amounts of memory or CPU. So where in the world was 50TB of data coming from?

    Unifi Poller to the Rescue

    At this point, I remembered that I have Unifi Poller running. The poller grabs data from my Unifi controller and puts it into Mimir. I started poking around the unpoller_ metrics until I found unpoller_device_bytes_total. Looking at that value for my Unifi Security Gateway, well, I saw this graph for the last 30 days:

    unpoller_device_bytes_total – Last 30 Days

    The scale on the right is bytes, so, 50,000,000,000,000 bytes, or roughly 50TB. Since I am not yet collecting the client DPI information with Unifi Poller, I just traced this data back to the start of this ramp up. It turned out to be February 14th at around 12:20 pm.
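
    If you want throughput instead of the raw counter, a rate query over that metric does the trick. Something along these lines works (the name label is a guess at how Unifi Poller tags the USG; check your own label set):

    rate(unpoller_device_bytes_total{name="Unifi Security Gateway"}[1h])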

    GitOps for the win

    Since my cluster states are stored in Git repos, any changes to the state of things are logged as commits to the repository, making it very easy to track back. Combing through my commits for 2/14 around noon, I found the offending commit in the speedtest-exporter (now you see the reference to Lightning McQueen).

    In an effort to move off of the k8s-at-home charts, which are no longer being maintained, I have switched over to creating charts using Bernd Schorgers’ library chart to manage some of the simple installs. The new chart configured the ServiceMonitor to scrape every minute, which meant, well, that I was running a speed test every minute. Of every day. For a month.

    To test my theory, I shut down the speedtest-exporter pod. Before my change, I was seeing 5 and 6 GB of traffic every 30 seconds. With the speed test executing hourly, I am seeing 90-150 MB of traffic every 30 seconds. Additionally, the graph is much more sporadic, which makes more sense: I would expect traffic to increase when my kids are home and watching TV, and decrease at night. What I was seeing was a constant increase over time, which is what pointed me to the speed test. So I fixed the ServiceMonitor to only scrape once an hour, and I will check my data usage in Xfinity tomorrow to see how I did.
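
    A quick way to sanity-check what interval the ServiceMonitor is actually scraping at (the namespace and resource name here are assumptions):

    kubectl -n monitoring get servicemonitor speedtest-exporter -o jsonpath='{.spec.endpoints[*].interval}'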

    My apologies to my neighbors and Xfinity

    Breaking this down, I have been using something around 1TB of bandwidth per day over the last month. So, I apologize to my neighbors for potentially slowing everything down, and to Xfinity for running speed tests once a minute. That is not to say that I will stop running speed tests, but I will go back to testing once an hour rather than once a minute.

    Additionally, I’m using my newfound knowledge of the Unifi Poller metrics to write some alerts so that I can determine if too much data is coming in and out of the network.

  • Going Banana

    That’s right… just one banana. I have been looking to upgrade the Raspberry Pi 3 that has been operating as my home lab’s reverse proxy. While it would have been more familiar to find another Raspberry Pi 4 to use, their availability is, well, terrible. I found a workable, potentially more appropriate, solution in the Banana Pi M5.

    If imitation is the sincerest form of flattery…

    Then the Raspberry Pi Foundation should be blushing so much they may pass out. A Google search of “Raspberry Pi Alternatives 2023” leads to a trove of reviews on various substitutes. Orange Pis, Rock Pis, Banana Pis… where do I begin?

    Suffice it to say that Single Board Computers (SBCs) have taken a huge step forward in the past few years, and many companies are trying to get in the game. It became clear that, well, I needed a requirements list.

    Replacing the Pi3 Proxy

    I took a few minutes to come up with my requirements list:

    • Ubuntu – My Pi3 Proxy has been running Nginx on Ubuntu for over a year. I’m extremely comfortable with the setup that I have, including certbot for automating SSL and the Grafana Agent to report statistics to Mimir. My replacement needs to run Ubuntu, since I have no desire to learn another distro.
    • Gigabit Ethernet – The Pi3 does not support true Gigabit ethernet because of the USB throughput. I want to upgrade, since the proxy is handling all of my home lab traffic. Of note, though: I do not need Wifi or Bluetooth support.
    • Processor/Memory – The Pi3 runs the 1.4 GHz Quad Core Cortex A53 processor with a whopping 1GB of RAM. Truthfully, the Pi3 handles the traffic well, but an upgrade would be nice. Do I need 8GB of RAM? Again, nice to have: my minimum is 4 GB.
    • eMMC – Nginx does a lot of logging, and I worry a bit about the read/write limits on SD cards. As I did my research, I found that a few of the Pi alternatives have eMMC flash memory onboard. This would be a bit more resilient than an SD card, and should be faster. There are also some HATs that support NVMe drives. So, yes, I want some solid onboard storage.

    Taking this list of requirements, I started looking around, and one board stood out: the Banana Pi M5.

    Not the latest Banana in the bunch

    The Banana Pi M5 is not the newest model from Banana Pi. The M6 is their latest offering, and sports a much stronger chipset. However, I had zero luck finding one in stock for a reasonable price. I found a full M5 kit on Amazon for about $120 USD.

    The M5’s Cortex-A55 is a small step up from the RPi3 and sports 4GB of RAM, so my processor/memory requirements were met.

    Gigabit ethernet? Check. The M5 has no built-in Wifi, but, for what I need it for, I frankly do not care.

    Ubuntu? This one was tough to source: their site shows downloads for Ubuntu 20.04 images, but I had to dig around the Internet to verify that someone was able to run a release upgrade to get it to 22.04.

    eMMC? A huge 16GB eMMC flash chip. Based on my current usage, this will more than cover my needs.

    The M5 looked to be a great upgrade to my existing setup without breaking the bank or requiring me to learn something new. Would it be that easy?

    Making the switch

    After receiving the M5 (in standard Amazon 2-day fashion), I got to work. The kit included a “build it yourself” case, heatsinks, and a small fan. After a few minutes of trying to figure out how the case went together, I had everything assembled.

    Making my way over to the M5 Wiki, I followed the steps on the page. Surprisingly, it really was that simple. I imaged an SD card so I could boot Ubuntu, then followed their instructions for installing the Linux image to eMMC. I ejected the SD card, rebooted, and I was up and running.

    A quick round of apt upgrade and do-release-upgrade later, and I was running Ubuntu 22.04. I installed nginx, certbot, and grafana-agent, copied my configuration files over from the old Pi (changing the hostnames, of course), and I was re-configured in easily under 30 minutes.
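
    For the curious, the software side boiled down to a handful of commands (grafana-agent comes from Grafana’s own apt repository, which needs to be added first; that step is omitted here):

    sudo apt update
    sudo apt upgrade -y
    sudo do-release-upgrade
    sudo apt install -y nginx certbot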

    The most satisfying portion of this project was, oddly enough, changing the DNS entries and port forwarding rules to hit the M5 and watching the log entries switch places:

    The green line is log entries from the M5, the yellow line is log entries from the Pi3. You can see I had some stragglers hitting the old pi, but once everything flushed out, the Pi was no longer in use. I shut it down to give it a little break for now, as I contemplate what to do with it next.

    Initial Impressions

    The M5 is certainly snappier, although load levels are about the same as were reported by the RPi3. The RPi3 was a rock, always on, always working. I hope for the same with the M5, but, unfortunately, only time will tell.

  • Speeding up Packer Hyper-V Provisioning

    I spent a considerable amount of time working through the provisioning scripts for my RKE2 nodes. Each node took 25-30 minutes to provision. I felt like I could do better.

    Check the tires

    A quick evaluation of the process made me realize that most of the time is spent in the full install of Ubuntu. Using the hyperv-iso builder plugin from Packer, the machine would be provisioned from scratch. The installer took about 18-20 minutes to provision the VM fully. After that, the RKE2 install took about 1-2 minutes.

    While speaking with my colleague Justin, I realized that I could probably get away with building out a base image using the hyperv-iso builder and then using the hyperv-vmcx builder to copy that base and create a new machine. In theory, that would cut the 18-20 minutes down to a copy job.
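
    Conceptually, the build splits into two stages, something like this (the template file names are hypothetical):

    # Stage 1: hyperv-iso builder does the full Ubuntu install from ISO; run occasionally to refresh the base image
    packer build ubuntu-base.pkr.hcl

    # Stage 2: hyperv-vmcx builder clones the exported base VM; run per node, taking minutes instead of ~20
    packer build rke2-node.pkr.hcl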

    Test Flight Alpha: Initial Cluster Provisioning

    A quick copy of my existing full provisioning template and some judicious editing got me to the point where the hyperv-vmcx builder was running great and producing a VM. I had successfully cut my provisioning time down to under 5 minutes!

    I started editing my Rke2-Provisioning Powershell module to utilize the quick provisioning rather than the full provisioning. I spun up a test cluster with four nodes (three servers and one agent) to make sure everything came up correctly, and within about 30 minutes that four-node cluster was humming along, in a quarter of the time it had taken me before.

    Test Flight Beta: Node Replacement

    The next bit of testing was to ensure that as I ran the replacement script, new machines were provisioned correctly and old machines were torn down. This is where I ran into a snag, but it was a bit difficult to detect at first.

    During the replacement, the first new node would come up fine, and the old node was properly removed and deleted. So, after the first cycle, I had one new node and one old node removed. However, I was getting somewhat random problems with the second, third, and fourth cycles. Most of the time, it was that the ETCD server, during Rancher provisioning, was picking up an IP address from the DHCP range instead of using the fixed IP tied to the MAC address.

    Quick Explanation

    I use the Unifi Controller to run my home network (Unifi Security Gateway and several access points). Through the Unifi APIs, and a wrapper API I wrote, I am able to generate a valid Hyper-V MAC address and associate it with a fixed IP on the Unifi before the Hyper-V VM is ever created. When I create a new machine, I assign it the MAC address that was generated, and my DHCP server always assigns it the same address. This IP is outside of the allocated DHCP range for normal clients. I am working on publishing the Unifi IP Wrapper in a public repository for consumption.

    Back to it..

    As I was saying, even though I was assigning a MAC address that had an associated fixed IP, VMs provisioned after the first one seemed to be failing to pick up that IP. What was different?

    Well, deleting a node returns its IP to the pool, so the process looks something like this:

    • First new node provisioned (IP .45 assigned)
    • First old node deleted (return IP .25 to the pool)
    • Second new node provisioned (IP .25 assigned)

    My assumption is that the Unifi does not like such a quick reassignment of a static IP to a new MAC Address. To test this, I modified the provisioner to first create ALL the new nodes before deleting nodes.

    In that instance, the nodes provisioned correctly using their newly assigned IPs. However, from a resource perspective, I hate the thought of having to run 2n nodes during provisioning, when really all I need is n + 1.

    Test Flight Charlie: Changing IP assignments

    I modified my Unifi Wrapper API to cycle through the IP block I have assigned to my VMs instead of simply always using the lowest available IP. This allows me to go back to replacement one by one, without worrying about IP/MAC Address conflicts on the Unifi.

    Landing it…

    With this improvement, I have fewer qualms about configuring provisioning to run in the evenings. Most likely, I will build the base Ubuntu image weekly or bi-weekly to ensure I have the latest updates. From there, I can use the replacement scripts to replace old nodes with new nodes in the cluster.

    I have not decided if I’m going to use a simple task scheduler in Windows, or use an Azure DevOps build agent on my provisioner… Given my recent miscue when installing the Azure DevOps Build Agent, I may opt for the former.

  • A big mistake and a bit of bad luck…

    In the Home Lab, things were going good. Perhaps a little too good. A bonehead mistake on my part and hardware failure combined to make another ridiculous weekend. I am beginning to think this blog is becoming “Matt messed up again.”

    Permissions are a dangerous thing

    I wanted to install the Azure DevOps agent on my hypervisor to allow me to automate and schedule provisioning of new machines. That would allow the provisioning to occur overnight and be overall less impactful. And it is always a bonus when things just take care of themselves.

    I installed the agent, but it would not start. It was complaining that it needed permissions to basically the entire drive where it was installed. Before really researching or thinking too much about it, I set about giving the service group access to the root of the drive.

    Now, in retrospect, I could have opened the share on my laptop (\\machinename\c$), right clicked in the blank area, and chose Properties from there, which would have got me into the security menu. I did not realize that at the time, and I used the Set-ACL Powershell command.

    What I did not realize is that Set-ACL performs a full replacement; it is not additive. So, while I thought I was adding permissions for a group, what I was really doing was REMOVING EVERYONE ELSE’S PRIVILEGES from the drive and replacing them with that group’s access. I realized my error when I suddenly had no access to the C: drive…
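
    For the record, the additive version of what I was trying to do reads the existing ACL and appends to it before writing it back; something along these lines, with an illustrative group name:

    $acl = Get-Acl -Path 'C:\'
    $rule = New-Object -TypeName System.Security.AccessControl.FileSystemAccessRule -ArgumentList 'MYDOMAIN\AgentServiceGroup','Modify','ContainerInherit,ObjectInherit','None','Allow'
    $acl.AddAccessRule($rule)   # appends to the existing rules instead of replacing them
    Set-Acl -Path 'C:\' -AclObject $acl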

    I thought I got it back…

    After panicking a bit, I realized that what I had added wasn’t a user, but a group. I was able to get into the Group Policy editor for the server and add the Domain Admins group to that service group, which got my user account access. From there, I started rebuilding permissions on the C drive. Things were looking up.

    I was wrong…

    Then, I decided it would be a good idea to install Windows updates on the server and reboot. That was a huge mistake. The server got into a boot loop, where it would boot, attempt to do the updates, fail, and reboot, starting the process over again. It got worse…

    I stopped the server completely during one of the boot cycles for a hard shutdown/restart. When the server POSTed again, the message said, well, basically, that the cache module in the server was no longer there, so it shut off access to my logical drives… All of them.

    What does that mean, exactly? Long story short, my HP Gen8 server has a Smart Array that had a 1GB write cache card in it. That card is, as best I can tell, dead. However, there was a 512MB write cache card in my backup server. I tried a swap, and it was not recognized either. So, potentially, the cache port itself is dead. Either way, my drives were gone.

    Now what?

    I was pretty much out of options. While my data was pretty much safe and secure on the Synology, all of my VMs were down for the count. My only real option was to see if I could get the server to re-mount the drives without the cache and start rebuilding.

    I set up the drives in the same configuration I had previously. I have two 146GB drives and two 1TB drives, so I paired them up into two RAID1 arrays. I caught a break: the machine recognized the previous drives and I did not lose any data. Now, the C drive was, well, toast: I believe my Set-ACL snafu had put that Windows install out of commission. But all of my VMs were on the D drive.

    So I re-installed Hyper-V Server 2019 on the server and got to work attempting to import and start VMs. Once I got connected to the server, I was able to re-import all of my Ubuntu VMs, which are my RKE2 nodes. They started up, and everything was good to go.

    There was a catch…

    Not everything came back. Specifically, ALL of my Windows VMs would not boot. They imported fine, but when it came time to boot, I got a “File not found” exception. I honestly have no idea why. I even had a backup of my Domain Controller, taken using Active Backup for Business on the Synology. I was able to restore it; however, it would not start, throwing the same error.

    My shot in the dark is that it comes down to the way the machines were built: I had provisioned the Windows machines manually, while the Ubuntu machines use Packer. I’m wondering if the export/import process that is part of the Packer workflow put some vital files in place, files that the manually provisioned machines lost because those steps never occurred for them.

    At this point, I’m rebuilding my Windows machines (domain controllers and SQL servers). Once that is done, I will spend some time experimenting on a few test machines to make sure my backups are working… I suppose that’s what disaster recovery tests are for.