Author: Matt

  • Speed. I.. am.. Speed.

I have heard the opening to the Cars movie more times than I can count, and Owen Wilson’s little monologue always sticks in my head. Tangentially, well, I recently logged in to Xfinity to check my data usage, which sent me down a path toward tracking data usage inside my network. I learned a lot about what to do, and what not to do.

    We are using how much data?

    Our internet went out on Sunday. This is not a normal occurrence, so I turned off the wifi on my phone and logged in to Xfinity’s website to report the problem. Out of sheer curiosity, after reporting the downtime, I clicked on the link to see our usage.

    22 TB… That’s right, 22 terabytes of data in February, and approaching 30TB for March. And there were still 12 days left in March! Clearly, something was going on.

    Asking the Unifi Controller

I logged in to the Unifi controller software in the hopes of identifying the source of this traffic. I jumped into the security insights and traffic identification screens and looked at the data for the last month. Not one client showed more than 25 GB of traffic in the last month. That does not match what Xfinity is showing at all.

A quick Google search led me to a few posts suggesting that the Unifi’s automated speed test can inflate your data usage without showing up in the Unifi’s own statistics. Mind you, these posts were 4+ years old, but I figured it was worth a shot. So I disabled the speed test in the Unifi controller, knowing I would have to wait a day to see if the Xfinity numbers changed.

Fast forward a day – no change. According to Xfinity, I was using something like 500GB of data per day, which is nonsense. My previous months never topped 2TB, so burning through that much in four days meant something was wrong.

    Am I being hacked?

Thanks to “security first” being beaten into me by some of my previous security-focused peers, the first thought in my head was “Am I being hacked?” I looked through logs on the various machines and clusters, trying to find where this data was coming from and why. But nothing looked odd or out of the ordinary: no extra pods running calculations, no servers consuming huge amounts of memory or CPU. So where in the world was 50TB of data coming from?

    Unifi Poller to the Rescue

    At this point, I remembered that I have Unifi Poller running. The poller grabs data from my Unifi controller and puts it into Mimir. I started poking around the unpoller_ metrics until I found unpoller_device_bytes_total. Looking at that value for my Unifi Security Gateway, well, I saw this graph for the last 30 days:

    unpoller_device_bytes_total – Last 30 Days

    The scale on the right is bytes, so, 50,000,000,000,000 bytes, or roughly 50TB. Since I am not yet collecting the client DPI information with Unifi Poller, I just traced this data back to the start of this ramp up. It turned out to be February 14th at around 12:20 pm.
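If you want to build a similar panel, the query boils down to something like the line below; treat the label selector as a placeholder, since the device name depends on how unpoller labels your gateway.

unpoller_device_bytes_total{name="USG"}

Wrapping that in increase() over a day gives a rough bytes-per-day number, which is handy for comparing against what the ISP claims.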

    GitOps for the win

    Since my cluster states are stored in Git repos, any changes to the state of things are logged as commits to the repository, making it very easy to track back. Combing through my commits for 2/14 around noon, I found the offending commit in the speedtest-exporter (now you see the reference to Lightning McQueen).

In an effort to move off of the k8s-at-home charts, which are no longer being maintained, I have switched over to creating charts using Bernd Schorgers’ library chart to manage some of the simple installs. The new chart configured the ServiceMonitor to scrape every minute, which meant, well, that I was running a speed test every minute. Of every day. For a month.

To test my theory, I shut down the speedtest-exporter pod and then fixed the ServiceMonitor to scrape only once an hour. Before my change, I was seeing 5 and 6 GB of traffic every 30 seconds. With the speed test executing hourly, I am seeing 90-150 MB of traffic every 30 seconds. Additionally, the graph is much spikier, which makes more sense: I would expect traffic to increase when my kids are home and watching TV, and decrease at night. What I was seeing before was a constant increase over time, which is what pointed me to the speed test. I will check my data usage in Xfinity tomorrow to see how I did.
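As a quick sanity check that the new interval actually landed, something like this works; the namespace and ServiceMonitor name here are assumptions based on my setup, not a universal path.

kubectl -n monitoring get servicemonitor speedtest-exporter -o yaml | grep interval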

    My apologies to my neighbors and Xfinity

    Breaking this down, I have been using something around 1TB of bandwidth per day over the last month. So, I apologize to my neighbors for potentially slowing everything down, and to Xfinity for running speed tests once a minute. That is not to say that I will stop running speed tests, but I will go back to testing once an hour rather than once a minute.

    Additionally, I’m using my newfound knowledge of the Unifi Poller metrics to write some alerts so that I can determine if too much data is coming in and out of the network.
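As a starting point, I am sketching the alert around an expression along these lines, with the same caveat about the device label as above; the threshold (roughly 1 TB per day) is just my gut number.

sum(increase(unpoller_device_bytes_total{name="USG"}[1d])) > 1e12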

  • Going Banana

That’s right… just one banana. I have been looking to upgrade the Raspberry Pi 3 that has been operating as my home lab’s reverse proxy. While it would have been more familiar to find another Raspberry Pi 4 to use, their availability is, well, terrible. I found a workable, potentially more appropriate, solution in the Banana Pi M5.

    If imitation is the sincerest form of flattery…

Then the Raspberry Pi Foundation should be blushing so much they may pass out. A Google search for “Raspberry Pi Alternatives 2023” leads to a trove of reviews on various substitutes. Orange Pis, Rock Pis, Banana Pis… where do I begin?

Suffice it to say that Single Board Computers (SBCs) have taken a huge step forward in the past few years, and many companies are trying to get in the game. It became clear that, well, I needed a requirements list.

    Replacing the Pi3 Proxy

    I took a few minutes to come up with my requirements list:

    • Ubuntu – My Pi3 Proxy has been running Nginx on Ubuntu for over a year. I’m extremely comfortable with the setup that I have, including certbot for automating SSL and the Grafana Agent to report statistics to Mimir. My replacement needs to run Ubuntu, since I have no desire to learn another distro.
    • Gigabit Ethernet – The Pi3 does not support true Gigabit ethernet because of the USB throughput. I want to upgrade, since the proxy is handling all of my home lab traffic. Of note, though: I do not need Wifi or Bluetooth support.
    • Processor/Memory – The Pi3 runs the 1.4 GHz Quad Core Cortex A53 processor with a whopping 1GB of RAM. Truthfully, the Pi3 handles the traffic well, but an upgrade would be nice. Do I need 8GB of RAM? Again, nice to have: my minimum is 4 GB.
• eMMC – Nginx does a lot of logging, and I worry a bit about the read/write limits on SD cards. As I did my research, I found that a few of the Pi alternatives have eMMC flash memory onboard. This would be a bit more resilient than an SD card, and should be faster. There are also some hats to support NVMe drives. So, yes, I want some solid onboard storage.

    Taking this list of requirements, I started looking around, and one board stood out: the Banana Pi M5.

    Not the latest Banana in the bunch

    The Banana Pi M5 is not the newest model from Banana Pi. The M6 is their latest offering, and sports a much stronger chipset. However, I had zero luck finding one in stock for a reasonable price. I found a full M5 kit on Amazon for about $120 USD.

    The M5’s Cortex-A55 is a small step up from the RPi3 and sports 4GB of RAM, so my processor/memory requirements were met.

    Gigabit ethernet? Check. The M5 has no built-in Wifi, but, for what I need it for, I frankly do not care.

Ubuntu? This one was tougher to confirm: their site shows downloads for Ubuntu 20.04 images, but I had to dig around the Internet to verify that someone was able to run a release upgrade to get it to 22.04.

    eMMC? A huge 16GB eMMC flash chip. Based on my current usage, this will more than cover my needs.

    The M5 looked to be a great upgrade to my existing setup without breaking the bank or requiring me to learn something new. Would it be that easy?

    Making the switch

After receiving the M5 (in standard Amazon two-day fashion), I got to work. The kit included a build-it-yourself case, heatsinks, and a small fan. After a few minutes of trying to figure out how the case went together, I had everything assembled.

Making my way over to the M5 Wiki, I followed the steps on the page. Surprisingly, it really was that simple. I imaged an SD card so I could boot Ubuntu, then followed their instructions for installing the Linux image to eMMC. I ejected the SD card, rebooted, and I was up and running.

A quick round of apt upgrade and do-release-upgrade later, and I was running Ubuntu 22.04. I installed nginx, certbot, and grafana-agent, copied my configuration files over from the old Pi (changing the hostnames, of course), and I was reconfigured in easily under 30 minutes.
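For the curious, the software side boiled down to roughly the commands below; this is a sketch from memory, and the grafana-agent package actually comes from Grafana’s apt repository, which I have left out here.

sudo apt update && sudo apt upgrade -y
sudo do-release-upgrade              # 20.04 -> 22.04
sudo apt install -y nginx
sudo snap install --classic certbot
sudo apt install -y grafana-agent    # after adding Grafana's apt repo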

    The most satisfying portion of this project was, oddly enough, changing the DNS entries and port forwarding rules to hit the M5 and watching the log entries switch places:

The green line is log entries from the M5, and the yellow line is log entries from the Pi3. You can see I had some stragglers hitting the old Pi, but once everything flushed out, the Pi3 was no longer in use. I shut it down to give it a little break for now, as I contemplate what to do with it next.

    Initial Impressions

The M5 is certainly snappier, although load levels are about the same as the RPi3 reported. The RPi3 was a rock: always on, always working. I hope for the same from the M5 but, unfortunately, only time will tell.

  • Snakes… Why Did It Have To Be Snakes?

What seems like ages ago, I wrote some Python scripts to keep an eye on my home lab. What I did not realize was that that little introduction to Python would help me dive into the wonderful world of ETL (or ELT, more on that later).

    Manage By Numbers

I love numbers. Pick your favorite personality profile, and I come out as the cold, calculated, patient person who needs all the information to make a decision. As I ramp up and improve my management skills after a brief hiatus as an individual contributor, I identified a few blind spots that I wanted to address with my team. But first, I need the data, and that data currently lives in our Jira Cloud instance and an add-on called Tempo Timesheets.

Now, our data teams have started to build out an internal data warehouse where our various departments can collect and analyze data from our sales systems. They established an ELT flow for this warehouse with the following toolsets:

• Singer.io – Handles data extraction and loading
• dbt – Defines data transformations
• Prefect – Orchestrates flows across the Singer taps & targets and dbt Core transformations

    Don’t you mean ETL?

    There are two primary data integration methods:

    • ETL – Extract, Transform, and Load
    • ELT – Extract, Load, Transform

    At their core, they have the same basic task: get data from one place to another. The difference lies in where data transformations are processed.

    In ETL, data is extracted from the source system, transformed, and then loaded into the destination system (data warehouse) where it can be analyzed. In this case, the raw data is not stored within the data warehouse: only the transformed data is available.

ELT, on the other hand, loads raw data directly into the data warehouse, where transformations are executed within the warehouse itself. Since the raw data is stored in the warehouse, multiple transformations can be run without accessing the source system. This allows activities such as cleansing, enrichment, and transformation to occur on the same data set with less strain on the source system.

In our case, an ELT approach made the most sense: we will have different transformations for different departments, including the need to perform point-in-time transformations for auditing purposes.

    Getting Jira Data

Jira data was, well, the easier part. Singer.io maintains a Jira tap to pull data out of Jira Cloud using Jira’s APIs. “Taps” are connectors to external systems, and Singer has a lot of them. “Targets” are ways to load the data from tap streams into other systems. Singer does not have as many official targets; however, a number of open source contributors maintain additional ones. We are using the Snowflake target to load data into a Snowflake instance.

    Our team built out the data flow, but we were missing Tempo data. Tempo presents some REST APIs, so I figured I could use the tap-jira code as a model to build out a custom tap for Tempo data. And that got me back into Python.
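To give a flavor of what a tap actually does, here is a bare-bones sketch using the singer library; this is not my Tempo tap, and the stream name, schema, and records are made up for illustration.

import singer

# Describe the shape of the records this stream emits
schema = {
    "properties": {
        "id": {"type": "integer"},
        "timeSpentSeconds": {"type": "integer"},
        "startDate": {"type": "string"},
    }
}

# In a real tap, these would come from the Tempo REST API
records = [
    {"id": 1, "timeSpentSeconds": 3600, "startDate": "2023-03-01"},
]

# A tap writes SCHEMA and RECORD messages to stdout; a target consumes them
singer.write_schema("worklogs", schema, ["id"])
singer.write_records("worklogs", records)

A target such as target-snowflake then reads those messages from stdin and handles the loading, which is what keeps taps and targets nicely decoupled.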

    Environment Setup

I’m running WSL2 on Windows 11 with Ubuntu 22.04. I had finished developing and testing my Tempo tap, using virtualenv for isolation, and things seemed to be working fine. Then I wanted to test pushing the data into Snowflake, and upon trying to install the target-snowflake library, I got a number of errors about version incompatibility.

Well, hell. As it turns out, most of the engineers at work are using Windows 10/WSL2 with Ubuntu 20.04, which means they are running Python 3.8. I was running 3.10. A few Google searches later, I realized I needed a better way to isolate my environments than virtualenv. Along comes a bigger snake…

    Anaconda

    My Google searches led me to Anaconda. First, I’m extremely impressed that they got that domain name. Second, Anaconda is way more than what I’m using it for: I’m using the environment management, but there is so much more.

    I installed Anaconda according to the Linuxhint.com guide and, within about 10 minutes, I had a virtual environment with Python 3.8 that I could use to build and test my taps and targets. The environment management is, in my mind, much easier than using virtualenv: the conda command can be used to list environments and search for available packages, rather than remembering where you stored your virtual environment files and how you activate them.
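For reference, the handful of conda commands I actually touch day to day look like this; the environment name is just what I happened to pick.

conda create --name singer-py38 python=3.8   # new environment pinned to Python 3.8
conda activate singer-py38                   # hop into it
conda env list                               # show every environment on the machine
conda deactivate                             # and back out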

    Not changing careers yet

    Sure, I wrote a tap for Tempo data. Am I going to change over to a data architect? Probably not. But at least I can say I succeeded in simple tap development.

    A Note on Open Source

    I’m a HUGE fan of open source and look to contribute where I can. However, while this tap “works,” it’s definitely not ready for the world. I am only extracting a few of the many objects available in the Tempo APIs. I have no tests to ensure correctness. And, most importantly, I have no documentation to guide users.

    Until those things happen, this tap will be locked up behind closed doors. When I find the time to complete it, I will make it public.

  • My 15 pieces of flair… Cloudflare

    With parts of my home lab exposed to the internet for my own convenience, it is always good to add layers of protection to incoming traffic. At a colleague’s suggestion, I took Cloudflare up on their free WAF offering to help add some protection to my setup. As a bonus, I have a much better DNS for my domains, which made automating my SSL certificate renewals a snap.

    What’s a WAF?

    A Web Application Firewall, or WAF, protects your web applications by processing requests and monitoring for attacks. While not a comprehensive solution, it adds a layer of protection.

Cloudflare offers cloud-based services at a variety of price points. For my hobby/home lab sites, well, free as in beer is the best price point for me. Now, you may notice on the price sheet that “WAF” is not actually included in the free version: the free edition does not let you define custom firewall rules and block, challenge, or log requests that match those rules.

    Well, that’s not 100% accurate: you get 5 active firewall rules with the free plan. Not enough to go crazy, but enough to test if you need it.

    And, well, I do not particularly care: For my home lab, the features of most interest to me are the DNS, Caching, DDoS Protection, and the Managed Ruleset.

    Basic Content Caching

Cloudflare provides some basic caching on my proxied sites, which definitely helps with sites like WordPress. My PageSpeed Insights scores are almost 100 ms faster on mobile devices (down from 310 ms), which is pretty good. While I have never paid too much attention to page load speeds, it is good to know that I can improve some things while adding a layer of protection.

    DDoS and Managed Rulesets

    Truthfully, I have not read up on much of this, and have left the Cloudflare defaults pretty much intact. Cloudflare’s blog does a good job of explaining the Managed Rules, and their documentation covers the DDoS rulesets.

    Perhaps if I get bored, or need something to put me to sleep at night, I will start reading up on those rulesets. For now, they are in place, which gives me a little more protection than I had without them.

    Cloudflare DNS

Truthfully, if Cloudflare did nothing other than manage my DNS in a way that allowed certbot to automatically renew my Let’s Encrypt certificates, I would have still moved everything over. Prior to the cutover, I was using GoDaddy’s DNS management and, well, it’s a pain. GoDaddy is very good at selling websites, but DNS management is clearly very low on their list. Cloudflare’s DNS, meanwhile, is simple to manage both through their portal and through the APIs.

    With my DNS moved over, I revisited the certificates on my internal reverse proxy. Following the instructions from Vineet Choudhary over at developerinsider.co, I updated certbot to renew using the Cloudflare plugin.
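The short version of that setup, as best I can condense it, looks like the commands below; the credentials path and domain are placeholders, and the API token only needs DNS edit rights for the zone.

sudo snap set certbot trust-plugin-with-root=ok
sudo snap install certbot-dns-cloudflare
# /root/.secrets/cloudflare.ini contains: dns_cloudflare_api_token = <token>
sudo certbot certonly --dns-cloudflare \
  --dns-cloudflare-credentials /root/.secrets/cloudflare.ini \
  -d example.gerega.net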

    Automagic Renewals?

In the past, with certbot-auto, you would have to set up a cron job to handle automatic renewals. The new certbot snap, however, uses systemd timers to achieve the same thing. So, with my certificates renewed using the correct plugin, I ran a quick test:

    sudo certbot renew --dry-run

    The dry run succeeded without issue. So I checked the timers with the following command:

    systemctl list-timers

Lo and behold, the certbot timer is scheduled to run in the middle of the night.

    Restarting Nginx on Certbot renewals

There is one small issue: even though I am using certonly so that certbot only obtains or renews certificates and does not touch my Nginx configuration, I AM using Nginx as a reverse proxy. Therefore, I need a way to restart Nginx after certbot has done its thing. I found this short article and followed the instructions to add a deploy hook to the /etc/letsencrypt/cli.ini file.
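The change itself is tiny; in my cli.ini the relevant line looks something like this (a reload is enough to pick up renewed certificates, though a full restart works too):

# /etc/letsencrypt/cli.ini
deploy-hook = systemctl reload nginx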

    The article above noted that, to test, you can run the following:

    certbot renew --dry-run

    However, for me, this did NOT trigger the deploy hook. To force triggering the deploy hook, I needed to run this command:

    sudo certbot renew --dry-run --run-deploy-hooks

    This command executed the renewal dry run and successfully reloaded Nginx.

    Minimal Pieces of Cloudflare

    Sure, I have only scratched the surface of Cloudflare’s offerings by adding some free websites and proxying some content. But, as I mentioned, it adds a layer of protection that I did not have before. And, in this day and age, the wire coming into the house presents a bigger security threat than the front door.

  • Badges… We don’t need no stinkin’ badges!

    Well… Maybe we do. This is a quick plug (no reimbursement of any kind) for the folks over at Shields.io, who make creating custom badges for readme files and websites an easy and fun task.

    A Quick Demo

    License for spyder007/MMM-PrometheusAlerts
    Build Status for spyder007/MMM-PrometheusAlerts

    The badges above are generated from Shields.io. The first link looks like this:

    https://img.shields.io/github/license/spyder007/MMM-PrometheusAlerts

    My Github username (spyder007) and the repository name (MMM-PrometheusAlerts) are used in the Image URL, which generates the badge. The second one, build status, looks like this:

    https://img.shields.io/github/actions/workflow/status/spyder007/MMM-PrometheusAlerts/node.js.yml

    In this case, my Github username and the repository name remain the same, but node.js.yml is the name of the workflow file for which I want to display the status.

Every badge in Shields.io has a “builder” page that explains how to construct the image URL and lets you override styles, colors, and labels, and even add logos from any icon in the Simple Icons collection.

    Some examples of alterations to my build status above:

    “For the Badge” style, Bugatti Logo with custom color
    Flat style, CircleCI logo, Custom label
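Those variations are nothing more than query string parameters tacked onto the same image URL. Roughly, they look like the examples below; the exact parameter spellings are listed on each badge’s builder page.

https://img.shields.io/github/actions/workflow/status/spyder007/MMM-PrometheusAlerts/node.js.yml?style=for-the-badge&logo=bugatti&color=blueviolet
https://img.shields.io/github/actions/workflow/status/spyder007/MMM-PrometheusAlerts/node.js.yml?style=flat&logo=circleci&label=CI%20Build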

    Too many options to list…

Now, these are live badges, meaning, if my build fails, the green “passing” will turn into a red “failing.” Shields.io does this by using the variety of APIs available to gather data about builds, code coverage, licenses, chat, sizes and download counts, funding, issue tracking… It’s a lot. But the beauty of it is that you can create readme files or websites with easy-to-read visuals. My PI Monitoring repository’s readme makes use of a number of these shields to give you a quick look at the status of the repo.

  • Building Software Longevity

The “Ship of Theseus” thought experiment is an interesting way to start fights with historians, but in software, replacing old parts with new parts is a requirement for longevity. Designing software so that every piece can be replaced is vital to building software for the future.

    The Question

    The Wikipedia article presents the experiment as follows:

    According to legend, Theseus, the mythical Greek founder-king of Athens, rescued the children of Athens from King Minos after slaying the minotaur and then escaped onto a ship going to Delos. Each year, the Athenians commemorated this legend by taking the ship on a pilgrimage to Delos to honor Apollo. A question was raised by ancient philosophers: After several centuries of maintenance, if each individual part of the Ship of Theseus was replaced, one at a time, was it still the same ship?

    Ship of Theseus – Wikipedia

    The Software Equivalent

Consider Microsoft Word. Released in 1983, Word is approaching its 40th anniversary. And, while I do not have access to its internal workings, I am willing to bet that most, if not all, of the 1983 code has since been replaced by updated modules. So, while it is still called Word, its parts are much newer than the original 1983 iteration. I am sure if I sat here long enough, I could identify other applications with similar histories. The branding does not change, the core functionality does not change; only the code changes.

    Like the wood on the Ship of Theseus, software rots. And it rots fast. Frameworks and languages evolve quickly to take advantage of hardware updates, and the software that uses those must do the same.

    Design for Replacement

We use the term technical debt to categorize the conscious choices we make to prioritize time to market over perfect code. It is worthwhile to consider that software has a “half-life” or “depreciation factor” as well: while your code may work today, chances are that, without appropriate maintenance, it will rot into something that can no longer serve the needs of your customers.

If I had a “one size fits all” solution to software rot, I would probably be a millionaire. The truth is, product managers, engineers, and architects must be aware of software rot. Not only must we invest the time to fix the rot, but we must design our products so that every piece of the application can be replaced, even as the software still stands.

    This is especially critical in Software as a Service (SaaS) offerings, where we do not have the luxury of large downtimes for upgrades or replacements. To our customers, we should operate continuously, with upgrades happening almost invisibly. This requires the foresight to build replaceable components and the commitment to fixing and replacing components regularly. If you cannot replace components of your software, there will come a day where your software will no longer function for your customers.

  • Speeding up Packer Hyper-V Provisioning

I spent a considerable amount of time working through the provisioning scripts for my RKE2 nodes. Each node took 25-30 minutes to provision. I felt like I could do better.

    Check the tires

A quick evaluation of the process made me realize that most of the time was spent in the full install of Ubuntu. Using the hyperv-iso builder plugin for Packer, each machine was provisioned from scratch, and the installer took about 18-20 minutes to provision the VM fully. After that, the RKE2 install took about 1-2 minutes.

Speaking with my colleague Justin, it occurred to me that I could probably get away with building a base image using the hyperv-iso builder and then using the hyperv-vmcx builder to copy that base and create each new machine. In theory, that would cut the 18-20 minutes down to a copy job.
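Conceptually, the pipeline becomes two separate packer builds; the template names below are just how I think of the two stages, not my actual file names.

# Stage 1: full Ubuntu install from ISO (slow, run occasionally)
packer build ubuntu-base-iso.pkr.hcl
# Stage 2: clone the exported base VM and layer RKE2 on top (fast, run per node)
packer build rke2-node-vmcx.pkr.hcl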

    Test Flight Alpha: Initial Cluster Provisioning

A quick copy of my existing full-provisioning template and some judicious editing got me to the point where the hyperv-vmcx builder was running great and producing a VM. I had successfully cut my provisioning time down to under 5 minutes!

I started editing my Rke2-Provisioning Powershell module to utilize the quick provisioning rather than the full provisioning. Then I spun up a test cluster with four nodes (three servers and one agent) to make sure everything came up correctly. Within about 30 minutes, that four-node cluster was humming along, in a quarter of the time it had taken me before.

    Test Flight Beta: Node Replacement

    The next bit of testing was to ensure that as I ran the replacement script, new machines were provisioned correctly and old machines were torn down. This is where I ran into a snag, but it was a bit difficult to detect at first.

During the replacement, the first new node would come up fine, and the old node was properly removed and deleted. So, after the first cycle, I had one new node, and one old node had been removed. However, I was getting somewhat random problems with the second, third, and fourth cycles. Most of the time, the etcd server, during Rancher provisioning, was picking up an IP address from the DHCP range instead of using the fixed IP tied to the MAC address.

    Quick Explanation

I use the Unifi Controller to run my home network (Unifi Security Gateway and several access points). Through the Unifi APIs, and a wrapper API I wrote, I am able to generate a valid Hyper-V MAC address and associate it with a fixed IP on the Unifi before the Hyper-V VM is ever configured. When I create a new machine, I assign it the generated MAC address, and my DHCP server always hands it the same IP. This IP is outside of the DHCP range allocated to normal clients. I am working on publishing the Unifi IP Wrapper in a public repository for consumption.

Back to it…

    As I was saying, even though I was assigning a MAC address that had an associated fixed IP, VMs provisioned after the first one seemed to be failing to pick up that IP. What was different?

    Well, deleting a node returns its IP to the pool, so the process looks something like this:

    • First new node provisioned (IP .45 assigned)
    • First old node deleted (return IP .25 to the pool)
    • Second new node provisioned (IP .25 assigned)

    My assumption is that the Unifi does not like such a quick reassignment of a static IP to a new MAC Address. To test this, I modified the provisioner to first create ALL the new nodes before deleting nodes.

In that instance, the nodes provisioned correctly using their newly assigned IPs. However, from a resource perspective, I hate the thought of having to run 2n nodes during provisioning when really all I need is n + 1.

    Test Flight Charlie: Changing IP assignments

    I modified my Unifi Wrapper API to cycle through the IP block I have assigned to my VMs instead of simply always using the lowest available IP. This allows me to go back to replacement one by one, without worrying about IP/MAC Address conflicts on the Unifi.

    Landing it…

    With this improvement, I have fewer qualms about configuring provisioning to run in the evenings. Most likely, I will build the base Ubuntu image weekly or bi-weekly to ensure I have the latest updates. From there, I can use the replacement scripts to replace old nodes with new nodes in the cluster.

    I have not decided if I’m going to use a simple task scheduler in Windows, or use an Azure DevOps build agent on my provisioner… Given my recent miscue when installing the Azure DevOps Build Agent, I may opt for the former.

  • A big mistake and a bit of bad luck…

    In the Home Lab, things were going good. Perhaps a little too good. A bonehead mistake on my part and hardware failure combined to make another ridiculous weekend. I am beginning to think this blog is becoming “Matt messed up again.”

    Permissions are a dangerous thing

    I wanted to install the Azure DevOps agent on my hypervisor to allow me to automate and schedule provisioning of new machines. That would allow the provisioning to occur overnight and be overall less impactful. And it is always a bonus when things just take care of themselves.

I installed the agent, but it would not start. It was complaining that it needed permissions to basically the entire drive where it was installed. Before really researching or thinking too much about it, I set about giving the service group access to the root of the drive.

Now, in retrospect, I could have opened the share from my laptop (\\machinename\c$), right-clicked in the blank area, and chosen Properties from there, which would have gotten me into the security settings. I did not realize that at the time, so I used the Set-ACL Powershell command.

What I did not realize is that Set-ACL performs a full replacement; it is not additive. So, while I thought I was adding permissions for a group, what I was really doing was REMOVING EVERYONE ELSE’S PRIVILEGES from the drive and replacing them with that group’s access. I realized my error when I simply had no access to the C: drive…
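For the record, the additive pattern I should have reached for looks something like this; the group name is illustrative, and the rule details would vary.

# Read the existing ACL, append a rule for the service group, and write it back
$acl  = Get-Acl -Path "C:\"
$rule = New-Object System.Security.AccessControl.FileSystemAccessRule -ArgumentList `
    "MYDOMAIN\AgentService", "ReadAndExecute", "ContainerInherit,ObjectInherit", "None", "Allow"
$acl.AddAccessRule($rule)
Set-Acl -Path "C:\" -AclObject $acl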

    I thought I got it back…

    After panicking a bit, I realized that what I had added wasn’t a user, but a group. I was able to get into the Group Policy editor for the server and add the Domain Admins group to that service group, which got my user account access. From there, I started rebuilding permissions on the C drive. Things were looking up.

    I was wrong…

    Then, I decided it would be a good idea to install Windows updates on the server and reboot. That was a huge mistake. The server got into a boot loop, where it would boot, attempt to do the updates, fail, and reboot, starting the process over again. It got worse…

I stopped the server completely during one of the boot cycles for a hard shutdown/restart. When the server posted again, the POST said, well, basically, that the cache module in the server was no longer there, so it shut off access to my logical drives… All of them.

    What does that mean, exactly? Long story short, my HP Gen8 server has a Smart Array that had a 1GB write cache card in it. That card is, as best I can tell, dead. However, there was a 512MB write cache card in my backup server. I tried a swap, and it was not recognized either. So, potentially, the cache port itself is dead. Either way, my drives were gone.

    Now what?

I was pretty much out of options. While my data was safe and secure on the Synology, all of my VMs were down for the count. My only real option was to see if I could get the server to re-mount the drives without the cache and start rebuilding.

I set up the drives in the same configuration I had previously: I have two 146GB drives and two 1TB drives, so I paired them up into two RAID1 arrays. I caught a break: the machine recognized the previous drives and I did not lose any data. Now, the C drive was, well, toast: I believe my Set-ACL snafu put that Windows install out of commission. But all of my VMs were on the D drive.

So I re-installed Hyper-V Server 2019 on the server and got to work attempting to import and start VMs. Once I got connected to the server, I was able to re-import all of my Ubuntu VMs, which are my RKE2 nodes. They started up, and everything was good to go.

    There was a catch…

Not everything came back. Specifically, ALL of my Windows VMs would not boot. They imported fine, but when it came time to boot, I got a “File not found” exception. I honestly have no idea why. I even had a backup of my domain controller, taken using Active Backup for Business on the Synology. I was able to restore it; however, it would not start, throwing the same error.

My shot in the dark is that it comes down to the way the machines were built: I had provisioned the Windows machines manually, while the Ubuntu machines were built with Packer. I wonder if the export/import process that is part of the Packer workflow moved some vital files into place, files that the manually provisioned machines lost because that step never happens with a manual provision.

At this point, I’m rebuilding my Windows machines (domain controllers and SQL servers). Once that is done, I will spend some time experimenting on a few test machines to make sure my backups are working… I suppose that’s what disaster recovery tests are for.

  • Tech Tips – Moving away from k8s-at-home

Much of what I learned about Helm charting and running workloads in Kubernetes I credit to the contributors over at k8s-at-home. Their expansive chart collection helped me jump into Kubernetes.

Last year, they announced they were deprecating their repositories. I am not surprised: the sheer volume of charts meant keeping up to date with the latest releases from a number of vendors. If a vendor changed an image or configuration, well, someone had to fix it. That’s a lot of work for a small group with no real benefit other than “doing good for others.”

    Thankfully, one of their contributors, Bernd Schorgers, continues to maintain a library chart that can be used as a basis for most of the charts I use.

    Wanting to move off of the k8s-at-home charts for good, I spent some time this week migrating to Bernd’s library chart. I created new images for the following charts.

    Hopefully one or more of these templates can help move you off of the k8s-at-home charts.

    A Huge Thanks

    I cannot stress this enough: I owe a huge thanks to the k8s-at-home folks. Their work allowed me to jump into Helm by examining what they had done to understand where I could go. I appreciate their contributions to the community: past, present, and future.

  • Automated RKE2 Cluster Management

    One of the things I like about cloud-hosted Kubernetes solutions is that they take the pain out of node management. My latest home lab goal was to replicate some of that functionality with RKE2.

Did I do it? Yes. Is there room for improvement? Of course, it’s a software project.

    The Problem

With RKE1, I have a documented and very manual process for replacing nodes in my clusters. It shapes up like this:

1. Provision a new VM.
2. Add a DNS entry for the new VM.
3. Edit the cluster.yml file for that cluster, adding the new VM with the appropriate roles to match the outgoing node.
4. Run rke up.
5. Edit the cluster.yml file for that cluster to remove the old VM.
6. Run rke up.
7. Modify the cluster’s ingress-nginx settings, adding the new external IP and removing the old one.
8. Modify my reverse proxy to reflect the IP changes.
9. Delete the old VM and its DNS entries.

Repeat the above process for every node in the cluster. Additionally, because the nodes could have slightly different Docker versions or updates, I often found myself provisioning a whole set of VMs at a time and going through this process for all the existing nodes at once. The process was fraught with problems, not the least of which was me having to remember everything I had to do.

    A DNS Solution

    I wrote a wrapper API to manage Windows DNS settings, and built calls to that wrapper into my Unifi Controller API so that, when I provision a new machine or remove an old one, it will add or remove the fixed IP from Unifi AND add or remove the appropriate DNS record for the machine.

    Since I made DNS entries easier to manage, I also came up with a DNS naming scheme to help manage cluster traffic:

    1. Every control plane node gets an A record with cp-<cluster name>.gerega.net. This lets my kubeconfig files remain unchanged, and traffic is distributed across the control plane nodes via round robin DNS.
    2. Every node gets an A record with tfx-<cluster name>.gerega.net. This allows me to configure my external reverse proxy to use this hostname instead of an individual IP list. See below for more on this from a reverse proxy perspective.

    That solved most of my DNS problems, but I still had issues with the various rke up runs and compatibility worries.

    Automating with RKE2

The provisioning process for RKE2 is much simpler than that for RKE1. I was able to shift the cluster configuration into the Packer provisioning scripts, which allowed me to do more within the associated Powershell scripts. This, coupled with the DNS standards above, meant that I could run one script and end up with a completely provisioned RKE2 cluster.

I quickly realized that adding and removing nodes to/from an RKE2 cluster was equally easy. Adding a node simply meant provisioning a new VM with the appropriate scripting to install RKE2 and point it at the existing control plane. Removing a node from the cluster was simple:

1. Drain the node (kubectl drain, as sketched below).
2. Delete the node from the cluster (kubectl delete node/<node name>).
3. Delete the VM (and its associated DNS).
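The kubectl side of a removal is short enough to show in full; the node name is a placeholder, and the drain flags are the usual ones for daemonsets and emptyDir volumes.

kubectl drain rke2-agent-01 --ignore-daemonsets --delete-emptydir-data
kubectl delete node rke2-agent-01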

    As long as I had at least one node with the server role running at all times, things worked fine.

With RKE2, though, I decided to abandon my ingress-nginx installations in favor of RKE2’s built-in Nginx ingress. This allows me to skip managing the cluster’s external IPs, as the RKE2 installer handles that for me.

    Proxying with Nginx

A little over a year ago, I posted my updated network diagram, which introduced a hardware proxy in the form of a Raspberry Pi running Nginx. That little box is a workhorse, and plans are in place for a much-needed upgrade. In the meantime, however, it works.

    My configuration was heavily IP based: I would configure upstream blocks with each cluster node’s IP set, and then my sites would be configured to proxy to those IPs. Think something like this:

    upstream cluster1 {
      server 10.1.2.50:80;
      server 10.1.2.51:80;
      server 10.1.2.52:80;
    }
    
    server {
       ## server settings
    
       location / {
         proxy_pass http://cluster1;
         # proxy settings
       }
    }

The issue here is that every time I add or remove a cluster node, I have to mess with this file. My DNS server is set up for round-robin DNS, which means I should be able to add new A records with the same host name, and the DNS will cycle through the different servers.

My worry, though, was the Nginx reverse proxy: if I point the upstream at a single DNS name, will it cache the resolved IP? Nothing to do but test, right? So I changed my configuration as follows:

    upstream cluster1 {
      server tfx-cluster1.gerega.net:80;
    }
    
    server {
       ## server settings
    
       location / {
         proxy_pass http://cluster1;
         # proxy settings
       }
    }

Everything seemed to work, but how could I know it worked? For that, I dug into my Prometheus metrics.

    Finding where my traffic is going

    I spent a bit of time trying to figure out which metrics made the most sense to see the number of requests coming through each Nginx controller. As luck would have it, I always put a ServiceMonitor on my Nginx applications to make sure Prometheus is collecting data.

I dug around in the Nginx metrics and found nginx_ingress_controller_requests. With some experimentation, I landed on this query:

    sum(rate(nginx_ingress_controller_requests{cluster="internal"}[2m])) by (instance)

Looks easy, right? Basically, look at the sum of the rate of incoming requests by instance over a given time window. Now, I could clean this up a little and add some rounding, but I really did not care about the exact number: I wanted to make sure that the requests across the instances were balanced effectively. I was not disappointed:

Rate of Incoming Requests

    Each line is an Nginx controller pod in my internal cluster. Visually, things look to be balanced quite nicely!

    Yet Another Migration

    With the move to RKE2, I made more work for myself: I need to migrate my clusters from RKE1 to RKE2. With Argo, the migration should be pretty easy, but still, more home lab work.

    I also came out of this with a laundry list of tech tips and other long form posts… I will be busy over the next few weeks.