• Why I am not using Resharper anymore

    I have been a subscriber to JetBrains Resharper for a number of years. Additionally, my employer carries a number of licenses for Resharper, and some teams still use it. But I recently decided to cancel my subscription, and it might be worth a few words as to why.

    IDEs are uniquely personal

    To developers, IDEs are kind of like jeans: you shop around for a brand you like, then buy every different style of that brand to have a little variety in a known quantity. When I started my development journey, IDEs were tied to the framework you were using: Visual Studio for Microsoft stacks like VB6, VC++, and the .NET Framework, Eclipse for Java, etc. These tools have since branched out, and new IDEs have come along with good multi-language support. TechRepublic has a good overview of 12 strong players in the space.

    The point is, once you find one you like, you stick with it. Personally, I have always been a fan of Visual Studio and its baby brother, Visual Studio Code. The former is my go-to when I need a full-blown IDE; the latter is great for my “text-only” work, like PowerShell scripting, HashiCorp Terraform-ing, and even Python development when I must.

    Adding Resharper

    I was first introduced to Resharper by a company I consulted for. They used it heavily, and, at the time, it had some features that I got used to, so much so that I bought my own subscription.

    • Ctrl-T search: Resharper’s search functionality made it quick and easy to find classes and enums using Ctrl-T. Sure, a Ctrl-Shift-F search was similar, but Resharper’s pop-up made it less disruptive.
    • Refactoring: Resharper’s refactoring capability was head and shoulders above Visual Studio. The refactor preview was helpful in understanding exactly what you were changing before you did it. This refactoring is probably the feature I would miss the most if I were still working through legacy code. More on that later.
    • Code styling and standards: Resharper added a lot of functionality for standardizing code style, including a large set of configurable rules.

    Why stop using Resharper?

    Yes, Resharper added a lot of functionality. My choice to stop using it boils down to a combination of a change in needs on my end and the industry playing catch up.

    My own changing needs

    As a software developer digging into legacy production code, I spent a lot of time being careful not to break anything. Resharper’s careful refactoring made that easier. Additionally, when Resharper’s styling rules could be enforced through shared tools like SonarQube, it made sense to align on a ruleset for the team.

    As I migrated to a higher-level technical role, my coding shifted from production work to proofs of concept. With that, I needed to move more quickly, and I found that Resharper was quite literally slowing me down. There have always been complaints about Resharper affecting Visual Studio’s speed, and I definitely notice an improvement when Resharper is not installed.

    The Industry is catching up

    Microsoft has not sat idly by. Visual Studio continues to improve its own search and refactoring capabilities. Roslyn allows code styling and standards to be embedded in your project and checked as part of compilation. So? Well, embedding style checks in the build means we can fail builds when those checks fail. No more external tools for enforcing code styling.

    Additionally, tools like SonarLint have extended Roslyn to provide additional checks. So we can use third-party tools to extend our standards without adding much to the build process.
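
    As a concrete example, here is a minimal sketch of what that looks like from the command line, assuming an SDK-style .NET project with an .editorconfig defining the style rules (the project name is a placeholder):

    # Build with Roslyn code-style analysis enabled and style violations treated as errors
    dotnet build MyProject.csproj \
      -p:EnforceCodeStyleInBuild=true \
      -p:TreatWarningsAsErrors=true

    In practice, you would set those properties in the .csproj itself so every local and CI build enforces them automatically.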

    Resharper in a professional environment

    I have mixed feelings on using Resharper in a professional environment. I can see the benefits that it provides to individual developers with regard to refactoring, search, and styling. However, it’s somewhat more difficult to enforce styling across a team, and most code quality tools (like SonarQube) do not have direct integrations with Resharper.

    Resharper provides a new set of CLI tools for executing code analysis and inspections in build pipelines, and it looks like they are trying to bridge the gap. But the relative simplicity of setting up SonarLint and SonarQube together allows for tracking overall code quality across multiple projects, something I simply do not get with Resharper today.

    The Experiment

    Before deciding on whether or not to renew my subscription, I ran an experiment: I disabled Resharper in Visual Studio. I figured I would pretty quickly find where I was missing Resharper’s functionality and be able to determine whether or not it was worth the money to continue my subscription.

    To my great surprise, in a three-month period, not once did I run into a situation where I tried to use a Resharper function that I was missing. My day-to-day coding was unaffected by Resharper being disabled. I would have expected to hit Ctrl-T at least once and wonder why nothing came up. It would seem my muscle memory for Resharper was not as strong as I thought it was. With that, I cancelled my subscription.

    Perhaps I’ll use that extra $100 a year on something new and interesting…

  • Tech Tip – Configuring RKE2 Nginx Ingress using a HelmChartConfig Resource

    The RKE2 documentation is there, but, well, it is not quite as detailed as I have seen in other areas. This is a quick tip for customizing your Nginx Ingress controllers when using RKE2.

    Using Nginx Ingress in RKE2

    By default, an RKE2 cluster deploys the nginx-ingress Helm chart. That’s great, except that you may need to customize that chart. This is where the HelmChartConfig resource comes in.

    RKE2 uses HelmChartConfig custom resource definitions (CRDs) to allow you to set configuration options for its default Helm deployments. This is pretty useful, and seemed straightforward, except I had a hard time figuring out HOW to set the options.

    Always easier than I expect

    The RKE2 documentation points you to the nginx-ingress chart, but it took me a bit to realize that the connection was as simple as setting the valuesContent value in the HelmChartConfig spec to whatever values I wanted to pass in to Nginx.

    apiVersion: helm.cattle.io/v1
    kind: HelmChartConfig
    metadata:
      name: rke2-ingress-nginx
      namespace: kube-system
    spec:
      valuesContent: |-
        controller:
          config:
            use-forwarded-headers: "true"
            proxy-buffer-size: "256k"
            proxy-buffers-number: "4"
            large-client-header-buffers: "4 16k"
          metrics:
            enabled: true
            serviceMonitor:
              enabled: true
              additionalLabels:
                cluster: nonproduction

    The above sets some configuration values in the controller AND enables metrics collection using the ServiceMonitor object. For Nginx, valid values for valuesContent are the same as values in the chart’s values.yaml file.
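
    If you are managing this resource by hand, applying it is a single kubectl command; the file name below is arbitrary. Alternatively, RKE2 will pick up manifests dropped into /var/lib/rancher/rke2/server/manifests/ on a server node.

    # Apply the override and confirm RKE2 accepted it
    kubectl apply -f rke2-ingress-nginx-config.yaml
    kubectl -n kube-system get helmchartconfig rke2-ingress-nginx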

    Works with other charts

    RKE2 provides additional charts that can be deployed and customized with similar methods. Some charts are deployed by default, and the documentation provides instructions for disabling them. However, the same HelmChartConfig method above can be used to customize those chart installs as well.

  • Literally every time I go away…

    It really truly seems like every time I go away, something funny happens in the lab and stuff goes down. That pattern continued last week.

    Loads of Travel

    A previously planned Jamaican getaway butted up against a short-notice business meeting in Austin. Since I could not get a direct flight to either location, I ended up on 8 flights in 8 days.

    The first four were, for lack of a more descriptive word, amazing! My wife and I took a quick trip to Couples Swept Away in Negril for a short holiday. While I knew it was an all-inclusive, I did not realize that included scuba diving. I took advantage of that and dove five times in three days. Thankfully, I was always back by noon for a few much-needed beach naps.

    The Saturday that we were gone, I got a few text messages from myself indicating that my websites were down. Since I got the messages, well, that meant I had power to the Raspberry Pi that runs the monitoring software and my internet was up. I tried logging in, but nothing that was attached to the NAS was coming up… So the NAS was down.

    A quick troubleshooting session

    Since there was not much I could do from the beach, I put it away until I got home Tuesday night. Now, mind you, I had a flight out early Wednesday morning for Austin, but I really could not sleep until I figured out what was wrong.

    As it turns out, we must have had a power brownout/blackout. The NAS went down and did not reboot. I ended up having to unplug it to force a restart.

    That was, well, the extent of my troubleshooting: once the NAS came back up, Kubernetes took over and made sure everything was running again.

    Move to the cloud?

    My home lab is just a lab… When stuff goes down, well, no one should care but me. And that is true for everything except this site. I am torn between spending some money to host this site externally, or fortifying my current setup with some additional redundant systems.

    What kinds of fortifications, you might ask… Well, a proper UPS would be nice, avoiding the hard shutdowns that could wreck my systems. But, there’s no way I am going to pay for redundant internet to my house, so I will always be tied to that.

    For now, I am ok with hosting as I am today and having to live with some downtime. March was a rough month with a few days of outage, but I generally hit 98-99% uptime, based on the data from my StatusPage site. For a home lab, I would say that is pretty good.

  • Speed. I.. am.. Speed.

    I have heard the opening to the Cars movie more times than I can count, and Owen Wilson’s little monologue always sticks in my head. Tangentially, well, I recently logged in to Xfinity to check my data usage, which sent me down a path towards tracking data usage inside of my network. I learned a lot about what to do, and what not to do.

    We are using how much data?

    Our internet went out on Sunday. This is not a normal occurrence, so I turned off the wifi on my phone and logged in to Xfinity’s website to report the problem. Out of sheer curiosity, after reporting the downtime, I clicked on the link to see our usage.

    22 TB… That’s right, 22 terabytes of data in February, and approaching 30TB for March. And there were still 12 days left in March! Clearly, something was going on.

    Asking the Unifi Controller

    I logged in to the Unifi controller software in the hopes of identifying the source of this traffic. I jumped into the security insights, traffic identification, and looked at the data for the last month. Not one client showed more than 25 GB of traffic in the last month. That does not match what Xfinity is showing at all.

    A quick Google search led me to a few posts suggesting that the Unifi’s automated speed test can boost your data usage, but that it does not show up in the Unifi’s own traffic stats. Mind you, these posts were 4+ years old, but I figured it was worth a shot. So I disabled the speed test in the Unifi controller, but I would have to wait a day to see if the Xfinity numbers changed.

    Fast forward a day – no change. According to Xfinity, I was using something like 500GB of data per day, which is nonsense. My previous months were never higher than 2TB, so using that much data in four days meant something was wrong.

    Am I being hacked?

    Thanks to “security first” being beaten into me by some of my previous security-focused peers, the first thought in my head was “Am I being hacked?” I looked through logs on the various machines and clusters, trying to find where this data was coming from and why. But nothing looked odd or out of the ordinary: no extra pods running calculations, no servers consuming huge amounts of memory or CPU. So where in the world was 50TB of data coming from?

    Unifi Poller to the Rescue

    At this point, I remembered that I have Unifi Poller running. The poller grabs data from my Unifi controller and puts it into Mimir. I started poking around the unpoller_ metrics until I found unpoller_device_bytes_total. Looking at that value for my Unifi Security Gateway, well, I saw this graph for the last 30 days:

    unpoller_device_bytes_total – Last 30 Days

    The scale on the right is bytes, so, 50,000,000,000,000 bytes, or roughly 50TB. Since I am not yet collecting the client DPI information with Unifi Poller, I just traced this data back to the start of this ramp up. It turned out to be February 14th at around 12:20 pm.

    GitOps for the win

    Since my cluster states are stored in Git repos, any changes to the state of things are logged as commits to the repository, making it very easy to track back. Combing through my commits for 2/14 around noon, I found the offending commit in the speedtest-exporter (now you see the reference to Lightning McQueen).

    In an effort to move off of the k8s-at-home charts, which are no longer being maintained, I have switched over to creating charts using Bernd Schorgers’ library chart to manage some of the simple installs. The new chart configured the ServiceMonitor to scrape every minute, which meant, well, that I was running a speed test every minute. Of every day. For a month.

    To test my theory, I shut down the speedtest-exporter pod. Before my change, I was seeing 5 to 6 GB of traffic every 30 seconds. With the speed test executing hourly, I am seeing 90-150 MB of traffic every 30 seconds. Additionally, the graph is much more variable, which makes more sense: I would expect traffic to increase when my kids are home and watching TV, and decrease at night. What I was seeing before was a constant increase over time, which is what pointed me to the speed test. So I fixed the ServiceMonitor to only scrape once an hour, and I will check my data usage in Xfinity tomorrow to see how I did.
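
    For reference, the fix boils down to the scrape interval on the ServiceMonitor. This is a rough sketch of what the corrected resource looks like; in my case it is generated from Helm values, and the names, labels, and port below are placeholders:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: speedtest-exporter
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: speedtest-exporter
      endpoints:
        - port: http
          interval: 60m       # scrape (and therefore run a speed test) hourly instead of every minute
          scrapeTimeout: 2m   # give the speed test time to finish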

    My apologies to my neighbors and Xfinity

    Breaking this down, I have been using something around 1TB of bandwidth per day over the last month. So, I apologize to my neighbors for potentially slowing everything down, and to Xfinity for running speed tests once a minute. That is not to say that I will stop running speed tests, but I will go back to testing once an hour rather than once a minute.

    Additionally, I’m using my newfound knowledge of the Unifi Poller metrics to write some alerts so that I can determine if too much data is coming in and out of the network.

  • Going Banana

    That’s right… just one banana. I have been looking to upgrade the Raspberry Pi 3 that has been operating as my home lab’s reverse proxy. While it would have been more familiar to find another Raspberry Pi 4 to use, their availability is, well, terrible. I found a workable, potentially more appropriate, solution in the Banana Pi M5.

    If imitation is the sincerest form of flattery…

    Then the Raspberry Pi Foundation should be blushing so much they may pass out. A Google search for “Raspberry Pi alternatives 2023” leads to a trove of reviews on various substitutes. Orange Pis, Rock Pis, Banana Pis… where do I begin?

    Suffice it to say that Single Board Computers (SBCs) have taken a huge step forward in the past few years, and many companies are trying to get in the game. It became clear that, well, I needed a requirements list.

    Replacing the Pi3 Proxy

    I took a few minutes to come up with my requirements list:

    • Ubuntu – My Pi3 Proxy has been running Nginx on Ubuntu for over a year. I’m extremely comfortable with the setup that I have, including certbot for automating SSL and the Grafana Agent to report statistics to Mimir. My replacement needs to run Ubuntu, since I have no desire to learn another distro.
    • Gigabit Ethernet – The Pi3 does not support true Gigabit ethernet because of the USB throughput. I want to upgrade, since the proxy is handling all of my home lab traffic. Of note, though: I do not need Wifi or Bluetooth support.
    • Processor/Memory – The Pi3 runs the 1.4 GHz Quad Core Cortex A53 processor with a whopping 1GB of RAM. Truthfully, the Pi3 handles the traffic well, but an upgrade would be nice. Do I need 8GB of RAM? Again, nice to have: my minimum is 4 GB.
    • eMMC – Nginx does a lot of logging, and I worry a bit about the read/write limits on SD cards. As I did my research, I found that a few of the Pi alternatives have eMMC flash memory onboard. This would be a bit more resilient than an SD card, and should be faster. There are also some HATs that support NVMe drives. So, yes, I want some solid onboard storage.

    Taking this list of requirements, I started looking around, and one board stood out: the Banana Pi M5.

    Not the latest Banana in the bunch

    The Banana Pi M5 is not the newest model from Banana Pi. The M6 is their latest offering, and sports a much stronger chipset. However, I had zero luck finding one in stock for a reasonable price. I found a full M5 kit on Amazon for about $120 USD.

    The M5’s Cortex-A55 is a small step up from the RPi3 and sports 4GB of RAM, so my processor/memory requirements were met.

    Gigabit ethernet? Check. The M5 has no built-in Wifi, but, for what I need it for, I frankly do not care.

    Ubuntu? This one was tougher to pin down: their site shows downloads for Ubuntu 20.04 images, but I had to dig around the Internet to verify that someone had successfully run a release upgrade to get to 22.04.

    eMMC? A huge 16GB eMMC flash chip. Based on my current usage, this will more than cover my needs.

    The M5 looked to be a great upgrade to my existing setup without breaking the bank or requiring me to learn something new. Would it be that easy?

    Making the switch

    After receiving the M5 (in standard Amazon two-day fashion), I got to work. The kit included a “build it yourself” case, heatsinks, and a small fan. After a few minutes of trying to figure out how the case went together, I had everything assembled.

    Making my way over to the M5 Wiki, I followed the steps on the page. Surprisingly, it really was that simple. I imaged an SD card so I could boot Ubuntu, then followed their instructions for installing the Linux image to eMMC. I ejected the SD card, rebooted, and I was up and running.

    A quick round of apt upgrade and do-release-upgrade later, and I was running Ubuntu 22.04. I installed nginx, certbot, and grafana-agent, copied my configuration files over from the old Pi (changing the hostnames, of course), and I was re-configured in easily under 30 minutes.
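
    From memory, the sequence looked roughly like this; your package sources for certbot and grafana-agent may differ (I use the certbot snap and Grafana’s apt repository):

    # Bring 20.04 fully up to date, then upgrade the release to 22.04
    sudo apt update && sudo apt -y upgrade
    sudo do-release-upgrade

    # Reinstall the proxy stack
    sudo apt install -y nginx
    sudo snap install --classic certbot
    sudo apt install -y grafana-agent   # assumes Grafana’s apt repo is already configured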

    The most satisfying portion of this project was, oddly enough, changing the DNS entries and port forwarding rules to hit the M5 and watching the log entries switch places:

    The green line is log entries from the M5, the yellow line is log entries from the Pi3. You can see I had some stragglers hitting the old pi, but once everything flushed out, the Pi was no longer in use. I shut it down to give it a little break for now, as I contemplate what to do with it next.

    Initial Impressions

    The M5 is certainly snappier, although load levels are about the same as were reported by the RPi3. The RPi3 was a rock, always on, always working. I hope for the same with the M5, but, unfortunately, only time will tell.

  • Snakes… Why Did It Have To Be Snakes?

    What seems like ages ago, I wrote some Python scripts to keep an eye on my home lab. What I did not realize was that this little introduction to Python would help me dive into the wonderful world of ETL (or ELT, more on that later).

    Manage By Numbers

    I love numbers. Pick your favorite personality profile, and I come out as the cold, calculated, patient person who needs all the information to make a decision. As I ramp up and improve my management skills after a brief hiatus as an individual contributor, I identified a few blind spots that I wanted to address with my team. But first, I need the data, and that data currently lives in our Jira Cloud instance and an add-on called Tempo Timesheets.

    Now, our data teams have started to build out an internal data warehouse for our departments to collect and analyze data from our various sales systems. They established an ELT flow for this warehouse with the following toolsets:

    • Singer.io – Handles data extraction and loading
    • dbt – Defines data transformations
    • Prefect – Orchestrates flows via Singer taps & targets and dbt Core transformations

    Don’t you mean ETL?

    There are two primary data integration methods:

    • ETL – Extract, Transform, and Load
    • ELT – Extract, Load, Transform

    At their core, they have the same basic task: get data from one place to another. The difference lies in where data transformations are processed.

    In ETL, data is extracted from the source system, transformed, and then loaded into the destination system (data warehouse) where it can be analyzed. In this case, the raw data is not stored within the data warehouse: only the transformed data is available.

    ELT, on the other hand, loads raw data directly into the data warehouse, where transformations can be exercised within the data warehouse itself. Since the raw data is stored in the warehouse, multiple transformations can be run without accessing the source system. This allows for data activities such as cleansing, enrichment, and transformation to occur on the same data set with less strain on the source system.

    In our case, an ELT approach made the most sense: we will have different transformations for different departments, including the need to perform point-in-time transformations for auditing purposes.

    Getting Jira Data

    Jira data was, well, the easier part. Singer.io maintains a Jira tap to pull data out of Jira Cloud using Jira’s APIs. “Taps” are connectors to external systems, and Singer has a lot of them. “Targets” are ways to load the data from tap streams into other systems. Singer does not have as many official targets; however, a number of open source contributors maintain additional ones. We are using the Snowflake target to load data into a Snowflake instance.

    Our team built out the data flow, but we were missing Tempo data. Tempo presents some REST APIs, so I figured I could use the tap-jira code as a model to build out a custom tap for Tempo data. And that got me back into Python.
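
    Conceptually, a Singer pipeline is just two processes connected by a pipe: the tap writes JSON records to stdout and the target reads them from stdin. A rough sketch of the pattern, with tap-tempo standing in for my custom tap and the config file names made up:

    # Discover the available streams (some older taps use --properties instead of --catalog)
    tap-jira --config tap_config.json --discover > catalog.json

    # Run the tap and pipe its output into the Snowflake target
    tap-jira --config tap_config.json --catalog catalog.json | target-snowflake --config target_config.json

    # The custom Tempo tap follows the same pattern
    tap-tempo --config tempo_config.json | target-snowflake --config target_config.json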

    Environment Setup

    I’m running WSL2 on Windows 11 with Ubuntu 22.04. I finished up my tap development, using virtualenv to isolate my environment, and things seemed to be working fine; I had even finished testing my Tempo tap. I wanted to test pushing the data into Snowflake, and upon trying to load the target-snowflake library, I got a number of errors about version incompatibility.

    Well, hell. As it turns out, most of the engineers at work are using Windows 10/WSL2 with Ubuntu 20.04. With that, they are running Python 3.8. I was running 3.10. A few Google searches later, I realized that I needed a better way to isolate my environments than virtualenv. Along comes a bigger snake…

    Anaconda

    My Google searches led me to Anaconda. First, I’m extremely impressed that they got that domain name. Second, Anaconda is way more than what I’m using it for: I’m using the environment management, but there is so much more.

    I installed Anaconda according to the Linuxhint.com guide and, within about 10 minutes, I had a virtual environment with Python 3.8 that I could use to build and test my taps and targets. The environment management is, in my mind, much easier than using virtualenv: the conda command can be used to list environments and search for available packages, rather than remembering where you stored your virtual environment files and how you activate them.
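
    A minimal sketch of that workflow, with an arbitrary environment name; the Singer packages themselves still come from PyPI inside the environment:

    # Create and activate an isolated Python 3.8 environment
    conda create --name singer-py38 python=3.8
    conda activate singer-py38

    # The bits I find handier than raw virtualenv
    conda env list       # list every environment, wherever it lives
    conda search numpy   # search your configured channels for available package versions

    # Singer taps and targets still install via pip (package names vary by maintainer)
    pip install tap-jira target-snowflake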

    Not changing careers yet

    Sure, I wrote a tap for Tempo data. Am I going to change over to a data architect? Probably not. But at least I can say I succeeded in simple tap development.

    A Note on Open Source

    I’m a HUGE fan of open source and look to contribute where I can. However, while this tap “works,” it’s definitely not ready for the world. I am only extracting a few of the many objects available in the Tempo APIs. I have no tests to ensure correctness. And, most importantly, I have no documentation to guide users.

    Until those things happen, this tap will be locked up behind closed doors. When I find the time to complete it, I will make it public.

  • My 15 pieces of flair… Cloudflare

    With parts of my home lab exposed to the internet for my own convenience, it is always good to add layers of protection to incoming traffic. At a colleague’s suggestion, I took Cloudflare up on their free WAF offering to help add some protection to my setup. As a bonus, I have a much better DNS for my domains, which made automating my SSL certificate renewals a snap.

    What’s a WAF?

    A Web Application Firewall, or WAF, protects your web applications by processing requests and monitoring for attacks. While not a comprehensive solution, it adds a layer of protection.

    Cloudflare offers cloud-based services which have a variety of price points. For my hobby/home lab sites, well, free as in beer is the best price point for me. Now, you may notice on the price sheet, that “WAF” is not actually included in the free version. The free edition does not let you define custom firewall rules and block, challenge, or log requests that match those rules.

    Well, that’s not 100% accurate: you get 5 active firewall rules with the free plan. Not enough to go crazy, but enough to test if you need it.

    And, well, I do not particularly care: For my home lab, the features of most interest to me are the DNS, Caching, DDoS Protection, and the Managed Ruleset.

    Basic Content Caching

    Cloudflare provides some basic caching on my proxied sites, which definitely helps with sites like WordPress. My PageSpeed Insights scores are almost 100 ms faster on mobile devices (down from 310 ms), which is pretty good. While I have never paid too much attention to page load speeds, it is good to know that I can improve some things while adding a layer of protection.

    DDoS and Managed Rulesets

    Truthfully, I have not read up on much of this, and have left the Cloudflare defaults pretty much intact. Cloudflare’s blog does a good job of explaining the Managed Rules, and their documentation covers the DDoS rulesets.

    Perhaps if I get bored, or need something to put me to sleep at night, I will start reading up on those rulesets. For now, they are in place, which gives me a little more protection than I had without them.

    Cloudflare DNS

    Truthfully, if Cloudflare did nothing more than manage my DNS in a way that allowed certbot to automatically renew my Let’s Encrypt certificates, I would have still moved everything over. Prior to the cutover, I was using GoDaddy’s DNS management and, well, it’s a pain. GoDaddy is very good at selling websites, but DNS management is clearly very low on their list. Cloudflare’s DNS, meanwhile, is simple to manage both through their portal and through the APIs.

    With my DNS moved over, I revisited the certificates on my internal reverse proxy. Following the instructions from Vineet Choudhary over at developerinsider.co, I updated certbot to renew using the Cloudflare plugin.
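
    For reference, the setup boils down to installing the DNS plugin for the certbot snap, pointing it at a Cloudflare API token, and requesting the certificate with a DNS-01 challenge. The domain and file paths below are placeholders:

    # Install the Cloudflare DNS plugin for the certbot snap
    sudo snap set certbot trust-plugin-with-root=ok
    sudo snap install certbot-dns-cloudflare

    # ~/.secrets/certbot/cloudflare.ini (chmod 600) holds a scoped API token:
    #   dns_cloudflare_api_token = <your token>

    # Issue or renew the certificate; no inbound HTTP access required
    sudo certbot certonly --dns-cloudflare \
      --dns-cloudflare-credentials ~/.secrets/certbot/cloudflare.ini \
      -d example.com -d '*.example.com'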

    Automagic Renewals?

    In the past, with certbot-auto, you would have to set up a cron job to handle automatic renewals. The new certbot snap, however, uses systemd timers to achieve the same. So, with my certificates renewed using the correct plugin, I ran a quick test:

    sudo certbot renew --dry-run

    The dry run succeeded without issue. So I checked the timers with the following command:

    systemctl list-timers

    Lo and behold, the certbot timer is scheduled to run in the middle of the night.

    Restarting Nginx on Certbot renewals

    There is one small issue: even though I am using certonly to obtain or renew certificates without editing the Nginx configuration, I AM using Nginx as a reverse proxy. Therefore, I need a way to restart Nginx after certbot has done its thing. I found this short article and followed the instructions to edit the /etc/letsencrypt/cli.ini file with a deploy hook.
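
    In my case, the hook is a one-liner; a reload is gentler than a full restart, but either works:

    # /etc/letsencrypt/cli.ini
    deploy-hook = systemctl reload nginx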

    The article above noted that, to test, you can run the following:

    certbot renew --dry-run

    However, for me, this did NOT trigger the deploy hook. To force triggering the deploy hook, I needed to run this command:

    sudo certbot renew --dry-run --run-deploy-hooks

    This command executed the renewal dry run and successfully reloaded Nginx.

    Minimal Pieces of Cloudflare

    Sure, I have only scratched the surface of Cloudflare’s offerings by adding some free websites and proxying some content. But, as I mentioned, it adds a layer of protection that I did not have before. And, in this day and age, the wire coming into the house presents a bigger security threat than the front door.

  • Badges… We don’t need no stinkin’ badges!

    Well… Maybe we do. This is a quick plug (no reimbursement of any kind) for the folks over at Shields.io, who make creating custom badges for readme files and websites an easy and fun task.

    A Quick Demo

    License for spyder007/MMM-PrometheusAlerts
    Build Status for spyder007/MMM-PrometheusAlerts

    The badges above are generated from Shields.io. The first link looks like this:

    https://img.shields.io/github/license/spyder007/MMM-PrometheusAlerts

    My Github username (spyder007) and the repository name (MMM-PrometheusAlerts) are used in the Image URL, which generates the badge. The second one, build status, looks like this:

    https://img.shields.io/github/actions/workflow/status/spyder007/MMM-PrometheusAlerts/node.js.yml

    In this case, my Github username and the repository name remain the same, but node.js.yml is the name of the workflow file for which I want to display the status.
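
    Dropping either of those into a README is just standard Markdown image syntax; the alt text is whatever you like:

    ![License](https://img.shields.io/github/license/spyder007/MMM-PrometheusAlerts)
    ![Build Status](https://img.shields.io/github/actions/workflow/status/spyder007/MMM-PrometheusAlerts/node.js.yml)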

    Every badge in Shields.io has a “builder” page that explains how to build the image URL and allows you to override styles, colors, and labels, and even add logos from any icon in the Simple Icons collection.

    Some examples of alterations to my build status above:

    “For the Badge” style, Bugatti Logo with custom color
    Flat style, CircleCI logo, Custom label

    Too many options to list…

    Now, these are live badges, meaning, if my build fails, the green “passing” will go to a red “failing.” Shields.io does this by using the variety of APIs available to gather data about builds, code coverage, licenses, chat, sizes and download counts, funding, issue tracking… It’s a lot. But the beauty of it is, you can create Readme files or websites with easy-to-read visuals. My PI Monitoring repository’s Readme makes use of a number of these shields to give you a quick look at the status of the repo.

  • Building Software Longevity

    The “Ship of Theseus” thought experiment is an interesting way to start fights with historians, but in software, replacing old parts with new parts is required for longevity. Designing software so that every piece can be replaced is vital to building software for the future.

    The Question

    The Wikipedia article presents the experiment as follows:

    According to legend, Theseus, the mythical Greek founder-king of Athens, rescued the children of Athens from King Minos after slaying the minotaur and then escaped onto a ship going to Delos. Each year, the Athenians commemorated this legend by taking the ship on a pilgrimage to Delos to honor Apollo. A question was raised by ancient philosophers: After several centuries of maintenance, if each individual part of the Ship of Theseus was replaced, one at a time, was it still the same ship?

    Ship of Theseus – Wikipedia

    The Software Equivalent

    Consider Microsoft Word. Released in 1983, Word is approaching its 40th anniversary. And, while I do not have access to its internal workings, I am willing to bet that most, if not all, of the 1983 code has since been replaced by updated modules. So, while it is still called Word, its parts are much newer than the original 1983 iteration. I am sure if I sat here long enough, I could identify other applications with similar histories. The branding does not change, the core functionality does not change; only the code changes.

    Like the wood on the Ship of Theseus, software rots. And it rots fast. Frameworks and languages evolve quickly to take advantage of hardware updates, and the software that uses those must do the same.

    Design for Replacement

    We use the term technical debt to categorize the conscious choices we make to prioritize time to market over perfect code. It is worthwhile to consider that software has a “half-life” or “depreciation factor” as well: while your code may work today, chances are, without appropriate maintenance, it will rot into something that is no longer able to serve the needs of your customers.

    If I had a “one-size-fits-all” solution to software rot, I would probably be a millionaire. The truth is, product managers, engineers, and architects must be aware of software rot. Not only must we invest the time into fixing the rot, but we must design our products to allow every piece of the application to be replaced, even as the software still stands.

    This is especially critical in Software as a Service (SaaS) offerings, where we do not have the luxury of long maintenance windows for upgrades or replacements. To our customers, we should appear to operate continuously, with upgrades happening almost invisibly. This requires the foresight to build replaceable components and the commitment to fixing and replacing components regularly. If you cannot replace components of your software, there will come a day when your software no longer functions for your customers.

  • Speeding up Packer Hyper-V Provisioning

    I spent a considerable amount of time working through the provisioning scripts for my RKE2 nodes. Each node took between 25 and 30 minutes to provision. I felt like I could do better.

    Check the tires

    A quick evaluation of the process made me realize that most of the time was spent in the full install of Ubuntu. Using the hyperv-iso builder plugin from Packer, each machine was provisioned from scratch. The installer took about 18-20 minutes to provision the VM fully. After that, the RKE2 install took about 1-2 minutes.

    Speaking with my colleague Justin, it occurred to me that I could probably get away with building out a base image using the hyperv-iso builder and then using the hyperv-vmcx builder to copy that base and create a new machine. In theory, that would cut the 18-20 minutes down to a copy job.
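
    The end result is a two-stage flow: pay the ISO install cost once, then clone from the exported base for every node. A rough sketch of the idea; the template names and the vm_name variable are hypothetical:

    # Stage 1 (run occasionally): full Ubuntu install from ISO via the hyperv-iso builder
    packer build ./templates/ubuntu-base.pkr.hcl

    # Stage 2 (per node): clone the exported base via the hyperv-vmcx builder, then run the RKE2 scripts
    packer build -var "vm_name=rke2-server-01" ./templates/rke2-node.pkr.hcl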

    Test Flight Alpha: Initial Cluster Provisioning

    A quick copy of my existing full-install template and some judicious editing got me to the point where the hyperv-vmcx builder was running great and producing a VM. I had successfully cut my provisioning time down to under 5 minutes!

    I started editing my Rke2-Provisioning PowerShell module to utilize the quick provisioning rather than the full provisioning. Then I spun up a test cluster with four nodes (three servers and one agent) to make sure everything came up correctly. And within about 30 minutes, that four-node cluster was humming along, in a quarter of the time it had taken me before.

    Test Flight Beta: Node Replacement

    The next bit of testing was to ensure that as I ran the replacement script, new machines were provisioned correctly and old machines were torn down. This is where I ran into a snag, but it was a bit difficult to detect at first.

    During the replacement, the first new node would come up fine, and the old node was properly removed and deleted. So, after the first cycle, I had one new node, and one old node had been removed. However, I was getting somewhat random problems with the second, third, and fourth cycles. Most of the time, it was that the etcd server, during RKE2 provisioning, was picking up an IP address from the DHCP range instead of using the fixed IP tied to its MAC address.

    Quick Explanation

    I use the Unifi Controller to run my home network (Unifi Security Gateway and several access points). Through the Unifi APIs, and a wrapper API I wrote, I am able to generate a valid Hyper-V MAC address and associate it with a fixed IP on the Unifi before the Hyper-V is ever configured. When I create a new machine, I assign it the MAC address that was generated, and my DHCP server always assigns it the same address. This IP is outside of the allocated DHCP range for normal clients. I am working on publishing the Unifi IP Wrapper in a public repository for consumption.

    Back to it..

    As I was saying, even though I was assigning a MAC address that had an associated fixed IP, VMs provisioned after the first one seemed to be failing to pick up that IP. What was different?

    Well, deleting a node returns its IP to the pool, so the process looks something like this:

    • First new node provisioned (IP .45 assigned)
    • First old node deleted (return IP .25 to the pool)
    • Second new node provisioned (IP .25 assigned)

    My assumption is that the Unifi does not like such a quick reassignment of a static IP to a new MAC Address. To test this, I modified the provisioner to first create ALL the new nodes before deleting nodes.

    In that instance, the nodes provisioned correctly using their newly assigned IPs. However, from a resource perspective, I hate the thought of having to run 2n nodes during provisioning, when really all I need is n + 1.

    Test Flight Charlie: Changing IP assignments

    I modified my Unifi Wrapper API to cycle through the IP block I have assigned to my VMs instead of simply always using the lowest available IP. This allows me to go back to replacement one by one, without worrying about IP/MAC Address conflicts on the Unifi.

    Landing it…

    With this improvement, I have fewer qualms about configuring provisioning to run in the evenings. Most likely, I will build the base Ubuntu image weekly or bi-weekly to ensure I have the latest updates. From there, I can use the replacement scripts to replace old nodes with new nodes in the cluster.

    I have not decided if I’m going to use a simple task scheduler in Windows, or use an Azure DevOps build agent on my provisioner… Given my recent miscue when installing the Azure DevOps Build Agent, I may opt for the former.