Category: Home Lab

  • Kubernetes Observability, Part 1 – Collecting Logs with Loki

    This post is part of a series on observability in Kubernetes clusters.

    I have been spending an inordinate amount of time wrapping my head around Kubernetes observability for my home lab. Rather than consolidate all of this into a single post, I am going to break up my learnings into bite-sized chunks. We’ll start with collecting cluster logs.

    The Problem

    Good containers generate a lot of logs. Outside of getting into the containers via kubectl, logging is the primary mechanism for identifying what is happening within a particular container. We need a way to collect the logs from various containers and consolidate them in a single place.

    My goal was to find a log aggregation solution that gives me insights into all the logs in the cluster, without needing special instrumentation.

    The Candidates – ELK versus Loki

    For a while now, I have been running an ELK (Elasticsearch/Logstash/Kibana) stack locally. My hobby projects utilized an Elasticsearch sink for Serilog to ship logs directly from those containers to Elasticsearch. I found that I could install Filebeat into the cluster and ship all container logs to Elasticsearch, which allowed me to gather the logs across containers.

    ELK

    Elasticsearch is a beast. Its capabilities are quite impressive as a document and index solution. But those capabilities make it really heavy for what I wanted, which was “a way to view logs across containers.”

    For a while, I have been running an ELK instance on my internal tools cluster. It has served its purpose: I am able to report logs from Nginx via Filebeat, and my home containers are built with a Serilog sink that reports logs directly to Elasticsearch. I recently discovered how to install Filebeat onto my K8s clusters, which allows it to pull logs from the containers. This, however, exploded my Elastic instance.

    Full disclosure: I’m no Elasticsearch administrator. Perhaps, with proper experience, I could make that work. But Elastic felt heavy, and I didn’t feel like I was getting value out of the data I was collecting.

    Loki

    A few of my colleagues suggested Grafana Loki as a potential solution for log aggregation. I attempted an installation to compare the two solutions.

    Loki is a log aggregation system which provides log storage and querying. It is not limited to Kubernetes: there are a number of official clients for sending logs, as well as some unofficial third-party ones. Loki stores your incoming logs (see Storage below), creates indices on some of the log metadata, and provides a custom query language (LogQL) to allow you to explore your logs. Loki integrates with Grafana for visual log exploration, LogCLI for command line searches, and Prometheus AlertManager for routing alerts based on logs.

    One of the clients, Promtail, can be installed on a cluster to scrape logs and report them back to Loki. My colleague suggested a Loki instance on each cluster. I found a few discussions in Grafana’s Github issues section around this, but it led to a pivotal question.

    Loki per Cluster or One Loki for All?

    I laughed a little as I typed this, because the notion of “multiple Lokis” is explored in a decidedly different context in the Marvel series. My options were less exciting: do I have one instance of Loki that collects data from different clients across the clusters, or do I allow each cluster to have its own instance of Loki, and use Grafana to attach to multiple data sources?

    Why consider Loki on every cluster?

    Advantages

    Decreased network chatter – If every cluster has a local Loki instance, then logs for that cluster do not have far to go, which means minimal external network traffic.

    Localized Logs – Each cluster would be responsible for storing its own log information, so finding logs for a particular cluster is as simple as going to the cluster itself.

    Disadvantages

    Non-centralized – There is no way to query logs across clusters without some additional aggregator (like another Loki instance), which would mean duplicative data storage.

    Maintenance Overhead – Each cluster’s Loki instance must be managed individually. This can be automated using ArgoCD and the cluster generator, but it still means that every cluster has to run Loki.

    Final Decision?

    For my home lab, Loki fits the bill. The installation was easy-ish (if you are familiar with Helm and not afraid of community forums), and once configured, it gave me the flexibility I needed with easy, declarative maintenance. But, which deployment method?

    Honestly, I was leaning a bit towards the individual Loki instances. So much so that I configured Loki as a cluster tool and deployed it to all of my instances. And that worked, although swapping around Grafana data sources for various logs started to get tedious. And, when I thought about where I should report logs for other systems (like my Raspberry Pi-based Nginx proxy), doubts crept in.

    Thinking about using an ELK stack, I certainly would not put an instance of Elasticsearch on every cluster. While Loki is a little lighter than Elasticsearch, it’s still heavy enough that it’s worthy of a single, properly configured instance. So I removed the cluster-specific Loki instances and configured one instance.

    With Promtail deployed via an ArgoCD ApplicationSet with a cluster generator, I was off to the races.
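
    For reference, a trimmed-down sketch of that kind of ApplicationSet is below. The chart version, Loki URL, and namespace are placeholders rather than my exact configuration, but the shape is what matters: the cluster generator stamps out one Promtail Application per registered cluster.

    apiVersion: argoproj.io/v1alpha1
    kind: ApplicationSet
    metadata:
      name: promtail
      namespace: argocd
    spec:
      generators:
        - clusters: {}                # one Application per cluster registered in Argo
      template:
        metadata:
          name: 'promtail-{{name}}'   # {{name}} comes from the cluster secret
        spec:
          project: default
          source:
            repoURL: https://grafana.github.io/helm-charts
            chart: promtail
            targetRevision: 6.15.0    # placeholder: pin to whatever chart version you actually run
            helm:
              values: |
                config:
                  clients:
                    - url: http://loki.example.lan:3100/loki/api/v1/push
          destination:
            server: '{{server}}'
            namespace: monitoring
          syncPolicy:
            automated:
              selfHeal: true
            syncOptions:
              - CreateNamespace=true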

    A Note on Storage

    Loki has a few storage options, with the majority being cloud-based. At work, with storage options in both Azure and GCP, this is a non-issue. But for my home lab, well, I didn’t want to burn cash storing logs when I have a perfectly good SAN sitting at home.

    My solution there was to stand up an instance of MinIO to act as S3 storage for Loki. Now, could I have run MinIO on Kubernetes? Sure. But, in all honesty, it got pretty confusing pretty quickly, and I was more interested in getting Loki running. So I spun up a Hyper-V machine with Ubuntu 22.04 and started running MinIO. Maybe one day I’ll work on getting MinIO running on one of my K8s clusters, but for now, the single machine works just fine.
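
    For reference, pointing Loki at an S3-compatible endpoint looks roughly like the snippet below. The exact keys move around a bit between Loki versions and Helm chart values layouts, and the host, bucket, and credentials here are placeholders.

    storage_config:
      aws:
        endpoint: minio.example.lan:9000      # the MinIO host, not a real AWS endpoint
        insecure: true                        # plain HTTP inside the lab
        bucketnames: loki-logs
        access_key_id: <minio access key>
        secret_access_key: <minio secret key>
        s3forcepathstyle: true                # required for MinIO-style path addressing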

  • An Impromptu Home Lab Disaster Recovery Session

    It has been a rough 90 days for my home lab. We have had a few unexpected power outages which took everything down, and, after those unexpected outages, things came back up on their own. Over the weekend, I was doing some electrical work outside, wiring up outlets and lighting. Being safety conscious, I killed the power to the breaker I was tying into inside, not realizing it was the same breaker that the server was on. My internal dialog went something like this:

    • “Turn off breaker Basement 2”
    • ** clicks breaker **
    • ** Hears abrupt stop of server fans **
    • Expletive….

    When trying to recover from that last sequence, I ran into a number of issues.

    • I was CRUSHING that server when it came back up: having 20 VMs attempt to start simultaneously caused a lot of resource contention.
    • I had to run fsck manually on a few machines to get them back up and running.
    • Even after getting the machines running, ETCD was broken on two of my four clusters.

    Fixing my Resource Contention Issues

    I should have done this from the start, but all of my VMs had their Automatic Start Action set to Restart if previously running. That’s great in theory, but, in practice, starting 20 or so VMs on the same hypervisor is not recommended.

    Part of Hyper-V’s Automatic Start Action panel is an Automatic Startup Delay. In Powershell, it is the AutomaticStartDelay property on the VirtualMachine object (what’s returned from a Get-VM call). My ultimate goal is to set that property to stagger start my VMs. And I could have manually done that and been done in a few minutes. But, how do I manage that when I spin up new machines? And can I store some information on the VM to reset that value as I play around with how long each VM needs to start up?

    Groups and Offsets

    All of my VMs can be grouped based on importance. And it would have been easy enough to start 2-3 VMs in group 1, wait a few minutes, then do group 2, etc. But I wanted to be able to assign offsets within the groups to better address contention. In an ideal world, the machines would come up sequentially to a point, and then 2 or 3 at a time after the main VMs have started. So I created a very simple JSON object to track this:

    {
      "startGroup": 1,
      "delayOffset": 120
    }

    There is a free-text Notes field on the VirtualMachine object, so I used that to set a startGroup and delayOffset for each of my VMs. Using a string of Powershell commands, I was able to get a tabular output of my custom properties:

    get-vm | Select Name, State, AutomaticStartDelay, @{n='ASDMin';e={$_.AutomaticStartDelay / 60}}, @{n='startGroup';e= {(ConvertFrom-Json $_.Notes).startGroup}}, @{n='delayOffset';e= {(ConvertFrom-Json $_.Notes).delayOffset}} | Sort-Object AutomaticStartDelay | format-table
    • Get-VM – Get a list of all the VMs on the machine
    • Select Name, ... – The Select statement (alias for Select-Object) pulls values from the object. There are two calculated properties that pull values from the Notes field as a JSON object.
    • Sort-Object – Sort the list by the AutomaticStartDelay property
    • Format-Table – Format the response as a table.

    At that point, the VM had its startGroup and delayOffset, but how can I set the AutomaticStartDelay based on those? More Powershell!!

    get-vm | Select Name, State, AutomaticStartDelay, @{n='startGroup';e= {(ConvertFrom-Json $_.Notes).startGroup}}, @{n='delayOffset';e= {(ConvertFrom-Json $_.Notes).delayOffset}} |? {$_.startGroup -gt 0} | % { set-vm -name $_.name -AutomaticStartDelay ((($_.startGroup - 1) * 480) + $_.delayOffset) }

    The first two commands are the same as the above, but after that:

    • ? {$_.startGroup -gt 0} – Use Where-Object (? alias) to select VMs with a startGroup value
    • % { set-vm -name ... } – Use ForEach-Object (% alias) to iterate over that group and set the AutomaticStartDelay.

    In the command above, I hard-coded the AutomaticStartDelay to the following formula:

    ((startGroup - 1) * 480) + delayOffset

    With this formula, the server will wait 8 minutes (480 seconds) between groups, and add a delay within the group should I choose. As an example, my domain controllers carry the following values:

    # Primary DC
    {
      "startGroup": 1,
      "delayOffset": 0
    }
    # Secondary DC
    {
      "startGroup": 1,
      "delayOffset": 120
    }

    The calculated delay for my domain controllers is 0 and 120 seconds, respectively. The next group won’t start until the 480-second (8-minute) mark, which gives my DCs at least six minutes on their own to boot up.

    Now, there will most likely be some tuning involved in this process, which is where the extra complexity becomes helpful: say I can boot 2-3 machines every 3 minutes… I can just re-run the population command with a new formula.
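
    As a rough sketch of what that re-run might look like, here is the same population command with the group interval pulled out into a variable (180 seconds in this hypothetical):

    # Hypothetical re-run: 3 minutes between groups instead of 8
    $groupInterval = 180
    Get-VM |
        Select-Object Name, @{n='startGroup';e={(ConvertFrom-Json $_.Notes).startGroup}}, @{n='delayOffset';e={(ConvertFrom-Json $_.Notes).delayOffset}} |
        Where-Object { $_.startGroup -gt 0 } |
        ForEach-Object { Set-VM -Name $_.Name -AutomaticStartDelay ((($_.startGroup - 1) * $groupInterval) + $_.delayOffset) }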

    Did I over-engineer this? Probably. But the point is, use AutomaticStartDelay if you are running a lot of VMs on a Hypervisor.

    Restoring ETCD

    Call it fate, but that last power outage ended up causing ETCD issues in two of my clusters. I had to run fsck manually on a few of my servers to repair the file system. Even when the servers were up and running, two of my clusters had problems with their ETCD services.

    In the past, my solution to this was “nuke the cluster and rebuild,” but I am trying to be a better Kubernetes administrator, so this time, I took the opportunity to actually read the troubleshooting documentation that Rancher provides.

    Unfortunately, I could not get past “step one”: ETCD was not running. Knowing that it was most likely a corruption of some kind and that I had a relatively up-to-date ETCD snapshot, I did not burn too much time before going to the restore.

    rke etcd snapshot-restore --name snapshot_name_etcd --config cluster.yml

    That command worked like a charm, and my clusters were back up and running.

    To Do List

    I have a few things on my to-do list following this adventure:

    1. Move ETCD snapshots off of the VMs and onto the SAN. I would have had a lot of trouble bringing ETCD back up if the node holding those snapshots had been the one that went down.
    2. Update my Packer provisioning scripts to include writing the startup configuration to the VM notes (see the sketch after this list).
    3. Build an API wrapper that I can run on the server to manage the notes field.
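
    For that second to-do item, the Packer post-processing step could be as simple as the following (a rough sketch; the VM name and values are hypothetical, and the real numbers would come from my provisioning parameters):

    # Stamp the startup metadata into the new VM's Notes field after provisioning
    $startupConfig = @{ startGroup = 2; delayOffset = 60 } | ConvertTo-Json -Compress
    Set-VM -Name "new-vm-name" -Notes $startupConfig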

    I am somewhat interested in testing how the AutomaticStartDelay changes will affect my server boot time. However, I am planning on doing that on a weekend morning during a planned maintenance, not on a random Thursday afternoon.

  • Nginx Reverse proxy: A slash makes all the difference.

    I have been doing some work to build up some standard processes for Kubernetes. ArgoCD has become a big part of that, as it allows us to declaratively manage the state of our clusters.

    After recovering from a small blow-up in the home lab (post coming), I wanted to migrate my cluster tools to utilize the label selector feature of the ApplicationSet’s Cluster Generator. Why? It allows me to selectively manage tools in the cluster. After all, not every cluster needs the nfs-external-provisioner to provide a StorageClass for pod file storage.

    As part of this, I wanted to deploy tools to the local Argo cluster. In order to do that, the local cluster needs a secret. I tried to follow the instructions, but when I clicked to view the cluster details, I got a 404 error. I dug around the logs, and my request wasn’t even getting to the Argo application server container.

    When I looked at the Ingress controller logs, it showed the request looked something like this:

    my.url.com/api/v1/clusters/http://my.clusterurl.svc

    Obviously, that’s not correct. The call coming from the UI is this:

    my.url.com/api/v1/clusters/http%3A%2F%2Fmy.clusterurl.svc

    Notice the encoding: My Nginx reverse proxy (running on a Raspberry Pi outside my main server) was decoding the request before passing it along to the cluster.

    The question was, why? A quick Google search led right to their documentation:

    • If proxy_pass is specified with URI, when passing a request to the server, part of a normalized request URI matching the location is replaced by a URI specified in the directive
    • If proxy_pass is specified without URI, a request URI is passed to the server in the same form as sent by a client when processing an original request

    What? Essentially, it means the slash at the end of proxy_pass dictates whether Nginx normalizes (and, in my case, decodes) the request URI before passing it along.

    ## With a slash at the end, the client request is normalized
    location /name/ {
        proxy_pass http://127.0.0.1/remote/;
    }
    
    ## Without a slash, the request is as-is
    location /name/ {
        proxy_pass http://127.0.0.1;
    }

    After removing the trailing slash from my proxy_pass directive, the Argo UI was able to load the cluster details correctly.

  • The cascading upgrade continues…

    I mentioned in a previous post that I made the jump to Windows 11 on both my work and home laptops. As it turns out, this is causing me to re-evaluate some of my other systems and upgrade them as needed.

    Old (really old) Firmware

    The HP ProLiant Gen8 Server I have been running for a few years had a VERY old firmware version on it. When I say “very old” I mean circa 2012. For all intents and purposes, it did its job. The few times I use it, however, are the most critical: i.e., the domain controller VMs won’t come up, and I need to use the remote console to log in to the server and restart something.

    This particular version of the iLO firmware worked best in Internet Explorer, particularly for the remote access portion. Additionally, I had never taken the time to create a proper SSL certificate for the iLO interface, which usually meant a few refreshes were required to get in.

    In Windows 11 and Edge, this just was not possible. The security settings blocked access because of the invalid SSL certificate. Additionally, remote access required a ClickOnce application running on .NET Framework 3.5. So even if I got past the invalid SSL (which I did), the remote console would not work.

    Time for an upgrade?

    When I first set up this server in 2018, I vaguely remember looking for and not finding firmware updates for the iLO. Clearly I was mistaken: the Gen8 runs iLO 4, which has firmware updates as recent as April of 2022. After reading through the release notes and installation instructions, I felt pretty confident that this upgrade would solve my issue.

    The upgrade process was pretty easy: extract the .bin firmware from the installer, upload via the iLO interface, and wait a bit for the iLO to restart. At that point, I was up and running with a new and improved interface.

    Solving the SSL Issue

    The iLO generates a self-signed, but backdated, SSL certificate. You can customize it, but only by generating a CSR via the iLO interface, getting a certificate back, and importing that certificate into the iLO. I really did not want to go through the hassle of standing up a full certificate authority, or figure out how to use Let’s Encrypt to fulfill the CSR, so I took a slightly different path:

    1. Generate a self-signed root CA Certificate.
    2. Generate the CSR from the iLO interface.
    3. Sign the CSR with the self-signed root CA.
    4. Install the root CA as a Trusted Root Certificate on my local machine.
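
    Steps 1 and 3 boil down to a couple of openssl commands, roughly like the following (the file names and validity periods are mine, not anything prescribed by the iLO):

    # 1. Create a throwaway root CA (good for ~10 years)
    openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes \
      -keyout homelab-ca.key -out homelab-ca.crt -subj "/CN=Home Lab Root CA"

    # 3. Sign the CSR downloaded from the iLO with that CA
    openssl x509 -req -in ilo.csr -CA homelab-ca.crt -CAkey homelab-ca.key \
      -CAcreateserial -out ilo.crt -days 825 -sha256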

    This lets me connect to the iLO interface without SSL errors, which is enough for me.

    A lot has changed in 10 years…

    The iLO 4 interface got a nice facelift in the past 10 years. A new REST API lets me pull data from my server, including power and thermal readings. Most importantly, the Remote Console got an upgrade to an HTML5 interface, which means I do not have to rely on Internet Explorer anymore. I am pretty happy with the ease of the process, although I wish I had known about and done this sooner.
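
    As a quick example of the REST side, pulling thermal data is a single authenticated GET against the Redfish endpoint (the host and credentials below are placeholders, and the exact paths can vary by firmware version):

    # -k skips certificate validation; drop it once the CA above is trusted
    curl -sk -u admin:password https://ilo.example.lan/redfish/v1/Chassis/1/Thermal/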

  • Breaking an RKE cluster in one easy step

    With the release of Ubuntu’s latest LTS release (22.04, or “Jammy Jellyfish”), I wanted to upgrade my Kubernetes nodes from 20.04 to 22.04. What I had hoped would be an easy endeavor turned out to be a weeks-long process with destroyed clusters and, ultimately, an ETCD issue.

    The Hypothesis

    As I viewed it, I had two paths to this upgrade: in-place upgrade on the nodes, or bring up new nodes and decommission the old ones. As the latter represents the “Yellowstone” approach, I chose that one. My plan seemed simple:

    • Spin up new Ubuntu 22.04 nodes using Packer.
    • Add the new nodes to the existing clusters, assigning the new nodes all the necessary roles (I usually have 1 control_plane, 3 etcd, and all nodes as workers) – see the sketch after this list
    • Remove the control_plane from the old node and verify connectivity
    • Remove the old nodes (cordon, drain, and remove)
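
    For context, the roles live in RKE’s cluster.yml, roughly like this (the hostnames are made up, and RKE spells the role controlplane rather than control_plane):

    nodes:
      - address: k8s-node-01.lab.local
        user: ubuntu
        role: [controlplane, etcd, worker]
      - address: k8s-node-02.lab.local
        user: ubuntu
        role: [etcd, worker]
      - address: k8s-node-03.lab.local
        user: ubuntu
        role: [etcd, worker]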

    Internal Cluster

    After updating my Packer scripts for 22.04, I spun up new nodes for my internal cluster, which has an ELK stack for log collection. I added the new nodes without a problem, and thought that maybe I could combine the last two steps and just remove all the nodes at the same time.

    That ended up with the RKE CLI getting stuck checking ETCD health. I may have gotten a little impatient and just killed the RKE CLI process mid-job. This left me with, well, a dead internal cluster. So, I recovered the cluster (see my note on cluster recovery below) and thought I’d try again with my non-production cluster.

    Non-production Cluster

    Some say that the definition of insanity is doing the same thing and expecting a different result. My logic, however, was that I made two mistakes the first time through:

    • Trying to remove the controlplane alongside the etcd nodes in the same step
    • Killing the RKE CLI command mid-stream

    So I spun up a few new non-production nodes, added them to the cluster, and simply removed controlplane from the old node.

    Success! My new controlplane node took over, and cluster health seemed good. And, in the interest of only changing one variable at a time, I decided to try and remove just one old node from the cluster.

    Kaboom….

    The same etcd issue as before. So I recovered the cluster and returned to the drawing board.

    Testing

    At this point, I only had my Rancher/Argo cluster and my production cluster, which houses, among other things, this site. I had no desire for wanton destruction of these clusters, so I set up a test cluster to see if I could replicate my results. I was able to, at which point I turned to the RKE project in Github for help.

    After a few days, someone pointed me to a relatively new Rancher issue describing my predicament. If you read through those various issues, you’ll find that etcd 3.5 has a bug where node removal can corrupt its database, causing problems such as mine. The bug was corrected in 3.5.3.

    I upgraded my RKE CLI and ran another test with the latest Rancher Kubernetes version. This time, finally, success! I was able to remove etcd nodes without crashing the cluster.

    Finishing up / Lessons Learned

    Before doing anything, I upgraded all of my clusters to the latest supported Kubernetes version. In my case, this is v1.23.6-rancher1-1. Following the steps above, I was, in fact, able to progress through upgrading both my Rancher/Argo cluster and my production cluster without bringing down the clusters.

    Lessons learned? Well, patience is key (don’t kill cluster management processes mid-effort), but also, sometimes it is worth a test before you try things. Had any of these clusters NOT been my home lab clusters, this process, seemingly simple, would have incurred downtime in more important systems.

    A note on Cluster Recovery

    For both the internal and non-production clusters, I could have scrambled to recover the ETCD volume for that cluster and brought it back to life. But I realized that there was no real valuable data in either cluster. The ELK logs are useful in real time, but I have not started down the path of analyzing history, so I didn’t mind losing them. And even those are on my SAN, and the PVCs get archived when no longer in use.

    Instead of a long, drawn out recovery process, I simply stood up brand new clusters, pointed my instance of Argo to them and updated my Argo applications to deploy to the new cluster. Inside of an hour, my apps were back up and running. This is something of a testament to the benefits of storing a cluster’s state in a repository: recreation was nearly automatic.

  • GitOps: Moving from Octopus to … Octopus?

    Well, not really, but I find it a bit cheeky that ArgoCD’s icon is, in fact, an orange octopus.

    There are so many different ways to provision and run Kubernetes clusters that, without some sense of standardization across the organization, Kubernetes can become an operations nightmare. And while a well-run operations environment can allow application developers to focus on their applications, a lack of standards and best practices for container orchestration can divert resources from application development to figuring out how a particular cluster was provisioned or deployed.

    I have spent a good bit of time at work trying to consolidate our company approach to Kubernetes. Part of this is working with our infrastructure teams to level set on what a “bare” cluster looks like. The other part is standardizing how the clusters are configured and managed. For this, a colleague turned me onto GitOps.

    GitOps – Declarative Management

    I spent a LOT of time just trying to understand how all of this works. For me, the thought of “well, in order to deploy I have to change a Git repo” seemed, well, icky. I like the idea that code changes, things are built, tested, and deployed. I had a difficult time separating application code from declarative state. Once I removed that mental block, however, I was bound and determined to move my home environments from Octopus Deploy pipelines to ArgoCD.

    I did not go on this journey alone. My colleague Oleg introduced me to ArgoCD and the External Secrets operator. I found a very detailed article with flow details by Florian Dambrine on Medium.com. Likewise, Burak Kurt showed me how to have Argo manage itself. Armed with all of this information, I started my journey.

    The “short short” version

    In an effort to retain the few readers I have, I will do my best to consolidate all of this into a few short paragraphs. I have two types of applications in my clusters: external tools and home-grown apps.

    External Tools

    Prior to all of this, my external tools were managed by running Helm upgrades periodically with values files saved to my PC. These tools included “cluster tools” which are installed on every cluster, and then things like WordPress and Proget for running the components of this site. Needless to say, this was highly risky: had I lost my PC, all those values files would have been gone.

    I now have tool definitions stored either in the cluster’s specific Git repo, or in a special “cluster-tools” repository that allows me to define applications that I want installed to all (or some) of my clusters, based on labels in the cluster’s Argo secret. This allows me to update tools by updating the version in my Git repository and committing the change.
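
    The wiring for that is fairly light: the cluster-tools ApplicationSet uses a cluster generator with a label selector, and the labels live on the cluster’s Argo secret. A rough sketch of the two pieces (the tool label name is made up):

    # On the cluster's Argo secret (argocd namespace)
    metadata:
      labels:
        argocd.argoproj.io/secret-type: cluster
        nfs-provisioner: "true"            # hypothetical opt-in label for a tool

    # In the cluster-tools ApplicationSet's generator
    generators:
      - clusters:
          selector:
            matchLabels:
              nfs-provisioner: "true"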

    It should be noted that, for these tools, Helm is still used to install/upgrade. More on this later.

    Home Grown Applications

    The home-grown apps had more of a development feel: feature branches push to a test environment, while builds from main get pushed to staging and then, upon my approval in Azure DevOps, pushed to production.

    Prior to the conversion, every build produced a new container image and Helm chart, both of which were published to Proget. From there, Octopus Deploy deployed feature builds to the test environment only, and handled stage and production deployments based on nudges from Azure DevOps tasks.

    Using Florian’s described flow, I created a Helmfile repo for each of my projects, which allowed me to consolidate the application charts into a single repository. Using Helmfile and Helm, I generate manifests that are then committed into the appropriate cluster’s Git repository. Each Helmfile repository has its own pipeline for generating manifests and committing them to the cluster repository, but my individual project pipelines have gotten very simple: build a new version, and change the Helmfile repository to reflect the new image version.
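
    A helmfile.yaml in one of those repositories looks roughly like the sketch below; the repository URL, chart, and environment names are placeholders rather than my actual setup.

    repositories:
      - name: proget
        url: https://proget.example.lan/helm/internal    # placeholder feed URL
    environments:
      staging: {}
      production: {}
    releases:
      - name: my-api
        namespace: apps
        chart: proget/my-api
        version: 1.4.2                                   # bumped by the app's build pipeline
        values:
          - environments/{{ .Environment.Name }}/values.yaml

    The application pipeline’s only job is to bump that version line; the Helmfile repository’s pipeline then runs something like helmfile -e production template and commits the rendered manifests to the cluster repository for Argo to pick up.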

    Once the cluster’s repository is updated, Argo takes care of syncing the resources.

    Helm versus Manifests

    I noted that I’m currently using Helm for external tools versus generating manifests (albeit from Helm charts, but still generating manifests) for internal applications. As long as you never actually run a helm install command, Argo will manage the application using manifests generated from the Helm chart. From what I have seen, though, if you have previously run helm install, that Helm release hangs around in the cluster, and its history doesn’t change as Argo applies new versions. So you can get into an odd state where helm list shows older versions than what is actually installed.

    When using a tool like ArgoCD, you want to let it manage your applications from beginning to end. It will keep your cluster cleaner. For the time being, I am defining external tools using Helm templates, but using Helmfile to expand my home-built applications into manifest files.

  • ISY and the magic network gnomes

    For nearly 2 years, I struggled mightily with communication issues between my ISY 994i and some of my Docker containers and servers. So much, in fact, that I had a fairly long running post in the Universal Devices forums dedicated to the topic.

    I figure it is worth a bit of a rehash here, if only to raise the issue in the hopes that some of my more network-experienced contacts can suggest a fix.

    The Beginning

    The initial post was essentially about my ASP.NET Core API (.NET Core 2.2 at the time) not being able to communicate with the ISY’s REST API. You can read through the initial post for details, but, basically, it would hit it once, then time out on subsequent requests.

    It would seem that some time between my original post and the administrator’s reply, I set the container’s networking to host and the problem went away.

    In retrospect, I had not been heavily using that API anyway, so it may have just been hidden a bit better by the host network. In any case, I ignored it for a year.

    The Return

    About twenty (that’s right, 20) months later, I started moving my stuff to Kubernetes, and the issue reared its ugly head. I spent a lot of time trying to get some debug information from the ISY, which only confused me more.

    As I dug more into when it was happening, it occurred to me that I could not reliably communicate with the ISY from any of the VMs on my HP Proliant server. Also, and more puzzling, I could not do a port 80 retrieval from the server itself to the ISY. Oddly, though, I was able to communicate with other hardware devices on the network (such as my MiLight Gateway) from the server and its VMs. Additionally, the ISY responded to pings, so it was reachable.

    Time for a new proxy

    Now, one of the VMs on my server was an Ubuntu VM that was serving as an NGINX reverse proxy. For various reasons, I wanted to move that from a virtual machine to a physical box. This seemed like a good time to see if a new proxy would lead to different results.

    I had an old Raspberry Pi 3B+ lying around, and that seemed like the perfect candidate for a standalone proxy. So I did a quick image of an SD card with Ubuntu 20, copied my Nginx configuration files from the VM to the Pi, and re-routed my firewall traffic to the proxy.

    Not only did that work, but it solved the issue of ISY connectivity. Routing traffic through the Pi, I am able to communicate with the ISY reliably from my server, all of my VMs, and other PCs on the network.

    But, why?

    Well, that is the million dollar question, and, frankly, I have no clue. Perhaps it has to do with the NIC teaming on the server, or some oddity in the network configuration on the server. But I burned way too many hours on it to want to dig more into it.

    You may be asking, why a hardware proxy? I liked the reliability and smaller footprint of a dedicated Raspberry Pi proxy, external to the server and any VMs. It made the networking diagram much simpler, as traffic now flows neatly from my gateway to the proxy and then to the target machine. It also allows me to control traffic to the server in a more granular fashion, rather than having ALL traffic pointed to a VM on the server, and then routed via proxy from there.

  • Lab time is always time well spent.

    The push to get a website out for my wife brought to light my neglect of my authentication site, both in function and style. As one of the ongoing projects at work has been centered around Identity Server, a refresher course would help my cause. So over the last week, I dove into my Identity Server to get a better feel for the function and to push some much needed updates.

    Function First

    I had been running Identity Server 4 in my local environments for a while. I wanted a single point of authentication for my home lab, and since Identity Server has been in use at my company for a few years now, it made sense to continue that usage in my lab environment.

    I spent some time over the last few years expanding the Quickstart UI to allow for visual management of the Identity Server. This includes utilizing the Entity Framework integration for the configuration and operational stores and the ASP.NET Identity integration for user management. The system works exactly as expected: I can lock down APIs or UIs under a single authentication service, and I can standardize my authentication and authorization code. It has resulted in way more time for fiddling with new stuff.

    The upgrade from Identity Server 4 to Identity Server 5 (now under Duende’s Identity Server) was very easy, thanks to a detailed upgrade guide from the folks at Duende. And, truthfully, I left it at that for a few months. Then I got the bug to use Google as an external identity provider. After all, what would be nicer than to not even have to worry about remembering account passwords for my internal server?

    When I pulled the latest quick start for Duende, though, a lot had changed. I spent at least a few hours moving changes in the authorization flow to my codebase. The journey was worth it, however: I now have a working authentication provider which is able to use Google as an identity provider.

    Style next

    As I was messing around with the functional side, I got pretty tired of the old CoreUI theme I had applied some years ago. I went looking for a Bootstrap 5 compatible theme, and found one in PlainAdmin. It is simple enough for my identity server, but a little more crisp than the old CoreUI.

    As is the case with many a software project, the functional changes took a few hours, the visual changes, quite a bit longer. I had to shuffle the layout to better conform with PlainAdmin’s templates, and I had a hell of a time stripping out old scss to make way for the new scss. I am quite sure my app is still bloated with unnecessary styles, but cleaning that up will have to wait for another day.

    And finally a logo…

    What’s a refresh without a new logo?

    A little background: spyder007, in some form or another, has been my online nickname/handle since high school. It is far from unique, but it is a digital identity I still use. So Spydersoft seemed like a natural name for private projects.

    I started with a spider logo that I found with a good Creative Commons license. But I wanted something a little more, well, plain. So I did some searching and found a “web” design that I liked with a CC license. I edited it to be square, which had the unintended effect of making the web look like a cube… and I liked it.

    The new Spydersoft logo

    So with about a week’s worth of hobby-time tinkering, I was able to add Google as an identity provider and re-style my authentication site. As I’m the only person that sees that site, it may seem a bit much, but the lessons learned, particularly around the identity server functional pieces, have already translated into valuable insights for the current project at work.

  • Packer.io: Making excess too easy

    As I was chatting with a former colleague the other day, I realized that I have been doing some pretty diverse work as part of my home lab. A quick scan of my posts in this category reveals a host of topics ranging from home automation to Python monitoring to Kubernetes administration. One of my colleague’s questions was something to the effect of “How do you have time to do all of this?”

    As I thought about it for a minute, I realized that all of my Kubernetes research would not have been possible if I had not first taken the opportunity to automate the process of provisioning Hyper-V VMs. In my Kubernetes experimentation, I have easily provisioned 35-40 Ubuntu VMs, and then promptly broken two-thirds of them through experimentation. Taking the time to install and provision Ubuntu by hand before I could start work, well, that would have been a non-starter.

    It started with a build…

    In my corporate career, we have been moving more towards Azure DevOps and away from TeamCity. To date, I am impressed with Azure DevOps. Pipelines-as-code appeals to my inner geek, and not having to maintain a server and build agents has its perks. I had visions of migrating from TeamCity to Azure DevOps, hoping I could take advantage of Microsoft’s generosity with small teams. Alas, Azure DevOps is free for small teams ONLY if you self-host your build agents, which meant a small amount of machine maintenance. I wanted to be able to self-host agents with the same software that Microsoft uses for their Github Actions/Azure DevOps agents. After reading through the Github Virtual Environments repository, I determined it was time to learn Packer.

    The build agents for Github/Azure DevOps are provisioned using Packer. My initial hope was that I would just be able to clone that repository, run Packer, and voilà! It’s never that easy. The Packer projects in that repository are designed to provision VM images that run in Azure, not on Hyper-V. Provisioning Hyper-V machines is possible through Packer, but requires different template files and some tweaking of the provisioning scripts. Without getting too much into the weeds, Packer uses different builders for Azure and Hyper-V. So I had to grab all the provisioning scripts I wanted from the template files in the Virtual Environments repository, but configure a builder for Hyper-V. Thankfully, Nick Charlton provided a great starting point for automating Ubuntu 20.04 installs with Packer. From there, I was off to the races.
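
    To give a sense of the shape of it, a Hyper-V builder definition in Packer’s HCL looks something like the sketch below. The values are illustrative (not lifted from my repository), and the autoinstall boot_command is omitted for brevity.

    source "hyperv-iso" "ubuntu" {
      iso_url          = "https://releases.ubuntu.com/20.04/ubuntu-20.04.6-live-server-amd64.iso"
      iso_checksum     = "file:https://releases.ubuntu.com/20.04/SHA256SUMS"
      generation       = 2
      switch_name      = "External"          # Hyper-V virtual switch to attach to
      cpus             = 2
      memory           = 4096
      disk_size        = 65536
      http_directory   = "http"              # serves the cloud-init/autoinstall files
      ssh_username     = "ubuntu"
      ssh_password     = "ubuntu"
      ssh_timeout      = "30m"
      shutdown_command = "sudo shutdown -P now"
      # boot_command would go here to kick off the unattended install
    }

    build {
      sources = ["source.hyperv-iso.ubuntu"]

      provisioner "shell" {
        scripts = ["scripts/base.sh"]        # provisioning scripts borrowed from virtual-environments
      }
    }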

    Enabling my excess

    Through probably 40 hours of trial and error, I got to the point where I was building my own build agents and hooking them up to my Azure DevOps account. It should be noted that fully provisioning a build agent takes six to eight hours, so most of that 40 hours was “fire and forget.” With that success, I started to think: “Could I provision simpler Ubuntu servers and use those to experiment with Kubernetes?”

    The answer, in short, is “Of course!” I went about creating some Powershell scripts and Packer templates so that I could provision various levels of Ubuntu servers. I have shared those scripts, along with my build agent provisioning scripts, in my provisioning-projects repository on Github. With those scripts, I was off to the races, provisioning new machines at will. It is remarkable the risks you will take in a lab environment, knowing that you are only 20-30 minutes away from a clean machine should you mess something up.

    A note on IP management

    If you dig into the repository above, you may notice some scripts and code around provisioning a MAC address from a “Unifi IP Manager.” I created a small API wrapper that utilizes the Unifi Controller APIs to create clients with fixed IP addresses. The API generates a random, but valid, MAC Address for Hyper-V, then uses the Unifi Controller API to assign a fixed IP.

    That project isn’t quite ready for public consumption, but if you are interested, drop a comment on this post.

  • Home Lab: Disaster Recovery and time for an Upgrade!

    I had the opportunity to travel to the U.S. Virgin Islands last week on a “COVID delayed” honeymoon. I have absolutely no complaints: we had amazing weather, explored beautiful beaches, and got a chance to snorkel and scuba in some of the clearest water I have seen outside of a pool.

    Trunk Bay, St. John, US VI

    While I was relaxing on the beach, Hurricane Ida was wrecking the Gulf and dropping rain across the East, including at home. This led to power outages, which led to my home server having something of a meltdown. And in this, I learned the value of good neighbors who can reboot my server and the cost of not setting up proper disaster recovery in Kubernetes.

    The Fall

    I was relaxing in my hotel room when I got text messages from my monitoring solution. And, at first, I figured “The power went out, things will come back up in 30 minutes or so.” But, after about an hour, nothing. So I texted my neighbor and asked if he could reset the server. After he reset it, most of my sites came back up, with the exception of some of my SQL-dependent sites. I’ve had some issues with SQL Server instances not starting their service correctly, so some sites were down… but that’s for another day.

    A few days later, I got the same monitoring alerts. My parents were house-sitting, so I had my mom reset the server. Again, most of my sites came up. Being in America’s Caribbean Paradise, I promptly forgot all about it, figuring things were fine.

    The Realization

    Boy, was I wrong. When I sat down at the computer on Sunday, I randomly checked my Rancher instance. Dead. My other clusters were running, but all of the clusters were reporting issues with etcd on one of the nodes. And, quite frankly, I was at a loss. Why?

    • While I have taken some Pluralsight courses on Kubernetes, I was a little overly dependent on the Rancher UI to manage Kubernetes. With it down, I was struggling a bit.
    • I did not take the time to find and read the etcd troubleshooting steps for RKE. Looking back, I could most likely have restored the etcd snapshots and been ok. Live and learn, as it were.

    Long story short, my hacking attempts to get things running again pretty much killed my Rancher cluster and made a mess of my internal cluster. Thankfully, the stage and production clusters were still running, but with a bad etcd node.

    The Rebuild

    At this point, I decided to throw in the towel and rebuild the Rancher cluster. So I deleted the existing rancher machines and provisioned a new cluster. Before I installed Rancher, though, I realized that my old clusters might still try to connect to the new Rancher instance and cause more issues. I took the time to remove the Rancher components from the stage and production clusters using the documented scripts.

    When I did this with my internal tools cluster… well, it made me realize there was a lot of unnecessary junk on that cluster. It was only running ELK (which was not even fully configured) and my Unifi Controller, which I moved to my production box. So, since I was already rebuilding, I decided to decommission my internal tools box and rebuild that as well.

    With two brand new clusters and two RKE clusters clean of Rancher components, I installed Rancher and got all the management running.

    The Lesson

    From this little meltdown and reconstruction I have learned a few important lessons.

    • Save your etcd snapshots off machine – RKE takes snapshots of your etcd regularly, and there is a process for restoring from a snapshot. Had I known that those snapshots were there, I would have tried this before killing my cluster. (A sketch of shipping snapshots to S3-compatible storage follows this list.)
    • etcd is disk-heavy and “touchy” when it comes to latency – My current setup has my hypervisor using my Synology as an iSCSI disk for all my VMs. With 12 VMs running as Kubernetes nodes, any network disruption or I/O lag can cause leader changes. This is minimal during normal operation, but performing deployments or updates can sometimes cause issues. I have a small upgrade planned for the Synology to add a 1 TB SSD read/write cache, which will hopefully improve things, but I may end up creating a new subnet for iSCSI traffic to alleviate network hiccups.
    • Slow and steady wins the race – In my haste to get everything working again, I tried some things that did more harm than good. Had I done a bit more research and found the appropriate articles earlier, I probably could have recovered without rebuilding.
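
    On the first point, RKE can ship those recurring snapshots to S3-compatible storage directly from cluster.yml; a rough sketch (endpoint, bucket, and credentials are placeholders) looks like this:

    services:
      etcd:
        backup_config:
          interval_hours: 12
          retention: 6
          s3backupconfig:
            endpoint: minio.example.lan:9000
            bucket_name: rke-etcd-snapshots
            access_key: <access key>
            secret_key: <secret key>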