• Kubernetes Observability, Part 4 – Using Linkerd for Service Observability

    This post is part of a series on observability in Kubernetes clusters:

    As we start to look at traffic within our Kubernetes clusters, the notion of adding a service mesh crept into our discussions. I will not pretend to be an expert in service meshes, but the folks at Buoyant (the creators of Linkerd) have done a pretty good job of explaining service meshes for engineers.

    Installing Linkerd was more an exercise in “can I do it” than a response to an actual need for a service mesh. However, Linkerd’s architecture lets me install it in the cluster but activate it only on the services that need it. This is accomplished via pod annotations, which makes the system very configurable.
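
    For reference, opting a workload into the mesh looks roughly like the sketch below. The linkerd.io/inject annotation is the real mechanism; the workload name and image are placeholders.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-app                  # placeholder workload
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: example-app
      template:
        metadata:
          labels:
            app: example-app
          annotations:
            linkerd.io/inject: enabled   # ask Linkerd to inject its sidecar proxy
        spec:
          containers:
            - name: example-app
              image: nginx:1.25          # placeholder image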

    Installing Linkerd

    With my ArgoCD setup, adding Linkerd as a cluster tool was pretty simple: I added the chart definition to the repository, then added a corresponding ApplicationSet definition. The ApplicationSet defined a cluster generator with a label match, meaning Linkerd would only be installed to clusters where I added spydersoft.io/linkerd=true as a label on the ArgoCD cluster secret.
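
    The ApplicationSet itself looks something like the sketch below; the repository URL and chart path are placeholders, but the cluster generator and label selector are the pieces doing the work.

    apiVersion: argoproj.io/v1alpha1
    kind: ApplicationSet
    metadata:
      name: linkerd
      namespace: argocd
    spec:
      generators:
        - clusters:
            selector:
              matchLabels:
                spydersoft.io/linkerd: "true"   # only clusters labeled for Linkerd
      template:
        metadata:
          name: 'linkerd-{{name}}'
        spec:
          project: default
          source:
            repoURL: https://github.com/example/argo-config   # placeholder repository
            targetRevision: HEAD
            path: charts/linkerd                              # placeholder chart path
          destination:
            server: '{{server}}'
            namespace: linkerd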

    The most troublesome part of the installation process was figuring out how to manage Linkerd via GitOps. The folks at Linkerd, however, have a LOT of guides to help. You can review my chart definition for my installation method; it was built from the following Linkerd articles:

    Many kudos to the Linkerd team, as their documentation was thorough and easy to follow.

    Adding Linkerd-viz

    Linkerd-viz is an add-on to Linkerd with its own Helm chart, so I manage it as a separate cluster tool. The visualization add-on includes a dashboard that can be exposed via ingress to provide an overview of Linkerd and the metrics it collects. In my case, I tried to expose Linkerd-viz on a subpath (using my cluster’s internal domain name as the host). I ran into some issues (more on that below), but overall it works well.
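
    For context, exposing the dashboard on a subpath looks roughly like the sketch below. The host is a placeholder, and the dashboard service name (web) and port (8084) are assumptions on my part rather than values lifted from my actual configuration.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: linkerd-viz
      namespace: linkerd-viz
    spec:
      ingressClassName: nginx
      rules:
        - host: tools.internal.example.com      # placeholder internal host
          http:
            paths:
              - path: /viz                      # the subpath that caused trouble
                pathType: Prefix
                backend:
                  service:
                    name: web                   # linkerd-viz dashboard service (assumed)
                    port:
                      number: 8084              # assumed dashboard port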

    I broke it…

    As I started adding podAnnotations to inject Linkerd into my pods, things seemed to be “just working.” I even decorated my Nginx ingress controllers following the Linkerd guide, which meant traffic within my cluster was all going through Linkerd. This seemed to work well until I tried to access my installation of Home Assistant. I spent a good while trying to debug, but as soon as I removed the pod annotations from Nginx, Home Assistant started working. While I am sure there is a way to fix that, I have not had much time to devote to the home lab recently, so it is on my to-do list.

    I also noticed that the Linkerd-viz dashboard does not at all like being hosted at a non-root URL. This has been documented as a bug in Linkerd, but it is currently marked with the “help wanted” tag, so I am not expecting it to be fixed anytime soon. However, that bug identifies an ingress configuration snippet that can be added to the ingress definition to provide some basic rewrite functionality. It is a dirty workaround and does not fix everything, but it is serviceable.

    Benefits?

    For the pods that I have marked up, I can glance at the network traffic and latency between the services. I have started to create Grafana dashboards in my external instance to pull those metrics into easy-to-read graphs for network performance.

    I have a lot more learning to do when it comes to Linkerd. While it is installed and running, I am certainly not using it for any heavy tasking. I hope to make some more time to investigate, but for now, I have some additional network metrics that help me understand what is going on in my clusters.

  • Kubernetes Observability, Part 3 – Dashboards with Grafana

    This post is part of a series on observability in Kubernetes clusters:

    What good is Loki’s log collection or Prometheus’ metrics scraping without a way to see it all? Loki and Grafana both come from Grafana Labs, which is building its tools, alongside Prometheus, into an observability stack similar to Elastic’s. I have used both stacks, and Grafana’s is much easier to get started with than Elastic. For my home lab, it is perfect: simple start and config, fairly easy monitoring, and no real administrative work to do.

    Installing Grafana

    The Helm chart provided for Grafana was very easy to use, and the “out-of-the-box” configuration was sufficient to get started. I configured a few more features to get the most out of my instance (a condensed values sketch follows the list):

    • Using an existing secret for admin credentials: this secret is created by an ExternalSecret resource that pulls secrets from my local Hashicorp Vault instance.
    • Configuration for Azure AD users. Grafana’s documentation details additional actions that need to be done in your Azure AD instance. Note that envFromSecret names the Kubernetes Secret that gets expanded into the environment, storing my Azure AD Client ID and Client Secret.
    • Added the grafana-piechart-panel plugin, as some of the dashboards I downloaded referenced that.
    • Enabled a Prometheus ServiceMonitor resource to scrape Grafana metrics.
    • Annotated the services and pods for Linkerd.
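
    Here is a condensed sketch of what those values look like in the Grafana chart. The key names follow the upstream chart as I remember it, but the secret names are placeholders, and the Azure AD client ID and secret are assumed to arrive as GF_AUTH_AZUREAD_* environment variables via envFromSecret.

    admin:
      existingSecret: grafana-admin-credentials   # placeholder; created by an ExternalSecret

    # Secret holding GF_AUTH_AZUREAD_CLIENT_ID / GF_AUTH_AZUREAD_CLIENT_SECRET,
    # which Grafana maps onto the auth.azuread settings below.
    envFromSecret: grafana-azuread-oauth

    grafana.ini:
      auth.azuread:
        enabled: true
        scopes: openid email profile
        # client_id / client_secret come from the environment variables above

    plugins:
      - grafana-piechart-panel

    serviceMonitor:
      enabled: true                               # let Prometheus scrape Grafana itself

    podAnnotations:
      linkerd.io/inject: enabled                  # mesh the Grafana pods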

    Adding Data Sources

    With Grafana up and running, I added five data sources: each of my clusters has its own Prometheus instance, and I added my Loki instance for logs. Eventually, Thanos will aggregate my Prometheus metrics into a single data source, but that is a topic for another day.
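
    For illustration, the Grafana chart can also provision data sources declaratively; the sketch below uses placeholder URLs for the per-cluster Prometheus instances and the shared Loki, and the result is the same whether you add them this way or through the UI.

    datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
          - name: Prometheus - internal            # one entry per cluster
            type: prometheus
            url: http://prometheus.internal.example.com:9090
          - name: Prometheus - nonprod
            type: prometheus
            url: http://prometheus.nonprod.example.com:9090
          - name: Loki
            type: loki
            url: http://loki.internal.example.com:3100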

    Exploring

    The Grafana interface is fairly intuitive. The Explore section lets you poke around your data sources to review the data you are collecting. There are “builders” available to help you construct your PromQL (Prometheus Query Language) or LogQL (Log Query Language) queries. Querying metrics automatically displays a line chart of your values, making it pretty easy to review your query results and prepare the query for inclusion in a dashboard.

    When debugging my applications, I use the Explore section almost exclusively to review incoming logs. A live log view and search with context makes it very easy to find warnings and errors within the log entries and determine issues.

    Building Dashboards

    In my limited use of the ELK stack, one thing that always got me was the barrier to entry with Kibana. I have always found that having examples to tinker with is a much easier way for me to learn, and I could never find good examples of Kibana dashboards that I could make my own.

    Grafana, however, has a pretty extensive list of community-built dashboards from which I could begin my learning. I started with some of the basics, like Cluster Monitoring for Kubernetes. And, well, even in that one, I ran into some issues. The current community version uses kubernetes_io_hostname as a label for the node name. It would seem that label has changed to node in kube-state-metrics, so I had to import the dashboard and change its queries in order for the data to show.

    I found a few other dashboards that illustrated what I could do with Grafana:

    • ArgoCD – If you add a ServiceMonitor to ArgoCD, you can collect a number of metrics around application status and synchronizations, and this dashboard gives a great view of how ArgoCD is running.
    • Unifi Poller Dashboards – Unifi Poller is an application that polls the Unifi controller and exposes the resulting metrics for Prometheus to scrape. It includes dashboards for a number of different metrics.
    • NGINX Ingress – If you use NGINX for your Ingress controller, you can add a ServiceMonitor to scrape metrics on incoming network traffic.

    With these examples in hand, I have started to build out my own dashboards for some of my internal applications.

  • Kubernetes Observability, Part 2 – Collecting Metrics with Prometheus

    This post is part of a series on observability in Kubernetes clusters:

    Part 1 – Collecting Logs with Loki
    Part 2 – Collecting Metrics with Prometheus (this post)
    Part 3 – Building Grafana Dashboards
    Part 4 – Using Linkerd for Service Observability
    Part 5 – Using Mimir for long-term metric storage

    “Prometheus” appears in many Kubernetes blogs the same way that people whisper a famous person’s name as they enter a crowded room. Throughout a lot of my initial research, particularly with the k8s-at-home project, I kept seeing references to Prometheus in various Helm charts. Not wanting to distract myself, I usually just made sure it was disabled and skipped over it.

    As I found the time to invest in Kubernetes observability, gathering and viewing metrics became a requirement. So I did a bit of a deep dive into Prometheus and how I could use it to view metrics.

    The Short, Short Version

    By no means am I going to regurgitate the documentation around Prometheus. There is a lot to ingest around settings and configuration, and I have not even begun to scratch the surface of what I can do with Prometheus. My simple goal was to get Prometheus installed and running on each of my clusters, gathering cluster metrics. I did this by installing Prometheus on each cluster using Bitnami’s kube-prometheus Helm chart. Check out the cluster-tools folder in my Argo repository for an example.

    The Bitnami chart includes kube-state-metrics and node-exporter, in addition to settings for Thanos as a sidecar. So, with the kube-prometheus install up and running, I was able to create a new data source in Grafana and view the collected metrics. These metrics were fairly basic, consisting mainly of the metrics being reported by kube-state-metrics.
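
    For illustration, wiring up that sidecar in the Bitnami chart comes down to a couple of values. The keys below are from memory of the chart and the Bitnami multi-cluster tutorial, so treat this as a sketch rather than a drop-in values file; the cluster label is a placeholder.

    prometheus:
      externalLabels:
        cluster: internal          # placeholder; lets Thanos tell the clusters apart
      thanos:
        create: true               # run the Thanos sidecar alongside Prometheus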

    Enabling ServiceMonitors

    Many of the charts I use to install cluster tools come with values sections that enable metrics collection via a ServiceMonitor resource (a CRD provided by the Prometheus operator). A ServiceMonitor instance instructs Prometheus how to discover and scrape services within the cluster.
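
    A minimal ServiceMonitor looks like the sketch below; the labels, namespace, and port name are placeholders for whatever the target Service actually exposes.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: example-app
      namespace: monitoring
    spec:
      namespaceSelector:
        matchNames:
          - example                # namespace of the target Service
      selector:
        matchLabels:
          app.kubernetes.io/name: example-app
      endpoints:
        - port: metrics            # named port on the Service
          interval: 30s
          path: /metrics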

    For example, I use the following Helm charts with ServiceMonitor definitions:

    So, in order to get metrics on these applications, I simply edited my values.yaml file and enabled the creation of a ServiceMonitor resource for each.
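
    The exact key varies from chart to chart, but it is usually some variation on the following (these names are illustrative, not specific to any one chart):

    metrics:
      enabled: true
      serviceMonitor:
        enabled: true              # have the chart create the ServiceMonitor resource
        interval: 30s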

    Prometheus as a piece of the puzzle

    I know I skipped a LOT. In a single cluster environment, installing Prometheus should be as simple as choosing your preferred installation method and getting started scraping. As always, my multi-cluster home lab presents some questions that mirror questions I get at work. In this case, how do we scale out to manage multiple clusters?

    Prometheus allows for federation, meaning one Prometheus server can scrape other Prometheus servers to gather metrics. Through a lot of web searching, I came across an article from Vikram Vaswani and Juan Ariza centered on creating a multi-cluster monitoring dashboard using Prometheus, Grafana, and Thanos. The described solution is close to what I would like to see in my home lab. While I will touch on Grafana and Thanos in later articles, the key piece of this article was installing Prometheus in each cluster and, as described, creating a Thanos sidecar to aid in the eventual implementation of Thanos.

    Why Bitnami?

    I have something of a love/hate relationship with Bitnami. The VMware-owned property is a collection of various images and Helm charts meant to be the “app store” for your Kubernetes cluster. Well, more specifically, it is meant to be the app store for VMware Tanzu environments, but the Helm charts are published so that others can use them.

    Generally, I prefer using “official” charts for applications. This usually ensures that the versions are the most current, and the chart is typically free from the “bloat” that can sometimes happen in Bitnami charts, where they package additional sub-charts to make things easy.

    That is not to say that I do not use Bitnami charts at all. My WordPress installation uses the Bitnami chart, and it serves its purpose. However, being community-driven, the versions can lag a bit. I know the 6.x release of WordPress took a few weeks to make it from WordPress to Bitnami.

    For Prometheus, there are a few community-driven charts, but you may notice that Helm is not in Prometheus’ own list of installation methods. This, coupled with the desire to implement Thanos later, led me to the kube-prometheus chart by Bitnami. You may fare better with one of the Prometheus Community charts, but for now I am sticking with the Bitnami chart.

  • Kubernetes Observability, Part 1 – Collecting Logs with Loki

    This post is part of a series on observability in Kubernetes clusters:

    I have been spending an inordinate amount of time wrapping my head around Kubernetes observability for my home lab. Rather than consolidate all of this into a single post, I am going to break up my learnings into bite-sized chunks. We’ll start with collecting cluster logs.

    The Problem

    Good containers generate a lot of logs. Outside of getting into the containers via kubectl, logging is the primary mechanism for identifying what is happening within a particular container. We need a way to collect the logs from various containers and consolidate them in a single place.

    My goal was to find a log aggregation solution that gives me insights into all the logs in the cluster, without needing special instrumentation.

    The Candidates – ELK versus Loki

    For a while now, I have been running an ELK (Elasticsearch/Logstash/Kibana) stack locally. My hobby projects utilize an Elasticsearch sink for Serilog to ship logs directly from those containers to Elasticsearch. I found that I could install Filebeat into the cluster and ship all container logs to Elasticsearch, which allowed me to gather the logs across containers.

    ELK

    Elasticsearch is a beast. Its capabilities as a document and index solution are quite impressive. But those capabilities make it really heavy for what I wanted, which was “a way to view logs across containers.”

    For a while, I have been running an ELK instance on my internal tools cluster. It has served its purpose: I am able to report logs from Nginx via Filebeat, and my home containers are built with a Serilog sink that reports logs directly to Elasticsearch. I recently discovered how to install Filebeat onto my K8s clusters, which allows it to pull logs from the containers. This, however, exploded my Elastic instance.

    Full disclosure: I’m no Elasticsearch administrator. Perhaps, with proper experience, I could make that work. But Elastic felt heavy, and I didn’t feel like I was getting value out of the data I was collecting.

    Loki

    A few of my colleagues found Grafana Loki as a potential solution for log aggregation. I attempted an installation to compare the solutions.

    Loki is a log aggregation system that provides log storage and querying. It is not limited to Kubernetes: there are a number of official clients for sending logs, as well as some unofficial third-party ones. Loki stores your incoming logs (see Storage below), creates indices on some of the log metadata, and provides a custom query language (LogQL) to let you explore your logs. Loki integrates with Grafana for visual log exploration, LogCLI for command-line searches, and Prometheus Alertmanager for routing alerts based on logs.

    One of the clients, Promtail, can be installed on a cluster to scrape logs and report them back to Loki. My colleague suggested a Loki instance on each cluster. I found a few discussions in Grafana’s GitHub issues around this, but it led to a pivotal question.

    Loki per Cluster or One Loki for All?

    I laughed a little as I typed this, because the notion of “multiple Lokis” is explored in a decidedly different context in the Marvel series. My options were less exciting: do I have one instance of Loki that collects data from clients across the clusters, or do I allow each cluster to have its own instance of Loki and use Grafana to attach to multiple data sources?

    Why consider Loki on every cluster?

    Advantages

    Decreased network chatter – If every cluster has a local Loki instance, then logs for that cluster do not have far to go, which means minimal external network traffic.

    Localized Logs – Each cluster would be responsible for storing its own log information, so finding logs for a particular cluster is as simple as going to the cluster itself.

    Disadvantages

    Non-centralized – There is no way to query logs across clusters without some additional aggregator (like another Loki instance), which would mean duplicative data storage.

    Maintenance Overhead – Each cluster’s Loki instance must be managed individually. This can be automated using ArgoCD and the cluster generator, but it still means that every cluster has to run Loki.

    Final Decision?

    For my home lab, Loki fits the bill. The installation was easy-ish (if you are familiar with Helm and not afraid of community forums), and once configured, it gave me the flexibility I needed with easy, declarative maintenance. But, which deployment method?

    Honestly, I was leaning a bit towards the individual Loki instances. So much so that I configured Loki as a cluster tool and deployed it to all of my clusters. And that worked, although swapping around Grafana data sources for various logs started to get tedious. And when I thought about where I should report logs for other systems (like my Raspberry Pi-based Nginx proxy), doubts crept in.

    Thinking about an ELK stack, I certainly would not put an instance of Elasticsearch on every cluster. While Loki is a little lighter than Elasticsearch, it is still heavy enough to be worthy of a single, properly configured instance. So I removed the cluster-specific Loki instances and configured one shared instance.

    With promtail deployed via an ArgoCD ApplicationSet with a cluster generator, I was off to the races.
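
    For reference, pointing every cluster’s Promtail at the one shared Loki is a single value in the Promtail chart; the URL below is a placeholder for my internal instance.

    config:
      clients:
        - url: http://loki.internal.example.com:3100/loki/api/v1/push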

    A Note on Storage

    Loki has a few storage options, with the majority being cloud-based. At work, with storage options in both Azure and GCP, this is a non-issue. But for my home lab, well, I didn’t want to burn cash storing logs when I have a perfectly good SAN sitting at home.

    My solution there was to stand up an instance of MinIO to act as S3 storage for Loki. Now, could I have run MinIO on Kubernetes? Sure. But, in all honesty, it got pretty confusing pretty quickly, and I was more interested in getting Loki running. So I spun up a Hyper-V machine with Ubuntu 22.04 and started running MinIO. Maybe one day I’ll work on getting MinIO running on one of my K8s clusters, but for now, the single machine works just fine.
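
    For illustration, pointing Loki at MinIO is mostly a storage block in its configuration. The key names below follow Loki’s S3 storage settings as I recall them, and the endpoint, bucket, and credentials are placeholders.

    common:
      storage:
        s3:
          endpoint: minio.internal.example.com:9000   # the MinIO VM
          bucketnames: loki
          access_key_id: <minio-access-key>           # placeholder credentials
          secret_access_key: <minio-secret-key>
          s3forcepathstyle: true                      # path-style requests for MinIO
          insecure: true                              # plain HTTP inside the lab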

  • An Impromptu Home Lab Disaster Recovery Session

    It has been a rough 90 days for my home lab. We have had a few unexpected power outages which took everything down, and, after those unexpected outages, things came back up. Over the weekend, I was doing some electrical work outside, wiring up outlets and lighting. Being safety conscious, I went inside and killed the breaker I was tying into, not realizing it was the same breaker the server was on. My internal dialog went something like this:

    • “Turn off breaker Basement 2”
    • ** clicks breaker **
    • ** Hears abrupt stop of server fans **
    • Expletive….

    When trying to recover from that last sequence, I ran into a number of issues.

    • I’m CRUSHING that server when it comes back up: having 20 VMs attempting to start simultaneously is causing a lot of resource contention.
    • I had to run fsck manually on a few machines to get them back up and running.
    • Even after getting the machines running, ETCD was broken on two of my four clusters.

    Fixing my Resource Contention Issues

    I should have done this from the start, but all of my VMs had their Automatic Start Action set to Restart if previously running. That’s great in theory but, in practice, starting 20 or so VMs simultaneously on the same hypervisor is not recommended.

    Part of Hyper-V’s Automatic Start Action panel is an Automatic Startup Delay. In PowerShell, it is the AutomaticStartDelay property on the VirtualMachine object (what’s returned from a Get-VM call). My ultimate goal is to set that property to stagger-start my VMs. I could have done that manually and been done in a few minutes. But how do I manage that when I spin up new machines? And can I store some information on the VM itself so I can recalculate that value as I play around with how long each VM needs to start up?

    Groups and Offsets

    All of my VMs can be grouped based on importance. And it would have been easy enough to start 2-3 VMs in group 1, wait a few minutes, then do group 2, etc. But I wanted to be able to assign offsets within the groups to better address contention. In an ideal world, the machines would come up sequentially to a point, and then 2 or 3 at a time after the main VMs have started. So I created a very simple JSON object to track this:

    {
      "startGroup": 1,
      "delayOffset": 120
    }

    There is a free-text Notes field on the VirtualMachine object, so I used that to set a startGroup and delayOffset for each of my VMs. Using a string of PowerShell commands, I was able to get a tabular output of my custom properties:

    Get-VM |
        Select Name, State, AutomaticStartDelay,
            @{n='ASDMin'; e={ $_.AutomaticStartDelay / 60 }},
            @{n='startGroup'; e={ (ConvertFrom-Json $_.Notes).startGroup }},
            @{n='delayOffset'; e={ (ConvertFrom-Json $_.Notes).delayOffset }} |
        Sort-Object AutomaticStartDelay |
        Format-Table
    • Get-VM – Get a list of all the VMs on the machine
    • Select Name, ... – The Select statement (an alias for Select-Object) pulls values from the object. There are two calculated properties that pull values from the Notes field as a JSON object.
    • Sort-Object – Sort the list by the AutomaticStartDelay property
    • Format-Table – Format the response as a table.

    At that point, each VM had its startGroup and delayOffset, but how can I set the AutomaticStartDelay based on those? More PowerShell!

    Get-VM |
        Select Name, State, AutomaticStartDelay,
            @{n='startGroup'; e={ (ConvertFrom-Json $_.Notes).startGroup }},
            @{n='delayOffset'; e={ (ConvertFrom-Json $_.Notes).delayOffset }} |
        ? {$_.startGroup -gt 0} |
        % { Set-VM -Name $_.Name -AutomaticStartDelay ((($_.startGroup - 1) * 480) + $_.delayOffset) }

    The first two commands are the same as the above, but after that:

    • ? {$_.startGroup -gt 0} – Use Where-Object (? alias) to select VMs with a startGroup value
    • % { Set-VM -Name ... } – Use ForEach-Object (the % alias) to set the AutomaticStartDelay for each VM in that group.

    In the command above, I hard-coded the AutomaticStartDelay to the following formula:

    ((startGroup - 1) * 480) + delayOffset

    With this formula, each group starts 480 seconds (8 minutes) after the previous one, plus whatever offset I choose within the group. As an example, my domain controllers carry the following values:

    # Primary DC
    {
      "startGroup": 1,
      "delayOffset": 0
    }
    # Secondary DC
    {
      "startGroup": 1,
      "delayOffset": 120
    }

    The calculated delays for my domain controllers are 0 and 120 seconds, respectively. The next group won’t start until the 480-second (8-minute) mark, which gives my DCs 8 minutes on their own to boot up.

    Now, there will most likely be some tuning involved in this process, which is where my complexity becomes helpful: say I can boot 2-3 machines every 3 minutes… I can just re-run the population command with a new formula.

    Did I over-engineer this? Probably. But the point is, use AutomaticStartDelay if you are running a lot of VMs on a Hypervisor.

    Restoring ETCD

    Call it fate, but that last power outage ended up causing ETCD issues. I had to run fsck manually on a few of my servers to repair their file systems, and even after the servers were up and running, two of my clusters had problems with their ETCD services.

    In the past, my solution to this was “nuke the cluster and rebuild,” but I am trying to be a better Kubernetes administrator, so this time, I took the opportunity to actually read the troubleshooting documentation that Rancher provides.

    Unfortunately, I could not get past “step one”: ETCD was not running. Knowing that it was most likely corruption of some kind, and that I had a relatively up-to-date ETCD snapshot, I did not burn too much time before going to the restore.

    rke etcd snapshot-restore --name snapshot_name_etcd --config cluster.yml

    That command worked like a charm, and my clusters were back up and running.

    To Do List

    I have a few things on my to-do list following this adventure:

    1. Move ETCD snapshots off of the VMs and onto the SAN. I would have had a lot of trouble bringing ETCD back up if those snapshots were not available because the node they were on went down.
    2. Update my Packer provisioning scripts to include writing the startup configuration to the VM notes.
    3. Build an API wrapper that I can run on the server to manage the notes field.

    I am somewhat interested in testing how the AutomaticStartDelay changes will affect my server boot time. However, I am planning on doing that on a weekend morning during a planned maintenance, not on a random Thursday afternoon.

  • Creating a simple Nginx-based web server image

    One of the hardest parts of blogging is identifying topics. I sometimes struggle with identifying things that I have done that would be interesting or helpful to others. In trying to establish a “rule of thumb” for such decisions, I think things that I have done at least twice qualify as potential topics. As it so happens, I have had to construct simple web server containers twice in the last few weeks.

    The Problem

    Very simply, I wanted to be able to build a quick and painless container to host some static web sites. They are mostly demo sites for some of the UI libraries that we have been building. One is raw HTML, the other is built using Storybook.js, but both end up being a set of HTML/CSS/JS files to be hosted.

    Requirements

    The requirements for this one are pretty easy:

    • Host a static website
    • Do not run as root

    There was no requirement to be able to change the content outside of the image: changes would be handled by building a new image.

    My Solution

    I have become generally familiar with Nginx for a variety of uses. It serves as a reverse proxy for my home lab and is my go-to ingress controller for Kubernetes. Since I am familiar with its configuration, I figured it would be a good place to start.

    Quick But Partial Success

    The “do not run as root” requirement led me to the Nginx unprivileged image. With that as a base, I tried something pretty quick and easy:

    # Dockerfile
    FROM nginxinc/nginx-unprivileged:1.20 as runtime

    COPY output/ /usr/share/nginx/html

    Where output contains the generated HTML files that I wanted to host.

    This worked great for the first page that loaded. However, links to other pages within the site kept coming back from Nginx with :8080 as the port. Our networking configuration offloads SSL outside of the cluster and uses ingress within the cluster, so I did not want any port forwarding at all.

    Custom Configuration Completes the Set

    At that point, I realized that I needed to configure Nginx to disable the port redirects and then include the new configuration in my container. So I traipsed through the documentation for the Nginx containers. As it turns out, the easiest way to configure these images is to replace the default.conf file in the /etc/nginx/conf.d folder.

    So I went about creating a new Nginx config file with the appropriate settings:

    server {
        listen 8080;
        server_name localhost;
        port_in_redirect off;

        location / {
            root /usr/share/nginx/html;
            index index.html index.htm;
        }

        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
            root /usr/share/nginx/html;
        }
    }

    From there, my Dockerfile changed only slightly:

    # Dockerfile
    FROM nginxinc/nginx-unprivileged:1.20 as runtime
    COPY nginx/default.conf /etc/nginx/conf.d/default.conf
    COPY output/ /usr/share/nginx/html

    Success!

    With those changes, the image built with the appropriate files and the links no longer had the port redirect. Additionally, my containers are not running as root, so I do not run afoul of our cluster’s policy management rules.

    Hope this helps!

  • Nginx Reverse proxy: A slash makes all the difference.

    I have been doing some work to build up some standard processes for Kubernetes. ArgoCD has become a big part of that, as it allows us to declaratively manage the state of our clusters.

    After recovering from a small blow-up in the home lab (post coming), I wanted to migrate my cluster tools to utilize the label selector feature of the ApplicationSet’s Cluster Generator. Why? It allows me to selectively manage tools in the cluster. After all, not every cluster needs the nfs-external-provisioner to provide a StorageClass for pod file storage.

    As part of this, I wanted to deploy tools to the local Argo cluster. In order to do that, the local cluster needs a secret. I tried to follow the instructions, but when I clicked to view the cluster details, I got a 404 error. I dug around the logs, and my request wasn’t even getting to the Argo application server container.

    When I looked at the Ingress controller logs, it showed the request looked something like this:

    my.url.com/api/v1/clusters/http://my.clusterurl.svc

    Obviously, that’s not correct. The call coming from the UI is this:

    my.url.com/api/v1/clusters/http%3A%2F%2Fmy.clusterurl.svc

    Notice the encoding: My Nginx reverse proxy (running on a Raspberry Pi outside my main server) was decoding the request before passing it along to the cluster.

    The question was, why? A quick Google search led right to the Nginx documentation:

    • If proxy_pass is specified with URI, when passing a request to the server, part of a normalized request URI matching the location is replaced by a URI specified in the directive
    • If proxy_pass is specified without URI, a request URI is passed to the server in the same form as sent by a client when processing an original request

    What? Essentially, it means that a trailing slash (or any URI) on the proxy_pass target dictates whether Nginx rewrites the request URI before passing it along.

    ## With a slash at the end, the client request is normalized
    location /name/ {
        proxy_pass http://127.0.0.1/remote/;
    }
    
    ## Without a slash, the request is as-is
    location /name/ {
        proxy_pass http://127.0.0.1;
    }

    After removing the slash, the Argo UI was able to load the cluster details correctly.

  • The cascading upgrade continues…

    I mentioned in a previous post that I made the jump to Windows 11 on both my work and home laptops. As it turns out, this is causing me to re-evaluate some of my other systems and upgrade them as needed.

    Old (really old) Firmware

    The HP ProLiant Gen8 server I have been running for a few years had a VERY old firmware version on it. When I say “very old,” I mean circa 2012. For all intents and purposes, it did its job. The few times I use it, however, are the most critical: i.e., the domain controller VMs won’t come up, and I need to use the remote console to log in to the server and restart something.

    This particular version of the iLO firmware worked best in Internet Explorer, particularly for the remote access portion. Additionally, I had never taken the time to create a proper SSL certificate for the iLO interface, which usually meant a few refreshes were required to get in.

    In Windows 11 and Edge, this just was not possible. The security settings prevented access over the invalid SSL connection. Additionally, remote access required a ClickOnce application running .NET Framework 3.5. So even if I got past the invalid SSL warning (which I did), the remote console would not work.

    Time for an upgrade?

    When I first set up this server in 2018, I vaguely remember looking for and not finding firmware updates for the iLO. Clearly I was mistaken: the Gen8 runs iLO 4, which has firmware updates as recent as April of 2022. After reading through the release notes and installation instructions, I felt pretty confident that this upgrade would solve my issue.

    The upgrade process was pretty easy: extract the .bin firmware from the installer, upload via the iLO interface, and wait a bit for the iLO to restart. At that point, I was up and running with a new and improved interface.

    Solving the SSL Issue

    The iLO generates a self-signed, but backdated, SSL certificate. You can customize it, but only by generating a CSR via the iLO interface, getting a certificate back, and importing that certificate into the iLO. I really did not want to go through the hassle of creating a certificate authority, or figure out how to use Let’s Encrypt to fulfill the CSR, so I took a slightly different path:

    1. Generate a self-signed root CA Certificate.
    2. Generate the CSR from the iLO interface.
    3. Sign the CSR with the self-signed root CA.
    4. Install the root CA as a Trusted Root Certificate on my local machine.

    This allows me to connect to the iLO interface without getting SSL errors, which is enough for me.

    A lot has changed in 10 years…

    The iLO 4 interface got a nice facelift over the past 10 years. A new REST API lets me pull data from my server, including power and thermal data. Most importantly, the Remote Console got an upgrade to an HTML5 interface, which means I do not have to rely on Internet Explorer anymore. I am pretty happy with the ease of the process, although I do wish I had known about and done this sooner.

  • Breaking an RKE cluster in one easy step

    With Ubuntu’s latest LTS release (22.04, or “Jammy Jellyfish”), I wanted to upgrade my Kubernetes nodes from 20.04 to 22.04. What I had hoped would be an easy endeavor turned out to be a weeks-long process with destroyed clusters and, ultimately, an ETCD issue.

    The Hypothesis

    As I viewed it, I had two paths to this upgrade: in-place upgrade on the nodes, or bring up new nodes and decommission the old ones. As the latter represents the “Yellowstone” approach, I chose that one. My plan seemed simple:

    • Spin up new Ubuntu 22.04 nodes using Packer.
    • Add the new nodes to the existing clusters, assigning the new nodes all the necessary roles (I usually have 1 controlplane, 3 etcd, and all nodes as worker)
    • Remove the controlplane role from the old node and verify connectivity
    • Remove the old nodes (cordon, drain, and remove)

    Internal Cluster

    After updating my Packer scripts for 22.04, I spun up new nodes for my internal cluster, which has an ELK stack for log collection. I added the new nodes without a problem, and thought that maybe I could combine the last two steps and just remove all the nodes at the same time.

    That ended up with the RKE CLI getting stuck checking ETCD health. I may have gotten a little impatient and just killed the RKE CLI process mid-job. This left me with, well, a dead internal cluster. So, I recovered the cluster (see my note on cluster recovery below) and thought I’d try again with my non-production cluster.

    Non-production Cluster

    Some say that the definition of insanity is doing the same thing and expecting a different result. My logic, however, was that I made two mistakes the first time through:

    • Trying to remove the controlplane alongside the etcd nodes in the same step
    • Killing the RKE CLI command mid-stream

    So I spun up a few new non-production nodes, added them to the cluster, and simply removed controlplane from the old node.

    Success! My new controlplane node took over, and cluster health seemed good. And, in the interest of only changing one variable at a time, I decided to try and remove just one old node from the cluster.

    Kaboom….

    The same etcd issue as before. So I recovered the cluster and returned to the drawing board.

    Testing

    At this point, I only had my Rancher/Argo cluster and my production cluster left, the latter of which houses, among other things, this site. I had no desire for wanton destruction of these clusters, so I set up a test cluster to see if I could replicate my results. I was able to, at which point I turned to the RKE project on GitHub for help.

    After a few days, someone pointed me to a relatively new Rancher issue describing my predicament. If you read through those various issues, you’ll find that etcd 3.5 has an issue where node removal can corrupt its database, causing issues such as mine. The issue was corrected in 3.5.3.

    I upgraded my RKE CLI and ran another test with the latest Rancher Kubernetes version. This time, finally, success! I was able to remove etcd nodes without crashing the cluster.

    Finishing up / Lessons Learned

    Before doing anything, I upgraded all of my clusters to the latest supported Kubernetes version. In my case, this is v1.23.6-rancher1-1. Following the steps above, I was, in fact, able to progress through upgrading both my Rancher/Argo cluster and my production cluster without bringing down the clusters.

    Lessons learned? Well, patience is key (don’t kill cluster management processes mid-effort), but also, sometimes it is worth a test before you try things. Had any of these clusters NOT been my home lab clusters, this process, seemingly simple, would have incurred downtime in more important systems.

    A note on Cluster Recovery

    For both the internal and non-production clusters, I could have scrambled to recover the ETCD volume for each cluster and brought it back to life. But I realized that there was no real valuable data in either cluster. The ELK logs are useful in real time, but I have not started down the path of analyzing history, so I didn’t mind losing them. And even those are on my SAN, and the PVCs get archived when no longer in use.

    Instead of a long, drawn out recovery process, I simply stood up brand new clusters, pointed my instance of Argo to them and updated my Argo applications to deploy to the new cluster. Inside of an hour, my apps were back up and running. This is something of a testament to the benefits of storing a cluster’s state in a repository: recreation was nearly automatic.

  • Tech Tip – Azure DevOps Pipelines Newline handling

    Just a quick note: It would seem that somewhere between Friday, April 29, 2022 and Monday, May 2, 2022, Azure DevOps pipelines changed their handling of newlines in YAML literal blocks. The change caused our pipelines to stop executing with the following error:

    While scanning a literal block scalar, found extra spaces in first line

    What caused it? Literal blocks whose first line is blank (or whitespace-only), as in the first step below.

    - powershell: |
        
        Write-Host "Azure Fails on this now"
      displayName: Bad Script
      
    - powershell: |
        Write-Host "Azure works with this"
      displayName: Good Script