Author: Matt

  • Kubernetes Observability, Part 1 – Collecting Logs with Loki

    This post is part of a series on observability in Kubernetes clusters.

    I have been spending an inordinate amount of time wrapping my head around Kubernetes observability for my home lab. Rather than consolidate all of this into a single post, I am going to break up my learnings into bite-sized chunks. We’ll start with collecting cluster logs.

    The Problem

    Good containers generate a lot of logs. Outside of getting into the containers via kubectl, logging is the primary mechanism for identifying what is happening within a particular container. We need a way to collect the logs from various containers and consolidate them in a single place.

    My goal was to find a log aggregation solution that gives me insights into all the logs in the cluster, without needing special instrumentation.

    The Candidates – ELK versus Loki

    For a while now, I have been running an ELK (Elasticsearch/Logstash/Kibana) stack locally. My hobby projects utilized an Elasticsearch sink for Serilog to ship logs directly from those containers to Elasticsearch. I found that I could install Filebeat into the cluster and ship all container logs to Elasticsearch, which allowed me to gather the logs across containers.

    ELK

    Elasticsearch is a beast. Its capabilities are quite impressive as a document and index solution. But those capabilities make it really heavy for what I wanted, which was “a way to view logs across containers.”

    For a while, I have been running an ELK instance on my internal tools cluster. It has served its purpose: I am able to report logs from Nginx via Filebeat, and my home containers are built with a Serilog sink that reports logs directly to Elasticsearch. I recently discovered how to install Filebeat onto my K8s clusters, which allows it to pull logs from the containers. This, however, exploded my Elasticsearch instance.

    Full disclosure: I’m no Elasticsearch administrator. Perhaps, with proper experience, I could make that work. But Elastic felt heavy, and I didn’t feel like I was getting value out of the data I was collecting.

    Loki

    A few of my colleagues pointed to Grafana Loki as a potential solution for log aggregation, so I attempted an installation to compare the two solutions.

    Loki is a log aggregation system which provides log storage and querying. It is not limited to Kubernetes: there are a number of official clients for sending logs, as well as some unofficial third-party ones. Loki stores your incoming logs (see Storage below), creates indices on some of the log metadata, and provides a custom query language (LogQL) to allow you to explore your logs. Loki integrates with Grafana for visual log exploration, LogCLI for command-line searches, and Prometheus Alertmanager for routing alerts based on logs.
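
    As a quick illustration of LogQL (the label names here are placeholders and depend entirely on what your log client attaches to each stream), a query for error lines from a particular namespace looks something like this:

    {namespace="ingress-nginx"} |= "error"

    The selector in braces matches log streams by label, and the |= operator filters for lines containing the given text.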

    One of the clients, Promtail, can be installed on a cluster to scrape logs and report them back to Loki. My colleague suggested a Loki instance on each cluster. I found a few discussions around this in Grafana’s GitHub issues, which led to a pivotal question.

    Loki per Cluster or One Loki for All?

    I laughed a little as I typed this, because the notion of “multiple Lokis” is explored in a decidedly different context in the Marvel series. My options were less exciting: do I have one instance of Loki that collects data from clients across all the clusters, or do I let each cluster have its own instance of Loki and use Grafana to attach to multiple data sources?

    Why consider Loki on every cluster?

    Advantages

    Decreased network chatter – If every cluster has a local Loki instance, then logs for that cluster do not have far to go, which means minimal external network traffic.

    Localized Logs – Each cluster would be responsible for storing its own log information, so finding logs for a particular cluster is as simple as going to the cluster itself.

    Disadvantages

    Non-centralized – There is no way to query logs across clusters without some additional aggregator (like another Loki instance), which would lead to duplicative data storage.

    Maintenance Overhead – Each cluster’s Loki instance must be managed individually. This can be automated using ArgoCD and the cluster generator, but it still means that every cluster has to run Loki.

    Final Decision?

    For my home lab, Loki fits the bill. The installation was easy-ish (if you are familiar with Helm and not afraid of community forums), and once configured, it gave me the flexibility I needed with easy, declarative maintenance. But, which deployment method?

    Honestly, I was leaning a bit towards the individual Loki instances. So much so that I configured Loki as a cluster tool and deployed it to all of my clusters. And that worked, although swapping around Grafana data sources for various logs started to get tedious. And, when I thought about where I should report logs for other systems (like my Raspberry Pi-based Nginx proxy), doubts crept in.

    If I were using an ELK stack, I certainly would not put an instance of Elasticsearch on every cluster. While Loki is a little lighter than Elasticsearch, it is still heavy enough to be worthy of a single, properly configured instance. So I removed the cluster-specific Loki instances and configured one central instance.

    With Promtail deployed via an ArgoCD ApplicationSet with a cluster generator, I was off to the races.
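
    For reference, a minimal sketch of that kind of ApplicationSet is below. The chart version, namespace, and Loki URL are placeholders, and the Promtail chart’s values layout varies between chart versions, so treat this as a starting point rather than a drop-in definition.

    apiVersion: argoproj.io/v1alpha1
    kind: ApplicationSet
    metadata:
      name: promtail
      namespace: argocd
    spec:
      generators:
        - clusters: {}                  # one Application per cluster registered in Argo
      template:
        metadata:
          name: 'promtail-{{name}}'
        spec:
          project: default
          source:
            repoURL: https://grafana.github.io/helm-charts
            chart: promtail
            targetRevision: 6.15.3      # placeholder chart version
            helm:
              values: |
                config:
                  clients:
                    - url: http://loki.lab.local:3100/loki/api/v1/push   # placeholder Loki endpoint
          destination:
            server: '{{server}}'
            namespace: monitoring
          syncPolicy:
            automated:
              prune: true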

    A Note on Storage

    Loki has a few storage options, with the majority being cloud-based. At work, with storage options in both Azure and GCP, this is a non-issue. But for my home lab, well, I didn’t want to burn cash storing logs when I have a perfectly good SAN sitting at home.

    My solution there was to stand up an instance of MinIO to act as S3 storage for Loki. Now, could I have run MinIO on Kubernetes? Sure. But, in all honesty, it got pretty confusing pretty quickly, and I was more interested in getting Loki running. So I spun up a Hyper-V machine with Ubuntu 22.04 and started running MinIO. Maybe one day I’ll work on getting MinIO running on one of my K8s clusters, but for now, the single machine works just fine.
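
    For anyone heading down the same path, pointing Loki at MinIO looks roughly like the following in the grafana/loki Helm chart values. This is a hedged sketch: the key layout differs between the loki and loki-stack charts (and between chart versions), and the endpoint, bucket names, and credentials are placeholders.

    loki:
      storage:
        type: s3
        bucketNames:
          chunks: loki-chunks
          ruler: loki-ruler
          admin: loki-admin
        s3:
          endpoint: http://minio.lab.local:9000   # placeholder MinIO endpoint
          accessKeyId: loki                       # placeholder credentials
          secretAccessKey: notmyrealsecret
          s3ForcePathStyle: true                  # path-style addressing for MinIO
          insecure: true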

  • An Impromptu Home Lab Disaster Recovery Session

    It has been a rough 90 days for my home lab. We have had a few unexpected power outages which took everything down, and after those unexpected outages, things came back up. Over the weekend, I was doing some electrical work outside, wiring up outlets and lighting. Being safety conscious, I went inside and killed the power to the breaker I was tying into, not realizing it was the same breaker that the server was on. My internal dialog went something like this:

    • “Turn off breaker Basement 2”
    • ** clicks breaker **
    • ** Hears abrupt stop of server fans **
    • Expletive….

    When trying to recover from that last sequence, I ran into a number of issues.

    • I’m CRUSHING that server when it comes back up: having 20 VMs attempting to start simultaneously is causing a lot of resource contention.
    • I had to run fsck manually on a few machines to get them back up and running.
    • Even after getting the machines running, ETCD was broken on two of my four clusters.

    Fixing my Resource Contention Issues

    I should have done this from the start, but all of my VMs had their Automatic Start Action set to Restart if previously running. That’s great in theory, but, in practice, starting 20 or so VMs on the same hypervisor is not recommended.

    Part of Hyper-V’s Automatic Start Action panel is an Automatic Startup Delay. In PowerShell, it is the AutomaticStartDelay property on the VirtualMachine object (what’s returned from a Get-VM call). My ultimate goal was to set that property to stagger-start my VMs. I could have set it manually and been done in a few minutes. But how do I manage that when I spin up new machines? And can I store some information on the VM so I can recalculate that value as I play around with how long each VM needs to start up?

    Groups and Offsets

    All of my VMs can be grouped based on importance. And it would have been easy enough to start 2-3 VMs in group 1, wait a few minutes, then do group 2, etc. But I wanted to be able to assign offsets within the groups to better address contention. In an ideal world, the machines would come up sequentially to a point, and then 2 or 3 at a time after the main VMs have started. So I created a very simple JSON object to track this:

    {
      "startGroup": 1,
      "delayOffset": 120
    }

    There is a free-text Notes field on the VirtualMachine object, so I used that to set a startGroup and delayOffset for each of my VMs. Using a string of PowerShell commands, I was able to get a tabular output of my custom properties:

    get-vm | Select Name, State, AutomaticStartDelay, @{n='ASDMin';e={$_.AutomaticStartDelay / 60}}, @{n='startGroup';e= {(ConvertFrom-Json $_.Notes).startGroup}}, @{n='delayOffset';e= {(ConvertFrom-Json $_.Notes).delayOffset}} | Sort-Object AutomaticStartDelay | format-table
    • Get-VM – Get a list of all the VMs on the machine
    • Select Name, ... – The Select statement (alias for Select-Object) pulls values from the object. Two of the calculated properties parse the Notes field as JSON to pull out startGroup and delayOffset.
    • Sort-Object – Sort the list by the AutomaticStartDelay property
    • Format-Table – Format the response as a table.
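
    For completeness, writing that JSON into the Notes field is a one-liner with Set-VM; a minimal sketch (the VM name and values are illustrative):

    # Store the startup metadata as compact JSON in the VM's Notes field
    Set-VM -Name "DC01" -Notes (@{ startGroup = 1; delayOffset = 0 } | ConvertTo-Json -Compress)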

    At that point, the VM had its startGroup and delayOffset, but how can I set the AutomaticStartDelay based on those? More PowerShell!

    get-vm | Select Name, State, AutomaticStartDelay, @{n='startGroup';e= {(ConvertFrom-Json $_.Notes).startGroup}}, @{n='delayOffset';e= {(ConvertFrom-Json $_.Notes).delayOffset}} |? {$_.startGroup -gt 0} | % { set-vm -name $_.name -AutomaticStartDelay ((($_.startGroup - 1) * 480) + $_.delayOffset) }

    The first two commands are the same as the above, but after that:

    • ? {$_.startGroup -gt 0} – Use Where-Object (? alias) to select VMs with a startGroup value
    • % { set-vm -name ... } – Use ForEach-Object (% alias) to set the AutomaticStartDelay on each VM in that group.

    In the command above, I hard-coded the AutomaticStartDelay to the following formula:

    ((startGroup - 1) * 480) + delayOffset

    With this formula, the server will wait 480 seconds (8 minutes) between groups, plus whatever delay I choose within the group. As an example, my domain controllers carry the following values:

    # Primary DC
    {
      "startGroup": 1,
      "delayOffset": 0
    }
    # Secondary DC
    {
      "startGroup": 1,
      "delayOffset": 120
    }

    The calculated delays for my domain controllers are 0 and 120 seconds, respectively. The next group won’t start until the 480-second (8-minute) mark, which gives my DCs 8 minutes on their own to boot up.

    Now, there will most likely be some tuning involved in this process, which is where my complexity becomes helpful: say I can boot 2-3 machines every 3 minutes… I can just re-run the population command with a new formula.

    Did I over-engineer this? Probably. But the point is, use AutomaticStartDelay if you are running a lot of VMs on a Hypervisor.

    Restoring ETCD

    Call it fate, but that last power outage ended up causing ETCD issues. I had to run fsck manually on a few of my servers to repair their file systems, and even when the servers were up and running, two of my clusters had problems with their ETCD services.

    In the past, my solution to this was “nuke the cluster and rebuild,” but I am trying to be a better Kubernetes administrator, so this time, I took the opportunity to actually read the troubleshooting documentation that Rancher provides.

    Unfortunately, I could not get past “step one”: ETCD was not running. Knowing that it was most likely a corruption of some kind and that I had a relatively up-to-date ETCD snapshot, I did not burn too much time before going to the restore.

    rke etcd snapshot-restore --name snapshot_name_etcd --config cluster.yml

    That command worked like a charm, and my clusters were back up and running.

    To Do List

    I have a few things on my to-do list following this adventure:

    1. Move ETCD snapshots off of the VMs and onto the SAN. I would have had a lot of trouble bringing ETCD back up if those snapshots were not available because the node they were on went down.
    2. Update my Packer provisioning scripts to include writing the startup configuration to the VM notes.
    3. Build an API wrapper that I can run on the server to manage the notes field.

    I am somewhat interested in testing how the AutomaticStartDelay changes will affect my server boot time. However, I am planning on doing that on a weekend morning during a planned maintenance, not on a random Thursday afternoon.

  • Creating a simple Nginx-based web server image

    One of the hardest parts of blogging is identifying topics. I sometimes struggle with identifying things that I have done that would be interesting or helpful to others. In trying to establish a “rule of thumb” for such decisions, I think things that I have done at least twice qualify as potential topics. As it so happens, I have had to construct simple web server containers twice in the last few weeks.

    The Problem

    Very simply, I wanted to be able to build a quick and painless container to host some static web sites. They are mostly demo sites for some of the UI libraries that we have been building. One is raw HTML, the other is built using Storybook.js, but both end up being a set of HTML/CSS/JS files to be hosted.

    Requirements

    The requirements for this one are pretty easy:

    • Host a static website
    • Do not run as root

    There was no requirement to be able to change the content outside of the image: changes would be handled by building a new image.

    My Solution

    I have become generally familiar with Nginx for a variety of uses. It serves as a reverse proxy for my home lab and is my go-to ingress controller for Kubernetes. Since I am familiar with its configuration, I figured it would be a good place to start.

    Quick But Partial Success

    The “do not run as root” requirement led me to the Nginx unprivileged image. With that as a base, I tried something pretty quick and easy:

    # Dockerfile
    FROM nginxinc/nginx-unprivileged:1.20 as runtime

    COPY output/ /usr/share/nginx/html

    Where output contains the generated HTML files that I wanted to host.

    This worked great for the first page that loaded. However, links to other pages within the site kept coming back from Nginx with :8080 as the port. Our networking configuration offloads SSL outside of the cluster and uses ingress within the cluster, so I did not want any port forwarding at all.

    Custom Configuration Completes the Set

    At that point, I realized that I needed to configure Nginx to disable the port redirects and then include the new configuration in my container. So I traipsed through the documentation for the Nginx containers. As it turns out, the easiest way to configure these images is to replace the default.conf file in the /etc/nginx/conf.d folder.

    So I went about creating a new Nginx config file with the appropriate settings:

    server {
      listen 8080;
      server_name localhost;
      port_in_redirect off;

      location / {
        root  /usr/share/nginx/html;
        index index.html index.htm;
      }

      error_page 500 502 503 504 /50x.html;
      location = /50x.html {
        root /usr/share/nginx/html;
      }
    }

    From there, my Dockerfile changed only slightly:

    # Dockerfile
    FROM nginxinc/nginx-unprivileged:1.20 as runtime
    COPY nginx/default.conf /etc/nginx/conf.d/default.conf
    COPY output/ /usr/share/nginx/html

    Success!

    With those changes, the image built with the appropriate files and the links no longer had the port redirect. Additionally, my containers are not running as root, so I do not run afoul of our cluster’s policy management rules.

    Hope this helps!

  • Nginx Reverse proxy: A slash makes all the difference.

    I have been doing some work to build up some standard processes for Kubernetes. ArgoCD has become a big part of that, as it allows us to declaratively manage the state of our clusters.

    After recovering from a small blow-up in the home lab (post coming), I wanted to migrate my cluster tools to utilize the label selector feature of the ApplicationSet’s Cluster Generator. Why? It allows me to selectively manage tools in the cluster. After all, not every cluster needs the nfs-external-provisioner to provide a StorageClass for pod file storage.
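
    As a rough sketch (the label name is made up), the generator portion of such an ApplicationSet can filter on labels applied to the cluster secrets, so a tool only targets the clusters that opt in:

    generators:
      - clusters:
          selector:
            matchLabels:
              nfs-provisioner: "enabled"   # hypothetical label on the cluster secret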

    As part of this, I wanted to deploy tools to the local Argo cluster. In order to do that, the local cluster needs a secret. I tried to follow the instructions, but when I clicked to view the cluster details, I got a 404 error. I dug around the logs, and my request wasn’t even getting to the Argo application server container.

    When I looked at the Ingress controller logs, it showed the request looked something like this:

    my.url.com/api/v1/clusters/http://my.clusterurl.svc

    Obviously, that’s not correct. The call coming from the UI is this:

    my.url.com/api/v1/clusters/http%3A%2F%2Fmy.clusterurl.svc

    Notice the encoding: My Nginx reverse proxy (running on a Raspberry Pi outside my main server) was decoding the request before passing it along to the cluster.

    The question was, why? A quick Google search led right to the Nginx documentation:

    • If proxy_pass is specified with URI, when passing a request to the server, part of a normalized request URI matching the location is replaced by a URI specified in the directive
    • If proxy_pass is specified without URI, a request URI is passed to the server in the same form as sent by a client when processing an original request

    What? Essentially, it means that a trailing slash (or any URI) on the proxy_pass target dictates whether Nginx rewrites the request URI before passing it along.

    ## With a slash at the end, the client request is normalized
    location /name/ {
        proxy_pass http://127.0.0.1/remote/;
    }
    
    ## Without a slash, the request is as-is
    location /name/ {
        proxy_pass http://127.0.0.1;
    }

    Once I removed the slash, the Argo UI was able to load the cluster details correctly.

  • The cascading upgrade continues…

    I mentioned in a previous post that I made the jump to Windows 11 on both my work and home laptops. As it turns out, this is causing me to re-evaluate some of my other systems and upgrade them as needed.

    Old (really old) Firmware

    The HP ProLiant Gen8 server I have been running for a few years had a VERY old iLO firmware version on it. When I say “very old,” I mean circa 2012. For all intents and purposes, it did its job. The few times I use it, however, are the most critical: the domain controller VMs won’t come up, and I need the remote console to log in to the server and restart something.

    This particular version of the iLO firmware worked best in Internet Explorer, particularly for the remote access portion. Additionally, I had never taken the time to create a proper SSL certificate for the iLO interface, which usually meant a few refreshes were required to get in.

    In Windows 11 and Edge, this just was not possible. The security settings prevented access over the invalid SSL certificate. Additionally, remote access required a ClickOnce application running on .NET Framework 3.5. So even if I got past the invalid SSL warning (which I did), the remote console would not work.

    Time for an upgrade?

    When I first set up this server in 2018, I vaguely remember looking for and not finding firmware updates for the iLO. Clearly I was mistaken: the Gen8 runs iLO 4, which has firmware updates as recent as April of 2022. After reading through the release notes and installation instructions, I felt pretty confident that this upgrade would solve my issue.

    The upgrade process was pretty easy: extract the .bin firmware from the installer, upload via the iLO interface, and wait a bit for the iLO to restart. At that point, I was up and running with a new and improved interface.

    Solving the SSL Issue

    The iLO generates a self-signed, but backdated, SSL certificate. You can customize it, but only by generating a CSR via the iLO interface, getting a certificate back, and importing that certificate into the iLO. I really did not want to go through the hassle of standing up a proper certificate authority, or figure out how to use Let’s Encrypt to fulfill the CSR, so I took a slightly different path.

    1. Generate a self-signed root CA Certificate.
    2. Generate the CSR from the iLO interface.
    3. Sign the CSR with the self-signed root CA (see the sketch after this list).
    4. Install the root CA as a Trusted Root Certificate on my local machine.
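
    Since steps 1 and 3 are just OpenSSL work, here is a rough sketch of what they can look like (the file names, lifetimes, and subject are illustrative):

    # Step 1: create a self-signed root CA (private key plus certificate)
    openssl req -x509 -new -nodes -newkey rsa:2048 -sha256 -days 3650 \
      -keyout homelab-ca.key -out homelab-ca.crt -subj "/CN=Home Lab Root CA"

    # Step 3: sign the CSR downloaded from the iLO with that root CA
    openssl x509 -req -in ilo.csr -CA homelab-ca.crt -CAkey homelab-ca.key \
      -CAcreateserial -out ilo.crt -days 825 -sha256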

    This allows me to connect to the iLO interface without getting SSL errors, which is enough for me.

    A lot has changed in 10 years…

    The iLO 4 interface got a nice facelift in the past 10 years. A new REST API lets me get to some data from my server, including power and thermal data. Most importantly, the Remote Console got an upgrade to an HTML5 interface, which means I do not have to rely on Internet Explorer anymore. I am pretty happy with the ease of the process, although I do wish I had known about and done this sooner.

  • Breaking an RKE cluster in one easy step

    With the release of Ubuntu’s latest LTS (22.04, or “Jammy Jellyfish”), I wanted to upgrade my Kubernetes nodes from 20.04 to 22.04. What I had hoped would be an easy endeavor turned out to be a weeks-long process with destroyed clusters and, ultimately, an ETCD issue.

    The Hypothesis

    As I viewed it, I had two paths to this upgrade: in-place upgrade on the nodes, or bring up new nodes and decommission the old ones. As the latter represents the “Yellowstone” approach, I chose that one. My plan seemed simple:

    • Spin up new Ubuntu 22.04 nodes using Packer.
    • Add the new nodes to the existing clusters, assigning the new nodes all the necessary roles (I usually have one control_plane node, three etcd nodes, and all nodes as workers; see the sketch after this list).
    • Remove the control_plane from the old node and verify connectivity
    • Remove the old nodes (cordon, drain, and remove)
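
    For context, those role assignments live in RKE’s cluster.yml (where the control plane role is spelled controlplane). A trimmed sketch of the target layout, with placeholder addresses and SSH user, looks something like this:

    nodes:
      - address: 10.0.0.21
        user: ubuntu
        role: [controlplane, etcd, worker]
      - address: 10.0.0.22
        user: ubuntu
        role: [etcd, worker]
      - address: 10.0.0.23
        user: ubuntu
        role: [etcd, worker]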

    Internal Cluster

    After updating my Packer scripts for 22.04, I spun up new nodes for my internal cluster, which runs an ELK stack for log collection. I added the new nodes without a problem, and thought that maybe I could combine the last two steps and just remove all of the old nodes at the same time.

    That ended up with the RKE CLI getting stuck checking ETCD health. I may have gotten a little impatient and just killed the RKE CLI process mid-job. This left me with, well, a dead internal cluster. So, I recovered the cluster (see my note on cluster recovery below) and thought I’d try again with my non-production cluster.

    Non-production Cluster

    Some say that the definition of insanity is doing the same thing and expecting a different result. My logic, however, was that I made two mistakes the first time through:

    • Trying to remove the controlplane alongside the etcd nodes in the same step
    • Killing the RKE CLI command mid-stream

    So I spun up a few new non-production nodes, added them to the cluster, and simply removed the controlplane role from the old node.

    Success! My new controlplane node took over, and cluster health seemed good. And, in the interest of only changing one variable at a time, I decided to try and remove just one old node from the cluster.

    Kaboom….

    Same ETCD issue as before. So I recovered the cluster and returned to the drawing board.

    Testing

    At this point, I only had my Rancher/Argo cluster and my production cluster, which houses, among other things, this site. I had no desire for wanton destruction of these clusters, so I set up a test cluster to see if I could replicate my results. I was able to, at which point I turned to the RKE project on GitHub for help.

    After a few days, someone pointed me to a relatively new Rancher issue describing my predicament. If you read through those various issues, you’ll find that etcd 3.5 has an issue where node removal can corrupt its database, causing issues such as mine. The issue was corrected in 3.5.3.

    I upgraded my RKE CLI and ran another test with the latest Rancher Kubernetes version. This time, finally, success! I was able to remove etcd nodes without crashing the cluster.

    Finishing up / Lessons Learned

    Before doing anything, I upgraded all of my clusters to the latest supported Kubernetes version. In my case, this is v1.23.6-rancher1-1. Following the steps above, I was, in fact, able to progress through upgrading both my Rancher/Argo cluster and my production cluster without bringing down the clusters.

    Lessons learned? Well, patience is key (don’t kill cluster management processes mid-effort), but also, sometimes it is worth a test before you try things. Had any of these clusters NOT been my home lab clusters, this process, seemingly simple, would have incurred downtime in more important systems.

    A note on Cluster Recovery

    For both the internal and non-production clusters, I could have scrambled to recover the ETCD volume for each cluster and brought it back to life. But I realized that there was no real valuable data in either cluster. The ELK logs are useful in real time, but I have not started down the path of analyzing history, so I didn’t mind losing them. And even those live on my SAN, and the PVCs get archived when no longer in use.

    Instead of a long, drawn out recovery process, I simply stood up brand new clusters, pointed my instance of Argo to them and updated my Argo applications to deploy to the new cluster. Inside of an hour, my apps were back up and running. This is something of a testament to the benefits of storing a cluster’s state in a repository: recreation was nearly automatic.

  • Tech Tip – Azure DevOps Pipelines Newline handling

    Just a quick note: It would seem that somewhere between Friday, April 29, 2022 and Monday, May 2, 2022, Azure DevOps pipelines changed their handling of newlines in YAML literal blocks. The change caused our pipelines to stop executing with the following error:

    While scanning a literal block scalar, found extra spaces in first line

    What caused it? Literal block (|) definitions that start with a blank, whitespace-only line, as in the first example below.

    - powershell: |
        
        Write-Host "Azure Fails on this now"
      displayName: Bad Script
      
    - powershell: |
        Write-Host "Azure works with this"
      displayName: Good Script

  • The trip to Windows 11

    The past five weeks have been a blur. Spring soccer is in full swing, and my time at the keyboard has been limited to mostly work. A PC upgrade at work started a trickle-down upgrade, and with things starting to settle it’s probably worth a few notes.

    Swapping out the work laptop

    My old work laptop, a Dell Precision 7510, was about a year over our typical 4 year cycle on hardware. I am usually not one to swap right at 4 years: if the hardware is working for me, I’ll run it till it dies.

    Well, I was clearly killing the 7510: every so often I would get random shutdowns triggered by the thermal protection framework. In other words, to quote Ricky Bobby, “I’m on fire!!!” As the shutdowns became more frequent, I put in a request for a new laptop. Additionally, since my 7510 was so old, it couldn’t be redistributed, so I requested to buy it back. My personal laptop, an HP Envy 17, is approaching the 11-year mark, so I believe I’ve used it enough.

    I was provisioned a Dell Precision 7550. As it came to me clean with Windows 10, I figured I’d upgrade it to Windows 11 before I reinstalled my apps and tools. That way, if I broke it, I would know before I wasted my time. The upgrade was pretty easy, and outside of some random issues with MS Teams and audio setup, I was back at work with minimal disruption.

    On to the “new old” laptop

    My company approved the buyback of my old Precision 7510, but I did have to ship it back to have it wiped and prepped. Once I got it back, I turned it on and….. boot media failure.

    Well, crap. As it turns out, the M.2 SSD died on the 7510. So, off to Amazon to pick up a 1TB replacement. A day later, new M.2 in hand, I was off to the races.

    I put in my USB drive with Windows 11 media, booted, and got the wonderful message that my hardware did not meet requirements. As it turns out, my TPM module was an old firmware version (1.2 instead of the required 2.0), and I was running an old processor that is not in the officially supported list for Windows 11.

    So I installed Windows 10 and tried to boot, but, well, that just failed. As it turns out, the BIOS was set up for legacy boot with Secure Boot disabled, both of which had to change for Windows 11. And, after changing the BIOS, my installed copy of Windows 10 wouldn’t boot. I suppose I could have taken the time to repair the boot loader to get it working… but I just re-installed Windows 10 again.

    So after the second install of Windows 10, I was able to update the TPM Firmware to 2.0, bypass the CPU check, and install Windows 11. I started installing my standard library of tools, and things seemed great.

    Still on fire

    Windows 11, however, seems to have exacerbated the overheating issue. It came to a head when I tried to install JetBrains Rider: every install attempt caused the machine to overheat.

    I found a few different solutions in separate forums, but the working fix was to disable the “Turbo Boost” setting in the BIOS. My assumption is that it is some form of automatic overclocking, and removing that setting has stabilized my system.

    Impressions

    So far, so good with Windows 11. Of all the changes, the ability to have each monitor’s taskbar show only the windows on that monitor is great, but I am still getting used to that functionality. Organizationally it is accurate, but muscle memory is hard to break.

  • GitOps: Moving from Octopus to … Octopus?

    Well, not really, but I find it a bit cheeky that ArgoCD’s icon is, in fact, an orange octopus.

    There are so many different ways to provision and run Kubernetes clusters that, without some sense of standardization across the organization, Kubernetes can become an operations nightmare. And while a well-run operations environment can allow application developers to focus on their applications, a lack of standards and best practices for container orchestration can divert resources from application development to figuring out how a particular cluster was provisioned or deployed.

    I have spent a good bit of time at work trying to consolidate our company approach to Kubernetes. Part of this is working with our infrastructure teams to level set on what a “bare” cluster looks like. The other part is standardizing how the clusters are configured and managed. For this, a colleague turned me onto GitOps.

    GitOps – Declarative Management

    I spent a LOT of time just trying to understand how all of this works. For me, the thought of “well, in order to deploy, I have to change a Git repo” seemed, well, icky. I like the idea that when code changes, things are built, tested, and deployed. I had a difficult time separating application code from declarative state. Once I removed that mental block, however, I was bound and determined to move my home environments from Octopus Deploy pipelines to ArgoCD.

    I did not go on this journey alone. My colleague Oleg introduced me to ArgoCD and the External Secrets operator. I found a very detailed article with flow details by Florian Dambrine on Medium.com. Likewise, Burak Kurt showed me how to have Argo manage itself. Armed with all of this information, I started my journey.

    The “short short” version

    In an effort to retain the few readers I have, I will do my best to consolidate all of this into a few short paragraphs. I have two types of applications in my clusters: external tools and home-grown apps.

    External Tools

    Prior to all of this, my external tools were managed by running Helm upgrades periodically with values files saved on my PC. These tools included “cluster tools,” which are installed on every cluster, plus things like WordPress and Proget for running the components of this site. Needless to say, this was highly dangerous: had I lost my PC, all those values files would be gone.

    I now have tool definitions stored either in the cluster’s specific Git repo, or in a special “cluster-tools” repository that allows me to define applications that I want installed to all (or some) of my clusters, based on labels in the cluster’s Argo secret. This allows me to update tools by updating the version in my Git repository and committing the change.

    It should be noted that, for these tools, Helm is still used to install/upgrade. More on this later.

    Home Grown Applications

    The home-grown apps had more of a development feel: feature branches push to a test environment, while builds from main get pushed to staging and then, upon my approval in Azure DevOps, pushed to production.

    Prior to the conversion, every build produced a new container image and Helm chart, both of which were published to Proget. From there, Octopus Deploy took care of deploying feature builds to the test environment only, and of deploying to stage and production based on nudges from Azure DevOps tasks.

    Using Florian’s described flow, I created a Helmfile repo for each of my projects, which allowed me to consolidate the application charts into a single repository. Using Helmfile and Helm, I generate manifests that are then committed into the appropriate cluster’s Git repository. Each Helmfile repository has its own pipeline for generating manifests and committing them to the cluster repository, but my individual project pipelines have gotten very simple: build a new version, and update the Helmfile repository to reflect the new image version.
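
    A minimal helmfile.yaml along those lines might look something like this; the repository URL, release name, chart version, and paths are all placeholders:

    repositories:
      - name: proget
        url: https://proget.lab.local/helm/charts

    releases:
      - name: my-app
        namespace: apps
        chart: proget/my-app
        version: 1.4.2            # bumped by the application's build pipeline
        values:
          - values/my-app.yaml

    From there, something like helmfile template --output-dir ./manifests renders the charts into plain manifests for the pipeline to commit.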

    Once the cluster’s repository is updated, Argo takes care of syncing the resources.

    Helm versus Manifests

    I noted that I’m currently using Helm for external tools versus generating manifests (albeit from Helm charts, but still generating manifests) for internal applications. As long as you never actually run a helm install command, Argo will manage the application using manifests generated from the Helm chart. From what I have seen, though, if you have previously run helm install, that Helm release hangs around in the cluster, and its history doesn’t change with new versions. So you can get into an odd state where helm list shows older versions than what is actually installed.

    When using a tool like ArgoCD, you want to let it manage your applications from beginning to end. It will keep your cluster cleaner. For the time being, I am defining external tools using Helm templates, but using Helmfile to expand my home-built applications into manifest files.

  • Tech Tip – Turn on forwarded headers in Nginx

    I have been using Nginx as a reverse proxy for some time. In the very first iteration of my home lab, it lived on a VM and allowed me to point my firewall rules to a single target, and then route traffic from there. It has since been promoted to a dedicated Raspberry Pi in my fight with the network gnomes.

    My foray into Kubernetes in the home lab has brought Nginx in as an ingress controller. While there are many options for ingress, Nginx seems the most prevalent and, in my experience, the easiest to standardize on across a multitude of Kubernetes providers. As we drive to define what a standard K8s cluster looks like across our data centers and public cloud providers, Nginx seemed like a natural choice for our ingress provider.

    Configurable to a fault

    The Nginx Ingress controller is HIGHLY configurable. There are cluster-wide configuration settings that can be controlled through ConfigMap entries. Additionally, annotations can be used on specific Ingress objects to control behavior for an individual ingress.

    As I worked with one team to set up Duende’s Identity Server, we started running into issues with the identity server using http instead of https in its discovery endpoints (such as /.well-known/openid-configuration). Most of our research suggested that the X-Forwarded-* headers needed to be configured (which we did), but we were still seeing the wrong scheme in those endpoints.

    It was a weird problem: I had never run into this issue in my own Identity Server instance, which is running in my home Kubernetes environment. I figured it had to do with an Nginx setting, but had a hard time figuring out which one.

    One blog post pointed me in the right direction. Our Nginx ingress install did not have the use-forwarded-headers setting configured in the ConfigMap, which meant the X-Forwarded-* headers were not being passed to the pod. A quick change of our deployment project, and the openid-configuration endpoint returned the appropriate schemes.

    For reference, we are using the ingress-nginx helm chart. Adding the following to our values file solved the issue:

    controller:
      replicaCount: 2
      
      service:
        ... several service settings
      config:
        use-forwarded-headers: "true"

    Investigation required

    What I do not yet know is, whether or not I randomly configured this at home and just forgot about it, or if it is a default of the Rancher Kubernetes Engine (RKE) installer. I use RKE at home to stand up my clusters, and one of the add-ons I have it configure is ingress with Nginx. Either I have settings in my RKE configuration to forward headers or it’s a default of RKE…. Unfortunately, I am at a soccer tournament this weekend, so the investigation will have to wait until I get home.

    Update:

    Apparently I did know about use-forwarded-headers earlier: it was part of the options I had set in my home Kubernetes clusters. One of many things I have forgotten.
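
    For reference, RKE exposes this through the ingress section of cluster.yml, with the options map applied to the Nginx ingress ConfigMap. Something along these lines (hedged, as exact option handling can vary by RKE version):

    ingress:
      provider: nginx
      options:
        use-forwarded-headers: "true"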