Tag: ArgoCD

  • Simplifying Internal Routing

    Centralizing Telemetry with Linkerd Multi-Cluster

    Running multiple Kubernetes clusters is great until you realize your telemetry traffic is taking an unnecessarily complicated path. Each cluster had its own Grafana Alloy instance dutifully collecting metrics, logs, and traces—and each one was routing through an internal Nginx reverse proxy to reach the centralized observability platform (Loki, Mimir, and Tempo) running in my internal cluster.

    This worked, but it had that distinct smell of “technically functional” rather than “actually good.” Traffic was staying on the internal network (thanks to a shortcut DNS entry that bypassed Cloudflare), but why route through an Nginx proxy when the clusters could talk directly to each other? Why maintain those external service URLs when all my clusters are part of the same infrastructure?

    Linkerd multi-cluster seemed like the obvious answer for establishing direct cluster-to-cluster connections, but the documentation leaves a lot unsaid when you’re dealing with on-premises clusters without fancy load balancers. Here’s how I made it work.

    The Problem: Telemetry Taking the Scenic Route

    My setup looked like this:

    Internal cluster: Running Loki, Mimir, and Tempo behind an Nginx gateway

    Production cluster: Grafana Alloy sending telemetry to loki.mattgerega.net, mimir.mattgerega.net, etc.

    Nonproduction cluster: Same deal, different tenant ID

    Every metric, log line, and trace span was leaving the cluster, hitting the Nginx reverse proxy, and finally making it to the monitoring services—which were running in a cluster on the same physical network. The inefficiency was bothering me more than it probably should have.

    This meant:

    – An unnecessary hop through the Nginx proxy layer

    – Extra TLS handshakes between internal services that added no security value

    – DNS resolution for external service names when direct cluster DNS would suffice

    – One more component in the path that could cause issues

    The Solution: Hub-and-Spoke with Linkerd Multi-Cluster

    Linkerd’s multi-cluster feature does exactly what I needed: it mirrors services from one cluster into another, making them accessible as if they were local. The service mesh handles all the mTLS authentication, routing, and connection management behind the scenes. From the application’s perspective, you’re just calling a local Kubernetes service.

    For my setup, a hub-and-spoke topology made the most sense. The internal cluster acts as the hub—it runs the Linkerd gateway and hosts the actual observability services (Loki, Mimir, and Tempo). The production and nonproduction clusters are spokes—they link to the internal cluster and get mirror services that proxy requests back through the gateway.

    The beauty of this approach is that only the hub needs to run a gateway. The spoke clusters just run the service mirror controller, which watches for exported services in the hub and automatically creates corresponding proxy services locally. No complex mesh federation, no VPN tunnels, just straightforward service-to-service communication over mTLS.

    Gateway Mode vs. Flat Network

    (Spoiler: Gateway Mode Won)

    Linkerd offers two approaches for multi-cluster communication:

    Flat Network Mode: Assumes pod networks are directly routable between clusters. Great if you have that. I don’t. My three clusters each have their own pod CIDR ranges with no interconnect.

    Gateway Mode: Routes cross-cluster traffic through a gateway pod that handles the network translation. This is what I needed, but it comes with some quirks when you’re running on-premises without a cloud load balancer.

    The documentation assumes you’ll use a LoadBalancer service type, which automatically provisions an external IP. On-premises? Not so much. I went with NodePort instead, exposing the gateway on port 30143.

    The Configuration: Getting the Helm Values Right

    Here’s what the internal cluster’s Linkerd multi-cluster configuration looks like:

    linkerd-multicluster:
      gateway:
        enabled: true
        port: 4143
        serviceType: NodePort
        nodePort: 30143
        probe:
          port: 4191
          nodePort: 30191
    
      # Grant access to service accounts from other clusters
      remoteMirrorServiceAccountName: linkerd-service-mirror-remote-access-production,linkerd-service-mirror-remote-access-nonproduction

    And for the production/nonproduction clusters:

    linkerd-multicluster:
      gateway:
        enabled: false  # No gateway needed here
    
      remoteMirrorServiceAccountName: linkerd-service-mirror-remote-access-in-cluster-local

    The Link: Connecting Clusters Without Auto-Discovery

    Creating the cluster link was where things got interesting. The standard command assumes you want auto-discovery:

    linkerd multicluster link --cluster-name internal --gateway-addresses internal.example.com:30143

    But that command tries to do DNS lookups on the combined hostname+port string, which fails spectacularly. The fix was simple once I found it:

    linkerd multicluster link \
      --cluster-name internal \
      --gateway-addresses tfx-internal.gerega.net \
      --gateway-port 30143 \
      --gateway-probe-port 30191 \
      --api-server-address https://cp-internal.gerega.net:6443 \
      --context=internal | kubectl apply -f - --context=production

    Separating --gateway-addresses and --gateway-port made all the difference.

    I used DNS (tfx-internal.gerega.net) instead of hard-coded IPs for the gateway address. This is an internal DNS entry that round-robins across all agent node IPs in the internal cluster. The key advantage: when I cycle nodes (stand up new ones and destroy old ones), the DNS entry is maintained automatically. No manual updates to cluster links, no stale IP addresses, no coordination headaches—the round-robin DNS just picks up the new node IPs and drops the old ones.
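
    For illustration, resolving that name returns the agent nodes in rotating order—something like this (the IPs are hypothetical):

    dig +short tfx-internal.gerega.net
    # 10.1.2.61   <- agent node IPs (hypothetical); order rotates per query
    # 10.1.2.62
    # 10.1.2.63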

    Service Export: Making Services Visible Across Clusters

    Linkerd doesn’t automatically mirror every service. You have to explicitly mark which services should be exported using the mirror.linkerd.io/exported: "true" label.

    For the Loki gateway (and similarly for Mimir and Tempo):

    gateway:
      service:
        labels:
          mirror.linkerd.io/exported: "true"

    Once the services were exported, they appeared in the production and nonproduction clusters with an `-internal` suffix:

    loki-gateway-internal.monitoring.svc.cluster.local

    mimir-gateway-internal.monitoring.svc.cluster.local

    tempo-gateway-internal.monitoring.svc.cluster.local
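
    A quick sanity check from a spoke cluster confirms the mirrors exist and the gateway link is up (namespace and context names are from my setup):

    kubectl get svc -n monitoring --context=production | grep internal
    linkerd multicluster gateways --context=production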

    Grafana Alloy: Switching to Mirrored Services

    The final piece was updating Grafana Alloy’s configuration to use the mirrored services instead of the external URLs. Here’s the before and after for Loki:

    Before:

    loki.write "default" {
      endpoint {
        url = "https://loki.mattgerega.net/loki/api/v1/push"
        tenant_id = "production"
      }
    }

    After:

    loki.write "default" {
      endpoint {
        url = "http://loki-gateway-internal.monitoring.svc.cluster.local/loki/api/v1/push"
        tenant_id = "production"
      }
    }

    No more TLS, no more public DNS, no more reverse proxy hops. Just a direct connection through the Linkerd gateway.

    But wait—there’s one more step.

    The Linkerd Injection Gotcha

    Grafana Alloy pods need to be part of the Linkerd mesh to communicate with the mirrored services. Without the Linkerd proxy sidecar, the pods can’t authenticate with the gateway’s mTLS requirements.

    This turned into a minor debugging adventure because I initially placed the `podAnnotations` at the wrong level in the Helm values. The Grafana Alloy chart is a wrapper around the official chart, which means the structure is:

    alloy:
      controller:  # Not alloy.alloy!
        podAnnotations:
          linkerd.io/inject: enabled
      alloy:
        # ... other config

    Once that was fixed and the pods restarted, they came up with 3 containers instead of 2:

    – `linkerd-proxy` (the magic sauce)

    – `alloy` (the telemetry collector)

    – `config-reloader` (for hot config reloads)

    Checking the gateway logs confirmed traffic was flowing:

    INFO inbound:server:gateway{dst=loki-gateway.monitoring.svc.cluster.local:80}: Adding endpoint addr=10.42.5.4:8080
    INFO inbound:server:gateway{dst=mimir-gateway.monitoring.svc.cluster.local:80}: Adding endpoint addr=10.42.9.18:8080
    INFO inbound:server:gateway{dst=tempo-gateway.monitoring.svc.cluster.local:4317}: Adding endpoint addr=10.42.10.13:4317

    Known Issues: Probe Health Checks

    There’s one quirk worth mentioning: the multi-cluster probe health checks don’t work in NodePort mode. The service mirror controller tries to check the gateway’s health endpoint and reports it as unreachable, even though service mirroring works perfectly.

    From what I can tell, this is because the health check endpoint expects to be accessed through the gateway service, but NodePort doesn’t provide the same service mesh integration as a LoadBalancer. The practical impact? None. Services mirror correctly, traffic routes successfully, mTLS works. The probe check just complains in the logs.
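
    If you want to see the complaint for yourself, the extension’s own checks surface it; in my case everything else passes and only the probe-related check reports a failure. The deployment name below assumes the default naming for a link called internal:

    linkerd multicluster check --context=production
    # the service mirror controller's logs show the same probe errors
    kubectl logs -n linkerd-multicluster deploy/linkerd-service-mirror-internal --context=production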

    What I Learned

    1. Gateway mode is essential for non-routable pod networks. If your clusters don’t have a CNI that supports cross-cluster routing, gateway mode is the way to go.

    2. NodePort works fine for on-premises gateways. You don’t need a LoadBalancer if you’re willing to manage DNS.

    3. DNS beats hard-coded IPs. Using `tfx-internal.gerega.net` means I can recreate nodes without updating cluster links.

    4. Service injection is non-negotiable. Pods must be part of the Linkerd mesh to access mirrored services. No injection, no mTLS, no connection.

    5. Helm values hierarchies are tricky. Always check the chart templates when podAnnotations aren’t applying. Wrapper charts add extra nesting.

    The Result

    Telemetry now flows directly from production and nonproduction clusters to the internal observability stack through Linkerd’s multi-cluster gateway—all authenticated via mTLS, bypassing the Nginx reverse proxy entirely.

    I didn’t reduce the number of monitoring stacks (each cluster still runs Grafana Alloy for collection), but I simplified the routing by using direct cluster-to-cluster connections instead of going through the Nginx proxy layer. No more proxy hops. No more external service DNS. Just three Kubernetes clusters talking to each other the way they should have been all along.

    The full configuration is in the ops-argo and ops-internal-cluster repositories, managed via ArgoCD ApplicationSets. Because if there’s one thing I’ve learned, it’s that GitOps beats manual kubectl every single time.

  • ArgoCD panicked a little…

    I ran into an odd situation last week with ArgoCD, and it took a bit of digging to figure it out. Hopefully this helps someone else along the way.

    Whatever you do, don’t panic!

    Well, unless of course you are ArgoCD.

    I have a small Azure DevOps job that runs nightly and attempts to upgrade some of the Helm charts that I use to deploy external tools. This includes things like Grafana, Loki, Mimir, Tempo, ArgoCD, External Secrets, and many more. This job deploys the changes to my GitOps repositories, and if there are changes, I can manually sync.

    Why not auto-sync, you might ask? Visibility, mostly. I like to see what changes are being applied, in case there is something bigger in the changes that needs my attention. I also like to “be there” if something breaks, so I can rollback quickly.

    Last week, while upgrading Grafana and Tempo, ArgoCD started throwing the following error on sync:

    Recovered from panic: runtime error: invalid memory address or nil pointer

    A quick trip to Google produced a few different results, but nothing immediately apparent. One particular issue mentioned a problem with out-of-date resources (an old apiVersion). Let’s put a pin in that.

    Nothing was jumping out, and my deployments were still working. I had a number of other things on my plate, so I let this slide for a few days.

    Versioning….

    When I finally got some time to dig into this, I figured I would pull at that apiVersion string and see what shook loose. Unfortunately, since the error gives no hint as to which resource is causing it, finding the offender was the luck of the draw. This time, I was lucky.

    My ExternalSecret resources were using some alpha apiVersions, so my first thought was to update them to v1. Lo and behold, that fixed the two charts that were failing.

    This, however, leads to a bigger issue: if ArgoCD is not going to inform me when I have out-of-date apiVersion values for a resource, I am going to have to figure out how to validate these resources sometime before I commit the changes. I’ll put this on my ever-growing to-do list.
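
    Until I settle on tooling, a rough sketch of the kind of pre-commit check I have in mind is to render the umbrella chart and let the API server dry-run it, which at least catches apiVersions the cluster no longer serves (chart path and context are illustrative, and it does require a reachable cluster):

    helm dependency build
    helm template . | kubectl apply --dry-run=server -f - --context=internal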

  • Re-configuring Grafana Secrets

    I recently fixed some synchronization issues that had been silently plaguing some of the monitoring applications I had installed, including my Loki/Grafana/Tempo/Mimir stack. Now that the applications are being updated, I ran into an issue with the latest Helm chart’s handling of secrets.

    Sync Error?

    After I made the change to fix synchronization of the Helm charts, I went to sync my Grafana chart, but received a sync error:

    Error: execution error at (grafana/charts/grafana/templates/deployment.yaml:36:28): Sensitive key 'database.password' should not be defined explicitly in values. Use variable expansion instead.

    I certainly didn’t change anything in those files, and I am already using variable expansion in the values.yaml file anyway. What does that mean? Basically, in the values.yaml file, I used ${ENV_NAME} in areas where I had a secret value, and told Grafana to expand environment variables into the configuration.

    The latest version of the chart doesn’t seem to like this. It treats ANY explicit value in a sensitive field as bad. A search of the Grafana Helm chart repo’s issues list turned up someone with a similar issue, and a comment there linked to the recommended solution.

    Same Secret, New Name

    After reading through the comment’s suggestion and Grafana’s documentation on overriding configuration with environment variables, I realized the fix was pretty easy.

    I already had a Kubernetes secret being populated from Hashicorp Vault with my secret values. I also already had envFromSecret set in the values.yaml to instruct the chart to use my secret. And, through some dumb luck, two of the three values were already named using the standards in Grafana’s documentation.

    So the “fix” was to simply remove the secret expansions from the values.yaml file and rename one of the secretKey values so that it matched Grafana’s environment variable template. You can see the diff of the change in my GitHub repository.
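
    For the curious, the shape of the change was roughly this—names are illustrative, not copied from my repo:

    # ExternalSecret (sketch): the renamed key now follows Grafana’s
    # GF_<SECTION>_<KEY> environment variable convention
    spec:
      target:
        name: grafana-env                   # referenced by envFromSecret in values.yaml
      data:
        - secretKey: GF_DATABASE_PASSWORD   # was DATABASE_PASSWORD, expanded via ${...}
          remoteRef:
            key: grafana
            property: database_password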

    With that change, the Helm chart generated correctly, and once Argo had the changes in place, everything was up and running.

  • Synced, But Not: ArgoCD Differencing Configuration

    Some of the charts in my Loki/Grafana/Tempo/Mimir stack have an odd habit of not updating correctly in ArgoCD. I finally got tired of it and fixed it… I’m just not 100% sure how.

    Ignoring Differences

    At some point in the past, I had customized a few of my Application objects with ignoreDifferences settings. It was meant to tell ArgoCD to ignore fields that are managed by other things, and could change from the chart definition.

    Like what, you might ask? Well, the external-secrets chart generates its own caBundle and sets properties on a ValidatingWebhookConfiguration object. Obviously, that’s managed by the controller, and I don’t want to mess with it. However, I also don’t want ArgoCD to report that chart as Out of Sync all the time.

    So, as an example, my external-secrets application looks like this:

    project: cluster-tools
    source:
      repoURL: 'https://github.com/spydersoft-consulting/ops-argo'
      path: cluster-tools/tools/external-secrets
      targetRevision: main
    destination:
      server: 'https://kubernetes.default.svc'
      namespace: external-secrets
    syncPolicy:
      syncOptions:
        - CreateNamespace=true
        - RespectIgnoreDifferences=true
    ignoreDifferences:
      - group: '*'
        kind: '*'
        managedFieldsManagers:
          - external-secrets

    And that worked just fine. But, with my monitoring stack, well, I think I made a boo-boo.

    Ignoring too much

    When I looked at the application differences for some of my Grafana resources, I noticed that the live vs desired image was wrong. My live image was older than the desired one, and yet, the application wasn’t showing as out of sync.

    At this point, I suspected ignoreDifferences was the issue, so I looked at the Application manifest. For some reason, my monitoring applications had an Application manifest that looked like this:

    project: external-apps
    source:
      repoURL: 'https://github.com/spydersoft-consulting/ops-internal-cluster'
      path: external/monitor/grafana
      targetRevision: main
      helm:
        valueFiles:
          - values.yaml
        version: v3
    destination:
      server: 'https://kubernetes.default.svc'
      namespace: monitoring
    syncPolicy:
      syncOptions:
        - RespectIgnoreDifferences=true
    ignoreDifferences:
      - group: "*"
        kind: "*"
        managedFieldsManagers:
        - argocd-controller
      - group: '*'
        kind: StatefulSet
        jsonPointers:
          - /spec/persistentVolumeClaimRetentionPolicy
          - /spec/template/metadata/annotations/'kubectl.kubernetes.io/restartedAt'

    Notice the part where I am ignoring managed fields from argocd-controller. I have no idea why I added that, but it looks a little “all-inclusive” for my tastes, and it was ONLY present in the ApplicationSet for my LGTM stack. So I commented it out.
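
    What’s left after commenting it out is just the StatefulSet fields that legitimately drift:

    ignoreDifferences:
      - group: '*'
        kind: StatefulSet
        jsonPointers:
          - /spec/persistentVolumeClaimRetentionPolicy
          - /spec/template/metadata/annotations/'kubectl.kubernetes.io/restartedAt'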

    Now We’re Cooking!

    Lo and behold, ArgoCD looked at my monitoring stack and said “well, you have some updates, don’t you!” I spent the next few minutes syncing those applications individually. Why? There are a lot of hard-working pods in those applications, and I don’t like to cycle them all at once.

    I searched through my posts and some of my notes, and I honestly have no idea why I decided I should ignore all fields managed by argocd-controller. Needless to say, I will not be doing that again.

  • More GitOps Fun!

    I have been curating some scripts that help me manage version updates in my GitOps repositories… It’s about time they get shared with the world.

    What’s Going On?

    I manage the applications in my Kubernetes clusters using Argo CD and a number of Git repositories. Most of the ops- repositories act as “desired state” repositories.

    As part of this management, I have a number of external tools running in my clusters that are installed using their Helm charts. Since I want to keep my installs up to date, I needed a way to update the Helm chart versions as new releases came out.

    However, some external tools do not have their own Helm charts. For those, I have been using a Helm library chart from bjw-s. In that case, I have had to manually find new releases and update my values.yaml file.

    While I have had the Helm chart version updates automated for some time, I just recently got around to updating the values.yaml file from external sources. Now is a good time to share!

    The Scripts

    I put the scripts in the ops-automation repository in the Spydersoft organization. I’ll outline the basics of each script, but if you are interested in the details, check out the scripts themselves.

    It is worth noting that these scripts require the git and helm command-line tools to be installed, in addition to the PowerShell YAML module.

    Also, since I manage more than one repository, all of these scripts are designed to be given a basePath and then a list of directory names for the folders that are the Git repositories I want to update.

    Update-HelmRepositoryList

    This script iterates through the given folders to find the Chart.yaml files in them. For every dependency in the found chart files, it adds the repository to the local Helm repository list if the URL is not already present.

    Since I have been running this on my local machine, I only have to do this once. But, on a build agent, this script should be run every time to make sure the repository list contains all the necessary repositories for an update.

    Update-HelmCharts

    This script iterates through the given folders to find the Chart.yaml files in them. For every dependency, the script determines whether an updated version is available.

    If there is an update available, the Chart.yaml file is updated, and helm dependency update is run to update the Chart.lock file. Additionally, commit comments are created to note the version changes.

    For each Chart.yaml file, Update-FromAutoUpdate is called to make additional updates if necessary.

    Update-FromAutoUpdate

    This script looks for a file called auto-update.json in the path given. The file has the following format:

    {
        "repository": "redis-stack/redis-stack",
        "stripVFromVersion": false,
        "tagPath": "redis.image.tag"
    }

    The script looks for the latest release of the repository on GitHub, using GitHub’s tag_name as the version. If the latest release is newer than the current value at tagPath in values.yaml, the script updates that value to the new version. The script returns an object indicating whether or not an update was made, along with a commit comment noting the version jump.
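
    Conceptually, the lookup is the equivalent of this (a sketch, not the actual PowerShell):

    # grab the latest release tag for the configured repository
    curl -s https://api.github.com/repos/redis-stack/redis-stack/releases/latest | jq -r '.tag_name'
    # the script compares that tag to the current value at redis.image.tag in values.yaml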

    Right now, the auto-update only works for images that come from GitHub releases. I have one item (ProGet) that needs to query a Docker registry API directly, but that will be a future enhancement.

    Future Tasks

    Now that these are automated tasks, I will most likely create an Azure Pipeline that runs weekly to get these changes made and committed to Git.

    I have Argo configured to not auto-sync these applications, so even though the changes are made in Git, I still have to manually apply the updates. And I am ok with that. I like to stagger application updates, and, in some cases, make sure I have the appropriate backups before running an update. But this gets me to a place where I can log in to Argo and sync apps as I desire.

  • Tech Tip – You should probably lock that up…

    I have been running into some odd issues with ArgoCD not updating some of my charts, despite the Git repository having an updated chart version. As it turns out, my configuration and lack of Chart.lock files seem to have been contributing to this inconsistency.

    My GitOps Setup

    I have a few repositories that I use as source repositories for Argo. They contain a mix of my own resource definition files, which are raw manifests, and external Helm charts. The external Helm charts use an umbrella chart so I can add supporting resources (like secrets). My Grafana chart is a great example of this.

    Prior to this, I was not including the Chart.lock file in the repository. This made it easier to update the version in the Chart.yaml file without having to run helm dependency update to refresh the lock file. I had been running this setup for at least a year, and I never really noticed much of a problem until recently. There were a few times when things would not update, but nothing systemic.

    And then it got worse

    More recently, however, I noticed that the updates weren’t taking. I saw the issue with both the Loki and Grafana charts: The version was updated, but Argo was looking at the old version.

    I tried hard refreshes on the Applications in Argo, but nothing seemed to clear that cache. I poked around in the logs and noticed that Argo runs helm dependency build, not helm dependency update. That got me thinking “What’s the difference?”

    As it turns out, build operates using the Chart.lock file if it exists; otherwise, it acts like update. update uses the Chart.yaml file to fetch the latest versions that satisfy its constraints.

    Since I was not committing my Chart.lock file, it stands to reason that somewhere in Argo there is a cached copy of a Chart.lock file that was generated by helm dependency build. Even though my Chart.yaml was updated, Argo was using the old lock file.

    Testing my hypothesis

    I committed a lock file 😂! Seriously, I ran helm dependency update locally to generate a new lock file for my Loki installation and committed it to the repository. And, even though that’s the only file that changed, like magic, Loki determined it needed an update.

    So I need to lock it up. But why? Well, the lock file exists to ensure that subsequent builds use the exact versions you specify, similar to npm and yarn lock files. And just like npm and yarn, Helm requires an explicit command to be run to update dependencies.

    By not committing my lock file, the possibility exists that I could get a different version than I intended or, even worse, get a spoofed version of my package. The lock file maintains a level of supply chain security.

    Now what?

    Step 1 is to commit the missing lock files.

    At both work and home I have Powershell scripts and pipelines that look for potential updates to external packages and create pull requests to get those updates applied. So step 2 is to alter those scripts to run helm dependency update when the Chart.yaml is updated, which will update the Chart.lock and alleviate the issue.
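
    For a single chart, the manual version of that workflow is short:

    # after bumping the dependency version in Chart.yaml
    helm dependency update          # refreshes Chart.lock to match Chart.yaml
    git add Chart.yaml Chart.lock
    git commit -m "Update chart dependencies"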

    I am also going to dig into ArgoCD a little bit to see where these generated Chart.lock values could be cached. In testing, the only way around it was to delete the entire ApplicationSet, so I’m thinking that the ApplicationSet controller may be hiding some data.

  • Rollback saved my blog!

    As I was upgrading WordPress from 6.2.2 to 6.3.0, I ran into a spot of trouble. Thankfully, ArgoCD rollback was there to save me.

    It’s a minor upgrade…

    I use the Bitnami WordPress chart as the template source for Argo to deploy my blog to one of my Kubernetes clusters. Usually, an upgrade is literally 1, 2, 3:

    1. Get the latest chart version for the WordPress Bitnami chart. I have a Powershell script for that.
    2. Commit the change to my ops repo.
    3. Go into ArgoCD and hit Sync

    That last one caused some problems. Everything seemed to synchronize, but the WordPress pod stalled at the “connect to database” step. I tried restarting the pod, but nothing changed.

    Now, the old pod was still running. So, rather than mess with it, I used Argo’s rollback functionality to roll the WordPress application back to its previous commit.

    What happened?

    I’m not sure. You are able to upgrade WordPress from the admin panel, but, well, that comes at a potential cost: if you upgrade the database as part of the WordPress upgrade and then “lose” the pod, you lose the application upgrade but not the database upgrade, and you are left in a weird state.

    So, first, I took a backup. Then, I started poking around in trying to run an upgrade. That’s when I ran into this error:

    Unknown command "FLUSHDB"

    I use the WordPress Redis Object Cache to get that little “spring” in my step. It seemed to be failing on the FLUSHDB command. At that point, I was stuck in a state where the application code was upgraded but the database was not. So I restarted the deployment and got back to 6.2.2 for both application code and database.

    Disabling the Redis Cache

    I tried to disable the Redis plugin, and got the same FLUSHDB error. As it turns out, the default Bitnami Redis chart disables these commands, but it would seem that the WordPress plugin still wants them.

    So, I enabled the commands in my Redis instance (a quick change in the values files) and then disabled the Redis Cache plugin. After that, I was able to upgrade to WordPress 6.3 through the UI.
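
    For reference, the values change looked roughly like this—the Bitnami Redis chart disables FLUSHDB and FLUSHALL by default, though the exact key names may vary by chart version:

    master:
      disableCommands: []    # the chart default is ["FLUSHDB", "FLUSHALL"]
    replica:
      disableCommands: []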

    From THERE, I clicked Sync in ArgoCD, which brought my application pods up to 6.3 to match my database. Then I re-enabled the Redis Plugin.

    Some research ahead

    I am going to check with the maintainers of the Redis Object Cache plugin. If they are relying on commands that are disabled by default, it most likely caused some issues in my WordPress upgrade.

    For now, however, I can sleep under the warm blanket of Argo roll backs!

  • Automated RKE2 Cluster Management

    One of the things I like about cloud-hosted Kubernetes solutions is that they take the pain out of node management. My latest home lab goal was to replicate some of that functionality with RKE2.

    Did I do it? Yes. Is there room for improvement? Of course, it’s a software project.

    The Problem

    With RKE1, I have a documented and very manual process for replacing nodes in my clusters. It shapes up like this:

    1. Provision a new VM.
    2. Add a DNS Entry for the new VM.
    3. Edit the cluster.yml file for that cluster, adding the new VM with the appropriate roles to match the outgoing node.
    4. Run rke up
    5. Edit the cluster.yml file for that cluster to remove the old VM.
    6. Run rke up
    7. Modify the cluster’s ingress-nginx settings, adding the new external IP and removing the old one.
    8. Modify my reverse proxy to reflect the IP Changes
    9. Delete the old VM and its DNS entries.

    Repeat the above process for every node in the cluster. Additionally, because the nodes could have slightly different Docker versions or updates, I often found myself provisioning a whole set of VMs at a time and going through this process for all the existing nodes at once. The process was fraught with problems, not the least of which was relying on me to remember every step.

    A DNS Solution

    I wrote a wrapper API to manage Windows DNS settings, and built calls to that wrapper into my Unifi Controller API so that, when I provision a new machine or remove an old one, it will add or remove the fixed IP from Unifi AND add or remove the appropriate DNS record for the machine.

    Since I made DNS entries easier to manage, I also came up with a DNS naming scheme to help manage cluster traffic:

    1. Every control plane node gets an A record with cp-<cluster name>.gerega.net. This lets my kubeconfig files remain unchanged, and traffic is distributed across the control plane nodes via round robin DNS.
    2. Every node gets an A record with tfx-<cluster name>.gerega.net. This allows me to configure my external reverse proxy to use this hostname instead of an individual IP list. See below for more on this from a reverse proxy perspective.

    That solved most of my DNS problems, but I still had issues with the various rke up runs and compatibility worries.

    Automating with RKE2

    The provisioning process for RKE2 is much simpler than that for RKE1. I was able to shift the cluster configuration into the Packer provisioning scripts, which allowed me to do more within the associated PowerShell scripts. This, coupled with the DNS standards above, means that I could run one script and end up with a completely provisioned RKE2 cluster.

    I quickly realized that adding and removing nodes to/from the RKE2 clusters was equally easy. Adding a node simply meant provisioning a new VM with the appropriate scripting to install RKE2 and add it to the existing control plane. Removing a node was simple (the commands are sketched after this list):

    1. Drain the node (kubectl drain)
    2. Delete the node from the cluster (kubectl delete node <node name>).
    3. Delete the VM (and its associated DNS).
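
    In practice, that removal sequence looks roughly like this (node name hypothetical):

    kubectl drain rke2-prod-agent-01 --ignore-daemonsets --delete-emptydir-data
    kubectl delete node rke2-prod-agent-01
    # then delete the VM and its DNS records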

    As long as I had at least one node with the server role running at all times, things worked fine.

    With RKE2, though, I decided to abandon my ingress-nginx installations in favor of using RKE2’s built-in Nginx Ingress. This allows me to skip managing the cluster’s external IPs, as the RKE cluster’s installer handles that for me.

    Proxying with Nginx

    A little over a year ago I posted my updated network diagram, which introduced a hardware proxy in the form of a Raspberry Pi running Nginx. That little box is a workhorse, and plans are in place for a much-needed upgrade. In the meantime, however, it works.

    My configuration was heavily IP based: I would configure upstream blocks with each cluster node’s IP set, and then my sites would be configured to proxy to those IPs. Think something like this:

    upstream cluster1 {
      server 10.1.2.50:80;
      server 10.1.2.51:80;
      server 10.1.2.52:80;
    }
    
    server {
       ## server settings
    
       location / {
         proxy_pass http://cluster1;
         # proxy settings
       }
    }

    The issue here is, every time I add or remove a cluster node, I have to mess with this file. My DNS server is set up for round-robin DNS, which means I should be able to add new A records with the same host name, and the DNS will cycle through the different servers.

    My worry, though, was the Nginx reverse proxy. If I configure the reverse proxy to a single DNS, will it cache that IP? Nothing to do but test, right? So I changed my configuration as follows:

    upstream cluster1 {
      server tfx-cluster1.gerega.net:80;
    }
    
    server {
       ## server settings
    
       location / {
         proxy_pass http://cluster1;
         # proxy settings
       }
    }

    Everything seemed to work, but how can I know it worked? For that, I dug into my Prometheus metrics.

    Finding where my traffic is going

    I spent a bit of time trying to figure out which metrics made the most sense to see the number of requests coming through each Nginx controller. As luck would have it, I always put a ServiceMonitor on my Nginx applications to make sure Prometheus is collecting data.

    I dug around in the Nginx metrics and found nginx_ingress_controller_requests. With some experimentation, I found this query:

    sum(rate(nginx_ingress_controller_requests{cluster="internal"}[2m])) by (instance)

    Looks easy, right? Basically, look at the sum of the rate of incoming requests by instance over a given window. Now, I could clean this up a little and add some rounding and such, but I really did not care about the exact number: I wanted to make sure that the requests across the instances were balanced effectively. I was not disappointed:

    Rate of Incoming Request

    Each line is an Nginx controller pod in my internal cluster. Visually, things look to be balanced quite nicely!

    Yet Another Migration

    With the move to RKE2, I made more work for myself: I need to migrate my clusters from RKE1 to RKE2. With Argo, the migration should be pretty easy, but still, more home lab work.

    I also came out of this with a laundry list of tech tips and other long form posts… I will be busy over the next few weeks.

  • Nothing says “Friday Night Fun” like crashing an RKE Cluster!

    Yes… I crashed my RKE clusters in a big way yesterday evening, and I spent a lot of time getting them back. I learned a few things in the process, and may have gotten the kickstart I need to investigate new Kubernetes flavors.

    It all started with an upgrade…

    All I wanted to do was go from Kubernetes 1.24.8 to 1.24.9. It seems a simple ask. I downloaded the new RKE command line tool (version 1.4.2), updated my cluster.yaml file, and ran rke up. The cluster upgraded without errors… but all the pods were in an error state. I detailed my findings in a Github issue, so I will not repeat them here. Thankfully, I was able to downgrade, and things started working.

    Sometimes, when I face these types of situations, I’ll stand up a new cluster to test the upgrade/downgrade process. I figured that would be a good idea, so I kicked off a new cluster provisioning script.

    Now, in recent upgrades, sometimes an upgrade of the node is required to make the Kubernetes upgrade run smoothly. So, on my internal cluster, I attempted the upgrade to 1.24.9 again, and then upgraded all of my nodes with an apt update && apt upgrade -y. That seemed to work, the pods came back online, so I figured I would try with production… This is where things went sideways.

    First, I “flipped” the order, and I upgraded the nodes first. Not only did this put all of the pods in an error state, but the upgrade took me to Docker version 23, which RKE doesn’t support. So there was no way to run rke up, even to downgrade to another version. I was, well, up a creek, as they say.

    I lucked out

    Luckily, earlier in the day I had provisioned three machines and created a small non-production cluster to test the issue I was seeing in RKE. So I had an empty Kubernetes 1.24.9 cluster running. With Argo, I was able to “transfer” the workloads from production to non-production simply by changing the ApplicationSet/Application target. The only caveat was that I had to copy files around on my NFS to get them in the correct place. I managed to get all this done and only register 1 hour and fifty four minutes of downtime, which, well, is not bad.

    Cleaning Up

    Now, the nodes for my new “production” cluster were named nonprod, and my OCD would never let that stand. So I provisioned three new nodes, created a new production cluster, and transferred workloads to the new cluster. Since I don’t have auto-prune set, when I changed the ApplicationSet/Application cluster to the new one, the old applications stayed running. This allowed me to get things set up on the new cluster and then cutover on the reverse proxy with no downtime.

    There was still the issue of the internal cluster. Sure, the pods were running, but on nodes with Docker 23, which is not supported. I had HOPED that I could provision a new set of nodes, add them to the cluster, and remove the old ones. I had no such luck.

    The RKE command line tool will not work on nodes with docker 23. So, using the nodes I provisioned, I created yet another new cluster, and went about the process of transferring my internal tools workloads to it.

    This was marginally more difficult, because I had to manually install Nginx Ingress and Argo CD using Helm before I could cutover to the new ArgoCD and let the new one manage the rest of the conversion. However, as all of my resources are declaratively defined in Git repositories, the move was much easier than reinstalling everything from scratch.

    Lessons Learned

    For me, RKE upgrades have been flaky the last few times. The best way to ensure success is to cycle new, fully upgraded nodes with Docker 20.10 into the cluster, remove the old ones, and then upgrade. With any other method, I have run into issues.

    Also, I will NEVER EVER run apt upgrade on my nodes again. I clearly do not have my application packages pinned correctly, which means I run the risk of getting an unsupported version of Docker.
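
    The simple guard here is to hold the Docker packages so a blanket upgrade cannot touch them (assuming Docker CE installed from Docker’s apt repository):

    sudo apt-mark hold docker-ce docker-ce-cli containerd.io
    # apt upgrade now skips these; release the hold later with apt-mark unhold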

    I am going to start investigating other Kubernetes flavors. I like the simplicity that RKE1 provides, but the response from the community is slow, if it comes at all. I may stand up a few small clusters just to see which ones make the most sense for the lab. I need something that is easy to keep updated, and RKE1 is not fitting that bill anymore.

  • What’s in a (Release) Name?

    With Rancher gone, one of my clusters was dedicated to running Argo and my standard cluster tools. Another cluster has now become home for a majority of the monitoring tools, including the Grafana/Loki/Mimir/Tempo stack. That second cluster was running a little hot in terms of memory and CPU. I had 6 machines running what 4-5 machines could be doing, so a consolidation effort was in order. The process went fairly smoothly, but a small hiccup threw me for a frustrating loop.

    Migrating Argo

    Having configured Argo to manage itself, I was hoping that a move to a new cluster would be relatively easy. I was not disappointed.

    For reference:

    • The ops cluster is the cluster that was running Rancher and is now just running Argo and some standard cluster tools.
    • The internal cluster is the cluster that is running my monitoring stack and is the target for Argo in this migration

    Migration Process

    • I provisioned two new nodes for my internal cluster, and added them as workers to the cluster using the rke command line tool.
    • To prevent problems, I updated Argo CD to the latest version and disabled auto-sync on all apps.
    • Within my ops-argo repository, I changed the target cluster of my internal applications to https://kubernetes.default.svc. Why? Argo treats the cluster it is installed on a bit differently than external clusters. Since I was moving Argo onto my internal cluster, the references to my internal cluster needed to change.
    • I exported my ArgoCD resources using the Argo CD CLI. I was not planning on using the import, however, because I wanted to see how this process would work.
    • I modified my external Nginx proxy to point my Argo address from the ops cluster to the internal cluster. This essentially locked everything out of the old Argo instance, but let it run on the cluster should I need to fall back.
    • Locally, I ran helm install argocd . -n argo --create-namespace from my ops-argo repository. This installed Argo on my internal cluster in the argo namespace. I grabbed the newly generated admin password and saved it in my password store.
    • Locally, I ran helm install argocd-apps . -n argo from my ops-argo repository. This installs the Argo Application and Project resources which serve as the “app of apps” to bootstrap the rest of the applications.
    • I re-added my nonproduction and production clusters to be managed by Argo using the argocd cluster add command. As the URLs for the cluster were the same as they were in the old cluster, the apps matched up nicely.
    • I re-added the necessary labels to each cluster’s ArgoCD secret. This allows the cluster generator to create applications for my external cluster tools. I detailed some of this in a previous article, and the ops-argo repository has the ApplicationSet definitions to help.

    And that was it… Argo found the existing applications on the other clusters, and after some re-sync to fix some labels and annotations, I was ready to go. Well, almost…

    What happened to my metrics?

    Some of my metrics went missing. Specifically, many of those that were coming from the various exporters in my Prometheus install. When I looked at Mimir, the metrics just stopped right around the time of my Argo upgrade and move. I checked the local Prometheus install, and noticed that those targets were not part of the service discovery page.

    I did not expect Argo’s upgrade to change much, so I did not take notice of the changes. So, I was left digging around in my Prometheus and ServiceMonitor resources to figure out why they were not showing up.

    Sadly, this took way longer than I anticipated. Why? There were a few reasons:

    1. I have a home improvement project running in parallel, which means I have almost no time to devote to researching these problems.
    2. I did not have Rancher! One of the things I did use Rancher for was to easily view the different resources in the cluster and compare. I finally remembered that I could use Lens for a GUI into my clusters, but sadly, it was a few days into the ordeal.
    3. Everything else was working, I was just missing a few metrics. On my own personal severity level, this was low. On the other hand, it drove my OCD crazy.

    Names are important

    After a few days of hunting around, I realized that the ServiceMonitor’s matchLabels did not match the labels on the Service objects themselves. Which was odd, because I had not changed anything in those charts, and they are all part of the Bitnami kube-prometheus Helm chart that I am using.

    As I poked around, I realized that I was using the releaseName property on the Argo Application spec. After some searching, I found the disclaimer on Argo’s website about using releaseName. As it turns out, this disclaimer describes exactly the issue that I was experiencing.

    I spent about a minute seeing if I could fix it while keeping the releaseName properties on my cluster tools, but quickly realized that the easiest path was to remove the property from the cluster tools that used it. That follows Argo’s guidance and keeps my configuration files much cleaner.
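
    The offending bit was a single field in the Application spec—something like this (value illustrative):

    source:
      helm:
        releaseName: kube-prometheus   # removed; Argo then defaults the release name
                                       # to the Application name, and the chart-generated
                                       # labels line back up with the ServiceMonitor selectors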

    After removing that override, the ServiceMonitor resources were able to find their associated services, and showed up in Prometheus’ service discovery. With that, now my OCD has to deal with a gap in metrics…

    Missing Node Exporter Metrics