Tag: Kubernetes

  • Simplifying Internal Routing

    Centralizing Telemetry with Linkerd Multi-Cluster

    Running multiple Kubernetes clusters is great until you realize your telemetry traffic is taking an unnecessarily complicated path. Each cluster had its own Grafana Alloy instance dutifully collecting metrics, logs, and traces—and each one was routing through an internal Nginx reverse proxy to reach the centralized observability platform (Loki, Mimir, and Tempo) running in my internal cluster.

    This worked, but it had that distinct smell of “technically functional” rather than “actually good.” Traffic was staying on the internal network (thanks to a shortcut DNS entry that bypassed Cloudflare), but why route through an Nginx proxy when the clusters could talk directly to each other? Why maintain those external service URLs when all my clusters are part of the same infrastructure?

    Linkerd multi-cluster seemed like the obvious answer for establishing direct cluster-to-cluster connections, but the documentation leaves a lot unsaid when you’re dealing with on-premises clusters without fancy load balancers. Here’s how I made it work.

    The Problem: Telemetry Taking the Scenic Route

    My setup looked like this:

    Internal cluster: Running Loki, Mimir, and Tempo behind an Nginx gateway

    Production cluster: Grafana Alloy sending telemetry to loki.mattgerega.net, mimir.mattgerega.net, etc.

    Nonproduction cluster: Same deal, different tenant ID

    Every metric, log line, and trace span was leaving the cluster, hitting the Nginx reverse proxy, and finally making it to the monitoring services—which were running in a cluster on the same physical network. The inefficiency was bothering me more than it probably should have.

    This meant:

    – An unnecessary hop through the Nginx proxy layer

    – Extra TLS handshakes between internal services that added no real security value

    – DNS resolution for external service names when direct cluster DNS would suffice

    – One more component in the path that could cause issues

    The Solution: Hub-and-Spoke with Linkerd Multi-Cluster

    Linkerd’s multi-cluster feature does exactly what I needed: it mirrors services from one cluster into another, making them accessible as if they were local. The service mesh handles all the mTLS authentication, routing, and connection management behind the scenes. From the application’s perspective, you’re just calling a local Kubernetes service.

    For my setup, a hub-and-spoke topology made the most sense. The internal cluster acts as the hub—it runs the Linkerd gateway and hosts the actual observability services (Loki, Mimir, and Tempo). The production and nonproduction clusters are spokes—they link to the internal cluster and get mirror services that proxy requests back through the gateway.

    The beauty of this approach is that only the hub needs to run a gateway. The spoke clusters just run the service mirror controller, which watches for exported services in the hub and automatically creates corresponding proxy services locally. No complex mesh federation, no VPN tunnels, just straightforward service-to-service communication over mTLS.

    Gateway Mode vs. Flat Network

    (Spoiler: Gateway Mode Won)

    Linkerd offers two approaches for multi-cluster communication:

    Flat Network Mode: Assumes pod networks are directly routable between clusters. Great if you have that. I don’t. My three clusters each have their own pod CIDR ranges with no interconnect.

    Gateway Mode: Routes cross-cluster traffic through a gateway pod that handles the network translation. This is what I needed, but it comes with some quirks when you’re running on-premises without a cloud load balancer.

    The documentation assumes you’ll use a LoadBalancer service type, which automatically provisions an external IP. On-premises? Not so much. I went with NodePort instead, exposing the gateway on port 30143.
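
    For reference, the gateway just sits behind a plain NodePort Service. A minimal sketch of what that rendered Service ends up looking like (the name, selector, and port names here are illustrative; the ports match the values below):

    apiVersion: v1
    kind: Service
    metadata:
      name: linkerd-gateway
      namespace: linkerd-multicluster
    spec:
      type: NodePort
      selector:
        app: linkerd-gateway
      ports:
        - name: mc-gateway          # cross-cluster data-plane traffic
          port: 4143
          nodePort: 30143
        - name: mc-probe            # health endpoint polled by remote service mirrors
          port: 4191
          nodePort: 30191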

    The Configuration: Getting the Helm Values Right

    Here’s what the internal cluster’s Linkerd multi-cluster configuration looks like:

    linkerd-multicluster:
      gateway:
        enabled: true
        port: 4143
        serviceType: NodePort
        nodePort: 30143
        probe:
          port: 4191
          nodePort: 30191
    
      # Grant access to service accounts from other clusters
      remoteMirrorServiceAccountName: linkerd-service-mirror-remote-access-production,linkerd-service-mirror-remote-access-nonproduction

    And for the production/nonproduction clusters:

    linkerd-multicluster:
      gateway:
        enabled: false  # No gateway needed here
    
      remoteMirrorServiceAccountName: linkerd-service-mirror-remote-access-in-cluster-local

    The Link: Connecting Clusters Without Auto-Discovery

    Creating the cluster link was where things got interesting. The standard command assumes you want auto-discovery:

    linkerd multicluster link --cluster-name internal --gateway-addresses internal.example.com:30143

    But that command tries to do DNS lookups on the combined hostname+port string, which fails spectacularly. The fix was simple once I found it:

    linkerd multicluster link \
      --cluster-name internal \
      --gateway-addresses tfx-internal.gerega.net \
      --gateway-port 30143 \
      --gateway-probe-port 30191 \
      --api-server-address https://cp-internal.gerega.net:6443 \
      --context=internal | kubectl apply -f - --context=production

    Separating --gateway-addresses and --gateway-port made all the difference.

    I used DNS (tfx-internal.gerega.net) instead of hard-coded IPs for the gateway address. This is an internal DNS entry that round-robins across all agent node IPs in the internal cluster. The key advantage: when I cycle nodes (stand up new ones and destroy old ones), the DNS entry is maintained automatically. No manual updates to cluster links, no stale IP addresses, no coordination headaches—the round-robin DNS just picks up the new node IPs and drops the old ones.
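
    Under the hood, that command just renders a Link custom resource (plus a credentials Secret) that the pipe applies to the spoke cluster. Abridged, and going from memory of the Link CRD, so treat the field details as approximate:

    apiVersion: multicluster.linkerd.io/v1alpha1
    kind: Link
    metadata:
      name: internal
      namespace: linkerd-multicluster
    spec:
      targetClusterName: internal
      clusterCredentialsSecret: cluster-credentials-internal
      gatewayAddress: tfx-internal.gerega.net    # round-robin DNS across the agent nodes
      gatewayPort: "30143"
      probeSpec:
        path: /ready
        period: 3s
        port: "30191"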

    Service Export: Making Services Visible Across Clusters

    Linkerd doesn’t automatically mirror every service. You have to explicitly mark which services should be exported using the mirror.linkerd.io/exported: "true" label.

    For the Loki gateway (and similarly for Mimir and Tempo):

    gateway:
      service:
        labels:
          mirror.linkerd.io/exported: "true"

    Once the services were exported, they appeared in the production and nonproduction clusters with an `-internal` suffix:

    loki-gateway-internal.monitoring.svc.cluster.local

    mimir-gateway-internal.monitoring.svc.cluster.local

    tempo-gateway-internal.monitoring.svc.cluster.local

    Grafana Alloy: Switching to Mirrored Services

    The final piece was updating Grafana Alloy’s configuration to use the mirrored services instead of the external URLs. Here’s the before and after for Loki:

    Before:

    loki.write "default" {
      endpoint {
        url = "https://loki.mattgerega.net/loki/api/v1/push"
        tenant_id = "production"
      }
    }

    After:

    loki.write "default" {
      endpoint {
        url = "http://loki-gateway-internal.monitoring.svc.cluster.local/loki/api/v1/push"
        tenant_id = "production"
      }
    }

    No more TLS, no more public DNS, no more reverse proxy hops. Just a direct connection through the Linkerd gateway.

    But wait—there’s one more step.

    The Linkerd Injection Gotcha

    Grafana Alloy pods need to be part of the Linkerd mesh to communicate with the mirrored services. Without the Linkerd proxy sidecar, the pods can’t authenticate with the gateway’s mTLS requirements.

    This turned into a minor debugging adventure because I initially placed the `podAnnotations` at the wrong level in the Helm values. My Grafana Alloy chart is a wrapper around the official chart, which means the structure is:

    alloy:
      controller:  # Not alloy.alloy!
        podAnnotations:
          linkerd.io/inject: enabled
      alloy:
        # ... other config

    Once that was fixed and the pods restarted, they came up with 3 containers instead of 2:

    – `linkerd-proxy` (the magic sauce)

    – `alloy` (the telemetry collector)

    – `config-reloader` (for hot config reloads)

    Checking the gateway logs confirmed traffic was flowing:

    INFO inbound:server:gateway{dst=loki-gateway.monitoring.svc.cluster.local:80}: Adding endpoint addr=10.42.5.4:8080
    INFO inbound:server:gateway{dst=mimir-gateway.monitoring.svc.cluster.local:80}: Adding endpoint addr=10.42.9.18:8080
    INFO inbound:server:gateway{dst=tempo-gateway.monitoring.svc.cluster.local:4317}: Adding endpoint addr=10.42.10.13:4317

    Known Issues: Probe Health Checks

    There’s one quirk worth mentioning: the multi-cluster probe health checks don’t work in NodePort mode. The service mirror controller tries to check the gateway’s health endpoint and reports it as unreachable, even though service mirroring works perfectly.

    From what I can tell, this is because the health check endpoint expects to be accessed through the gateway service, but NodePort doesn’t provide the same service mesh integration as a LoadBalancer. The practical impact? None. Services mirror correctly, traffic routes successfully, mTLS works. The probe check just complains in the logs.

    What I Learned

    1. Gateway mode is essential for non-routable pod networks. If your clusters don’t have a CNI that supports cross-cluster routing, gateway mode is the way to go.

    2. NodePort works fine for on-premises gateways. You don’t need a LoadBalancer if you’re willing to manage DNS.

    3. DNS beats hard-coded IPs. Using `tfx-internal.gerega.net` means I can recreate nodes without updating cluster links.

    4. Service injection is non-negotiable. Pods must be part of the Linkerd mesh to access mirrored services. No injection, no mTLS, no connection.

    5. Helm values hierarchies are tricky. Always check the chart templates when podAnnotations aren’t applying. Wrapper charts add extra nesting.

    The Result

    Telemetry now flows directly from production and nonproduction clusters to the internal observability stack through Linkerd’s multi-cluster gateway—all authenticated via mTLS, bypassing the Nginx reverse proxy entirely.

    I didn’t reduce the number of monitoring stacks (each cluster still runs Grafana Alloy for collection), but I simplified the routing by using direct cluster-to-cluster connections instead of going through the Nginx proxy layer. No more proxy hops. No more external service DNS. Just three Kubernetes clusters talking to each other the way they should have been all along.

    The full configuration is in the ops-argo and ops-internal-cluster repositories, managed via ArgoCD ApplicationSets. Because if there’s one thing I’ve learned, it’s that GitOps beats manual kubectl every single time.

  • Modernizing the Gateway

    From NGINX Ingress to Envoy Gateway

    As with any good engineer, I cannot leave well enough alone. Over the past week, I’ve been working through a significant infrastructure modernization across my home lab clusters – migrating from NGINX Ingress to Envoy Gateway and implementing the Kubernetes Gateway API. This also involved some necessary housekeeping with chart updates and a shift to Server-Side Apply for all ArgoCD-managed resources.

    Why Change?

    The timing couldn’t have been better. In November 2025, Kubernetes SIG Network and the Security Response Committee announced that Ingress NGINX will be retired in March 2026. The project has struggled with insufficient maintainer support, security concerns around configuration snippets, and accumulated technical debt. After March 2026, there will be no further releases, security patches, or bug fixes.

    The announcement strongly recommends migrating to the Gateway API, described as “the modern replacement for Ingress.” This validated what I’d already been considering – the Gateway API provides a more standardized, vendor-neutral approach with better separation of concerns between infrastructure operators and application developers.

    Envoy Gateway, being a CNCF project built on the battle-tested Envoy proxy, seemed like a natural choice for this migration. Plus, it gave me an excuse to finally move off Traefik, which was… well, let’s just say it was time for a change.

    The Migration Journey

    The migration happened in phases across my ops-argo, ops-prod-cluster, and ops-nonprod-cluster repositories. Here’s what changed:

    Phase 1: Adding Envoy Gateway

    I started by adding Envoy Gateway as a cluster tool, complete with its own ApplicationSet that deploys to clusters labeled with spydersoft.io/envoy-gateway: "true". The deployment includes:

    • GatewayClass and Gateway resources: Defined a main gateway that handles traffic routing
    • EnvoyProxy configuration: Set up with a static NodePort service for consistent external access
    • ClientTrafficPolicy: Configured to properly handle forwarded headers – crucial for preserving client IP information through the proxy chain

    The Envoy Gateway deployment lives in the envoy-gateway-system namespace and exposes services via NodePort 30080 and 30443, making it easy to integrate with my existing network setup.
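
    The core resources are small. Here’s a trimmed sketch of the GatewayClass, Gateway, and EnvoyProxy resources I’m describing (the certificate name is a placeholder, and pinning the exact NodePorts takes a bit of extra configuration on the generated Service):

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: envoy-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
      parametersRef:
        group: gateway.envoyproxy.io
        kind: EnvoyProxy
        name: proxy-config
        namespace: envoy-gateway-system
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: main
      namespace: envoy-gateway-system
    spec:
      gatewayClassName: envoy-gateway
      listeners:
        - name: http
          protocol: HTTP
          port: 80
          allowedRoutes:
            namespaces:
              from: All
        - name: https
          protocol: HTTPS
          port: 443
          tls:
            mode: Terminate
            certificateRefs:
              - name: wildcard-tls    # placeholder certificate Secret
          allowedRoutes:
            namespaces:
              from: All
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyProxy
    metadata:
      name: proxy-config
      namespace: envoy-gateway-system
    spec:
      provider:
        type: Kubernetes
        kubernetes:
          envoyService:
            type: NodePort    # exposed on 30080/30443 in my setup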

    Phase 2: Migrating Applications to HTTPRoute

    This was the bulk of the work. Each application needed its Ingress resource replaced with an HTTPRoute. The new Gateway API resources are much cleaner. For example, my blog (www.mattgerega.com) went from an Ingress definition to this:

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: wp-mattgerega
      namespace: sites
    spec:
      parentRefs:
        - name: main
          namespace: envoy-gateway-system
      hostnames:
        - www.mattgerega.com
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /
          backendRefs:
            - name: wp-mattgerega-wordpress
              port: 80
    

    Much more declarative and expressive than the old Ingress syntax.

    I migrated several applications across both production and non-production clusters:

    • Gravitee API Management
    • ProGet (my package management system)
    • n8n and Node-RED instances
    • Linkerd-viz dashboard
    • ArgoCD (which also got a GRPCRoute for its gRPC services; see the sketch after this list)
    • Identity Server (across test and stage environments)
    • Tech Radar
    • Home automation services (UniFi client and IP manager)
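
    On the ArgoCD item: the gRPC side is handled with a GRPCRoute attached to the same gateway. The shape is roughly this (the hostname and backend port are illustrative; the ArgoCD docs’ Gateway API guidance is the authoritative reference):

    apiVersion: gateway.networking.k8s.io/v1
    kind: GRPCRoute
    metadata:
      name: argocd-grpc
      namespace: argocd
    spec:
      parentRefs:
        - name: main
          namespace: envoy-gateway-system
      hostnames:
        - argocd-grpc.example.com    # illustrative hostname for CLI/gRPC access
      rules:
        - backendRefs:
            - name: argocd-server
              port: 80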

    Phase 3: Removing the Old Guard

    Once everything was migrated and tested, I removed the old ingress controller configurations. This cleanup happened across all three repositories:

    ops-prod-cluster:

    • Removed all Traefik configuration files
    • Cleaned up traefik-gateway.yaml and traefik-middlewares.yaml

    ops-nonprod-cluster:

    • Removed Traefik configurations
    • Deleted the RKE2 ingress NGINX HelmChartConfig (rke2-ingress-nginx-config.yaml)

    The cluster-resources directories got significantly cleaner with this cleanup. Good riddance to configuration sprawl.

    Phase 4: Chart Maintenance and Server-Side Apply

    While I was in there making changes, I also:

    • Bumped several Helm charts to their latest versions:
      • ArgoCD: 9.1.5 → 9.1.7
      • External Secrets: 1.1.0 → 1.1.1
      • Linkerd components: 2025.11.3 → 2025.12.1
      • Grafana Alloy: 1.4.0 → 1.5.0
      • Common chart dependency: 4.4.0 → 4.5.0
      • Redis deployments updated across production and non-production
    • Migrated all clusters to use Server-Side Apply (ServerSideApply=true in the syncOptions):
      • All cluster tools in ops-argo
      • Production application sets (external-apps, production-apps, cluster-resources)
      • Non-production application sets (external-apps, cluster-resources)

    This is a better practice for ArgoCD as it allows Kubernetes to handle three-way merge patches instead of client-side strategic merge, reducing conflicts and improving sync reliability.
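
    In Application terms, the change amounts to one line in syncOptions; roughly:

    syncPolicy:
      syncOptions:
        - CreateNamespace=true
        - ServerSideApply=true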

    Lessons Learned

    Gateway API is ready for production: The migration was surprisingly smooth. The Gateway API resources are well-documented and intuitive. With NGINX Ingress being retired, now’s the time to make the jump.

    HTTPRoute vs. Ingress: HTTPRoute is more expressive and allows for more sophisticated routing rules. The explicit parentRefs concept makes it clear which gateway handles which routes.

    Server-Side Apply everywhere: Should have done this sooner. The improved conflict handling makes ArgoCD much more reliable, especially when multiple controllers touch the same resources.

    Envoy’s configurability: The EnvoyProxy custom resource gives incredible control over the proxy configuration without needing to edit ConfigMaps or deal with annotations.

    Multi-cluster consistency: Making these changes across production and non-production environments simultaneously kept everything aligned and reduced cognitive overhead when switching between environments.

    Current Status

    All applications across all clusters are now running through Envoy Gateway with the Gateway API. Traffic is flowing correctly, TLS is terminating properly, and I’ve removed all the old ingress-related configuration from both production and non-production environments.

    The clusters are more standardized, the configuration is cleaner, and I’m positioned to take advantage of future Gateway API features like traffic splitting and more advanced routing capabilities. More importantly, I’m ahead of the March 2026 retirement deadline with plenty of time to spare.

    Now, the real question: what am I going to tinker with next?

  • ArgoCD panicked a little…

    I ran into an odd situation last week with ArgoCD, and it took a bit of digging to figure it out. Hopefully this helps someone else along the way.

    Whatever you do, don’t panic!

    Well, unless of course you are ArgoCD.

    I have a small Azure DevOps job that runs nightly and attempts to upgrade some of the Helm charts that I use to deploy external tools. This includes things like Grafana, Loki, Mimir, Tempo, ArgoCD, External Secrets, and many more. This job deploys the changes to my GitOps repositories, and if there are changes, I can manually sync.

    Why not auto-sync, you might ask? Visibility, mostly. I like to see what changes are being applied, in case there is something bigger in the changes that needs my attention. I also like to “be there” if something breaks, so I can rollback quickly.

    Last week, while upgrading Grafana and Tempo, ArgoCD started throwing the following error on sync:

    Recovered from panic: runtime error: invalid memory address or nil pointer

    A quick trip to Google produced a few different results, but nothing immediately apparent. One particular issue mentioned a problem with out-of-date resources (an old apiVersion). Let’s put a pin in that.

    Nothing was jumping out, and my deployments were still working. I had a number of other things on my plate, so I let this slide for a few days.

    Versioning….

    When I finally got some time to dig into this, I figured I would pull on that apiVersion thread and see what shook loose. Unfortunately, since there is no good indication of which resource is causing the panic, it was luck of the draw as to whether I found the offender. This time, I was lucky.

    My ExternalSecret resources were still using alpha apiVersions, so my first thought was to update them to v1. Lo and behold, that fixed the two charts that were failing.
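
    The change itself was just an apiVersion bump on each ExternalSecret; a sketch with a hypothetical secret name:

    apiVersion: external-secrets.io/v1    # was an alpha version
    kind: ExternalSecret
    metadata:
      name: my-app-secrets                # hypothetical
      namespace: my-app
    spec:
      secretStoreRef:
        name: vault-backend               # hypothetical ClusterSecretStore
        kind: ClusterSecretStore
      target:
        name: my-app-secrets
      data:
        - secretKey: DB_PASSWORD
          remoteRef:
            key: my-app/database          # hypothetical Vault path
            property: password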

    This, however, leads to a bigger issue: if ArgoCD is not going to tell me when a resource has an out-of-date apiVersion, I need to figure out how to validate these resources before I commit the changes. I’ll put this on my ever-growing to-do list.
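
    One option I’m eyeing for that to-do item (an assumption on my part, nothing I’ve wired up yet) is a pipeline step that runs a deprecation scanner such as FairwindsOps’ pluto over the rendered manifests. Pluto covers deprecated core Kubernetes APIs out of the box; CRD groups like external-secrets.io would need a custom deprecation list. As an Azure DevOps step, that might look like:

    steps:
      - script: |
          # scan rendered manifests for deprecated or removed apiVersions
          pluto detect-files -d ./rendered-manifests
        displayName: Check for deprecated apiVersions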

  • Moving to Ubuntu 24.04

    I have a small home lab running a few Kubernetes clusters, and a good bit of automation to handle provisioning servers for those clusters. All of my Linux VMs are based on Ubuntu 22.04; I prefer to stick with LTS releases for stability and compatibility.

    As April turns into July (missed some time there), I figured Ubuntu’s latest LTS (24.04) has matured to the point that I could start the process of updating my VMs to the new version.

    Easier than Expected

    In my previous move from 20.04 to 22.04, there were some changes to the automated installers for 22.04 that forced me down the path of testing my packer provisioning with the 22.04 ISOs. I expected similar changes with 24.04. I was pleasantly surprised when I realized that my existing scripts should work well with the 24.04 ISOs.

    I did spend a little time updating the Azure DevOps pipeline that builds a base image so that it supports building both a 22.04 and a 24.04 image. I want to keep the option of using the 22.04 images, should I find a problem with 24.04.
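
    The pipeline change was mostly just parameterizing the Ubuntu version; a stripped-down sketch (the parameter and variable names are hypothetical, not my actual pipeline):

    parameters:
      - name: ubuntuImages
        type: object
        default:
          - name: jammy
            version: "22.04"
          - name: noble
            version: "24.04"

    stages:
      - ${{ each image in parameters.ubuntuImages }}:
          - stage: build_${{ image.name }}
            jobs:
              - job: packer_build
                steps:
                  - script: |
                      packer build -var "ubuntu_version=${{ image.version }}" ./base-image
                    displayName: Build Ubuntu ${{ image.version }} base image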

    Migrating Cluster Nodes

    With a base image provisioned, I followed my normal process for upgrading cluster nodes on my non-production cluster. There were a few hiccups, mostly around some of my automated scripts that needed to have the appropriate settings to set hostnames correctly.

    Again, other than some script debugging, the process worked with minimal changes to my automation scripts and my provisioning projects.

    Azure DevOps Build Agent?

    Perhaps in a few months. I use the GitHub runner images as a base for my self-hosted agents, but there are some changes that need manual review. I destroy my Azure DevOps build agent weekly and generate a new one, and that’s a process that I need to make sure continues to work through any changes.

    The issue is typically time: the build agents take a few hours to provision because of all the tools that are installed. Testing that takes time, so I have to plan ahead. Plus, well, it is summertime, and I’d much rather be in the pool than behind the desk.

  • Automating Grafana Backups

    After a few data loss events, I took the time to automate my Grafana backups.

    A bit of instability

    It has been almost a year since I moved to a MySQL backend for Grafana. In that year, I’ve gotten a corrupted MySQL database twice now, forcing me to restore from a backup. I’m not sure if it is due to my setup or bad luck, but twice in less than a year is too much.

    In my previous post, I mentioned the Grafana backup utility as a way to preserve this data. My short-sightedness prevented me from automating those backups, however, so I suffered some data loss. After the most recent event, I revisited the backup tool.

    Keep your friends close…

    My first thought was to simply write a quick Azure DevOps pipeline to pull the tool down, run a backup, and copy it to my SAN. I would also have had to include some scripting to clean up old backups.

    As I read through the grafana-backup-tool documentation, though, I came across examples of running the tool in Kubernetes as a CronJob. This presented a unique opportunity: configure the backup job as part of the Helm chart.

    What would that look like? Well, I do not install any external charts directly. They are configured as dependencies for charts of my own. Now, usually, that just means a simple values file that sets the properties on the dependency. In the case of Grafana, though, I’ve already used this functionality to add two dependent charts (Grafana and MySQL) to create one larger application.

    This setup also allows me to add additional templates to the Helm chart to create my own resources. I added two new resources to this chart:

    1. grafana-backup-cron – A definition for the cronjob, using the ysde/grafana-backup-tool image.
    2. grafana-backup-secret-es – An ExternalSecret definition to pull secrets from Hashicorp Vault and create a Secret for the job.

    Since this is all built as part of the Grafana application, the secrets for Grafana were already available. I went so far as to add a section in the values file for the backup. This allowed me to enable/disable the backup and update the image tag easily.
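
    Trimmed down, the CronJob template looks roughly like this. The schedule and Secret name are illustrative, and all of the Grafana and S3 connection settings come in from the ExternalSecret-managed Secret rather than being hard-coded:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: grafana-backup-cron
      namespace: monitoring
    spec:
      schedule: "0 2 * * *"              # daily backup
      successfulJobsHistoryLimit: 1
      failedJobsHistoryLimit: 3
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: grafana-backup
                  image: ysde/docker-grafana-backup-tool    # image for the ysde/grafana-backup-tool project; tag set via values
                  envFrom:
                    - secretRef:
                        name: grafana-backup-secret          # created by the ExternalSecret above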

    Where to store it?

    The other enhancement I noticed in the backup tool was the ability to store files in S3 compatible storage. In fact, their example showed how to connect to a MinIO instance. As fate would have it, I have a MinIO instance running on my SAN already.

    So I configured a new bucket in my MinIO instance, added a new access key, and configured those secrets in Vault. After committing those changes and synchronizing in ArgoCD, the new resources were there and ready.

    Can I test it?

    Yes I can. Google, once again, pointed me to a way to create a Job from a CronJob:

    kubectl create job --from=cronjob/<cronjob-name> <job-name> -n <namespace-name>

    I ran the above command to create a test job. And, voilà, I have backup files in MinIO!

    Cleaning up

    Unfortunately, there doesn’t seem to be a retention setting in the backup tool. It looks like I’m going to have to write some code to clean up my Grafana backups bucket, especially since I have daily backups scheduled. Either that, or look at this issue and see if I can add it to the tool. Maybe I’ll brush off my Python skills…

  • My Introduction to Kubernetes NetworkPolicy

    The Bitnami Redis Helm chart has thrown me a curve ball over the last week or so, and made me look at Kubernetes NetworkPolicy resources.

    Redis Chart Woes

    Bitnami seems to be updating their charts to include default NetworkPolicy resources. While I don’t mind this, a jaunt through their open issues suggests that it has not been a smooth transition.

    The redis chart’s initial release of NetworkPolicy objects broke the metrics container, since the default NetworkPolicy didn’t add the metrics port to allowed ingress ports.

    So I pinned the old chart version until a fixed Redis chart was available.

    And now, Connection Timeouts

    Once the update was released, I rolled out the new version of Redis. The containers came up, and I didn’t really think twice about it. Until, that is, I decided to do some updates to both my applications and my Kubernetes nodes.

    I upgraded some of my internal applications to .NET 8. This caused all of them to restart and, in the process, pick up their linkerd-proxy sidecars. I also started cycling the nodes on my internal cluster. When it came time to call my UniFi IP Manager API to delete an old assigned IP, I got an internal server error.

    A quick check of the logs showed that the pod’s Redis connection was failing. Odd, I thought, since most other connections have been working fine, at least through last week.

    After a few different Google searches, I came across this section in the Linkerd.io documentation. As it turns out, when you use NetworkPolicy resources with opaque ports (like Redis), you have to make sure that Linkerd’s inbound proxy port (which defaults to 4143) is also allowed in the NetworkPolicy.

    Adding the Linkerd port to the extraIngress section in the Redis Helm chart worked wonders. With that section in place, connectivity was restored and I could go about my maintenance tasks.
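
    For reference, the values change was small; going from memory of the Bitnami values layout, it was along these lines (nested under the redis dependency key in my wrapper chart):

    networkPolicy:
      enabled: true
      extraIngress:
        # allow traffic arriving on linkerd-proxy's inbound port (default 4143),
        # which is what meshed clients actually hit for opaque ports like Redis
        - ports:
            - port: 4143
              protocol: TCP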

    NetworkPolicy for all?

    Maybe. This is my first exposure to them, so I would like to understand how they operate and what best practices are for such things. In the meantime, I’ll be a little more wary when I see NetworkPolicy resources pop up in external charts.

  • A Tale of Two Proxies

    I am working on building a set of small reference applications to demonstrate some of the patterns and practices to help modernize cloud applications. In configuring all of this in my home lab, I spent at least 3 hours fighting a problem that turned out to be a configuration issue.

    Backend-for-Frontend Pattern

    I will get into more details when I post the full application, but I am trying to build out a SPA with a dedicated backend API that would host the SPA and take care of authentication. As is typically the case, I was able to get all of this working on my local machine, including the necessary proxying of calls via the SPA’s development server (again, more on this later).

    At some point, I had two containers ready to go: a BFF container hosting the SPA and the dedicated backend, and an API container hosting a data service. I felt ready to deploy to the Kubernetes cluster in my lab.

    Let the pain begin!

    I have enough samples within Helm/Helmfile that getting the items deployed was fairly simple. After fiddling with the settings of the containers, things were running well in the non-authenticated mode.

    However, when I clicked login, the following happened:

    1. I was redirected to my oAuth 2.0/OIDC provider.
    2. I entered my username/password
    3. I was redirected back to my application
    4. I got a 502 Bad Gateway screen

    502! But why? I consulted Google and found any number of articles indicating that, in the authentication flow, Nginx’s default header size limits are too small for what comes back on the redirect. So, consulting the Nginx configuration documentation, I changed the Nginx configuration in my reverse proxy to allow for larger headers.

    No luck. Weird. In the spirit of true experimentation (change one thing at a time), I backed those changes out and tried changing the configuration of my Nginx Ingress controller. No luck. So what’s going on?

    Too Many Cooks

    My current implementation looks like this:

    flowchart TB
        A[UI] --UI Request--> B(Nginx Reverse Proxy)
        B --> C("Kubernetes Ingress (Nginx)")
        C --> D[UI Pod]
    

    There are two Nginx instances in the path of all my traffic: an instance outside of the cluster that serves as my reverse proxy, and an Nginx ingress controller that serves as the reverse proxy within the cluster.

    I tried changing both separately. Then I tried changing both at the same time. And I was still seeing this error. As it turns out, well, I was being passed some bad data as well.

    Be careful what you read on the Internet

    As it turns out, the issue was the difference in configuration between the two Nginx instances and some bad configuration values that I got from old internet articles.

    Reverse Proxy Configuration

    For the Nginx instance running on Ubuntu, I added the following to my nginx.conf file under the http section:

            proxy_buffers 4 512k;
            proxy_buffer_size 256k;
            proxy_busy_buffers_size 512k;
            client_header_buffer_size 32k;
            large_client_header_buffers 4 32k;

    Nginx Ingress Configuration

    I am running RKE2 clusters, so configuring Nginx involves a HelmChartConfig resource being created in the kube-system namespace. My cluster configuration looks like this:

    apiVersion: helm.cattle.io/v1
    kind: HelmChartConfig
    metadata:
      name: rke2-ingress-nginx
      namespace: kube-system
    spec:
      valuesContent: |-
        controller:
          kind: DaemonSet
          daemonset:
            useHostPort: true
          config:
            use-forwarded-headers: "true"
            proxy-buffer-size: "256k"
            proxy-buffers-number: "4"
            client-header-buffer-size: "256k"
            large-client-header-buffers: "4 16k"
            proxy-body-size: "10m"

    The combination of both of these settings got my redirects to work without the 502 errors.

    Better living through logging

    One of the things I fought with on this was finding the appropriate logs to see where the errors were occurring. I’m exporting my reverse proxy logs into Loki using a Promtail instance that listens on a syslog port. So I am “getting” the logs into Loki, but I couldn’t FIND them.

    I forgot about the syslog facility: I have the access logs sending as local5, but I configured the error logs without pointing them at a facility, and by default they go to local7.

    Once I found the logs I was able to diagnose the issue, but I spent a lot of time browsing in Loki looking for those logs.

  • Re-configuring Grafana Secrets

    I recently fixed some synchronization issues that had been silently plaguing some of the monitoring applications I had installed, including my Loki/Grafana/Tempo/Mimir stack. Now that the applications are being updated, I ran into an issue with the latest Helm chart’s handling of secrets.

    Sync Error?

    After I made the change to fix synchronization of the Helm charts, I went to sync my Grafana chart, but received a sync error:

    Error: execution error at (grafana/charts/grafana/templates/deployment.yaml:36:28): Sensitive key 'database.password' should not be defined explicitly in values. Use variable expansion instead.

    I certainly didn’t change anything in those files, and I am already using variable expansion in the values.yaml file anyway. What does that mean? Basically, in the values.yaml file, I used ${ENV_NAME} in areas where I had a secret value, and told Grafana to expand environment variables into the configuration.

    The latest version of the Grafana chart doesn’t seem to like this: it treats ANY explicit value in a sensitive field as bad. A search of the Grafana Helm chart repo’s issues list yielded someone with a similar issue and a comment with a link to another comment that is the recommended solution.

    Same Secret, New Name

    After reading through the comment’s suggestion and Grafana’s documentation on overriding configuration with environment variables, I realized the fix was pretty easy.

    I already had a Kubernetes secret being populated from Hashicorp Vault with my secret values. I also already had envFromSecret set in the values.yaml to instruct the chart to use my secret. And, through some dumb luck, two of the three values were already named using the standards in Grafana’s documentation.

    So the “fix” was to simply remove the secret expansions from the values.yaml file and rename one of the secretKey values so that it matched Grafana’s environment variable template. You can see the diff of the change in my GitHub repository.
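
    Concretely, the renamed key just has to follow Grafana’s GF_<SECTION>_<KEY> environment variable convention. The ExternalSecret ends up shaped something like this (the store and Vault path names are placeholders):

    apiVersion: external-secrets.io/v1beta1   # or v1, depending on your External Secrets version
    kind: ExternalSecret
    metadata:
      name: grafana-secrets
      namespace: monitoring
    spec:
      secretStoreRef:
        name: vault-backend                   # placeholder store name
        kind: ClusterSecretStore
      target:
        name: grafana-secrets                 # referenced by envFromSecret in values.yaml
      data:
        - secretKey: GF_DATABASE_PASSWORD     # maps to the [database] password setting
          remoteRef:
            key: grafana/database             # placeholder Vault path
            property: password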

    With that change, the Helm chart generated correctly, and once Argo had the changes in place, everything was up and running.

  • Synced, But Not: ArgoCD Differencing Configuration

    Some of the charts in my Loki/Grafana/Tempo/Mimir stack have an odd habit of not updating correctly in ArgoCD. I finally got tired of it and fixed it… I’m just not 100% sure how.

    Ignoring Differences

    At some point in the past, I had customized a few of my Application objects with ignoreDifferences settings. The intent was to tell ArgoCD to ignore fields that are managed by other controllers and can drift from the chart definition.

    Like what, you might ask? Well, the external-secrets chart generates its own caBundle and sets properties on a ValidatingWebhookConfiguration object. Obviously, that’s managed by the controller, and I don’t want to mess with it. However, I also don’t want ArgoCD to report that chart as Out of Sync all the time.

    So, as an example, my external-secrets application looks like this:

    project: cluster-tools
    source:
      repoURL: 'https://github.com/spydersoft-consulting/ops-argo'
      path: cluster-tools/tools/external-secrets
      targetRevision: main
    destination:
      server: 'https://kubernetes.default.svc'
      namespace: external-secrets
    syncPolicy:
      syncOptions:
        - CreateNamespace=true
        - RespectIgnoreDifferences=true
    ignoreDifferences:
      - group: '*'
        kind: '*'
        managedFieldsManagers:
          - external-secrets

    And that worked just fine. But with my monitoring stack, well, I think I made a boo-boo.

    Ignoring too much

    When I looked at the application differences for some of my Grafana resources, I noticed that the live vs desired image was wrong. My live image was older than the desired one, and yet, the application wasn’t showing as out of sync.

    At this point, I suspected ignoreDifferences was the issue, so I looked at the Application manifest. For some reason, my monitoring applications had an Application manifest that looked like this:

    project: external-apps
    source:
      repoURL: 'https://github.com/spydersoft-consulting/ops-internal-cluster'
      path: external/monitor/grafana
      targetRevision: main
      helm:
        valueFiles:
          - values.yaml
        version: v3
    destination:
      server: 'https://kubernetes.default.svc'
      namespace: monitoring
    syncPolicy:
      syncOptions:
        - RespectIgnoreDifferences=true
    ignoreDifferences:
      - group: "*"
        kind: "*"
        managedFieldsManagers:
        - argocd-controller
      - group: '*'
        kind: StatefulSet
        jsonPointers:
          - /spec/persistentVolumeClaimRetentionPolicy
          - /spec/template/metadata/annotations/'kubectl.kubernetes.io/restartedAt'

    Notice the part where I am ignoring managed fields from argocd-controller. I have no idea why I added that, but it looks a little “all inclusive” for my tastes, and it was ONLY present in the ApplicationSet for my LGTM stack. So I commented it out.

    Now We’re Cooking!

    Lo and behold, ArgoCD looked at my monitoring stack and said “well, you have some updates, don’t you!” I spent the next few minutes syncing those applications individually. Why? There are a lot of hard-working pods in those applications, and I don’t like to cycle them all at once.

    I searched through my posts and some of my notes, and I honestly have no idea why I decided I should ignore all fields managed by argocd-controller. Needless to say, I will not be doing that again.

  • Do the Right Thing

    My home lab clusters have been running fairly stable, but there are still some hiccups every now and again. As usual, a little investigation led to a pretty substantial change.

    Cluster on Fire

    My production and non-production clusters, which mostly host my own projects, have always been pretty stable. Both clusters are set up with 3 nodes as the control plane, since I wanted more than 1 and you need an odd number for quorum. And since I didn’t want to run MORE machines as agents, I just let those nodes host user workloads in addition to the control plane. With 4 vCPUs and 8 GB of RAM per node, well, those clusters had no issues.

    My “internal tools” cluster is another matter. Between Mimir, Loki, and Tempo running ingestion, there is a lot going on in that cluster. I added a 4th node that serves as just an agent for that cluster, but I still had some pod stability issues.

    I started digging into the node-exporter metrics for the three “control plane + worker” nodes in the internal cluster, and they were, well, on fire. The system load was consistently over 100% (the 15 minute average was something like 4.05 out of 4 on all three). I was clearly crushing those nodes. And, since those nodes hosted the control plane as well as the workloads, instability in the control plane caused instability in the cluster.

    Isolating the Control Plane

    At that point, I decided that I could not wait any longer. I had to isolate the control plane and etcd from the rest of my workloads. While I know that it is, in fact, best practice, I was hoping to avoid it in the lab, as it causes a slight proliferation in VMs. How so? Let’s do the math:

    Right now, all of my clusters have at least 3 nodes, and internal has 4. So that’s 10 VMs with 4 vCPU and 8 GB of RAM assigned, or 40 vCPUs and 80 GB of RAM. If I want all of my clusters to have isolated control planes, that means more VMs. But…

    Control plane nodes don’t need nearly the size if I’m not opening them up to other workloads. And for my non-production cluster, I don’t need the redundancy of multiple control plane nodes. So 4 vCPU/8 GB RAM becomes 2 vCPU/4 GB RAM for a control plane node, and I can use a single node for the non-production control plane. But what about the workloads? To start, I’ll use two 4 vCPU/8 GB RAM nodes each for production and non-production, and three of that same size for the internal cluster.

    In case you aren’t keeping a running total, the new plan is as follows:

    • 7 small nodes (2 vCPU/4GB RAM) for control plane nodes across the three clusters (3 for internal and production, 1 for non-production)
    • 7 medium nodes (4 vCPU/8GB RAM) for worker nodes across the three clusters (2 for non-production and production, 3 for internal).

    So, it’s 14 VMs, up from 10, but it is only an extra 2 vCPUs and 4 GB of RAM. I suppose I can live with that.

    Make it so!

    Since most of my server creation is scripted, I only had to make a few changes to support this updated structure. I added a taint to the RKE2 server configuration so that only critical items are scheduled on the server nodes:

    node-taint:
    - CriticalAddonsOnly=true:NoExecute

    I also removed any server nodes from the tfx-<cluster name> DNS record, since the Nginx pods will only run on agent nodes now.

    Once that was done, I just had to provision new agent nodes for each of the clusters, and then replace the current server nodes with newly provisioned nodes that have a smaller footprint and the appropriate taints.

    It’s worth noting that, in order to prevent too much churn, I manually added a CriticalAddonsOnly taint to each existing server node (with a NoSchedule effect, so running pods weren’t evicted immediately) AFTER I had all the agents provisioned but before I started replacing server nodes. That way, Kubernetes would not attempt to schedule a user workload coming off an old server onto another server node, but would instead force it onto an agent. For your reference and mine, that command looks like this:

    kubectl taint nodes <node name> CriticalAddonsOnly=true:NoSchedule
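
    As I understand it, this works because everything that genuinely must run on the servers (the RKE2 and control plane components) carries a matching toleration, while regular workloads do not and get pushed to the agents. The toleration in question looks like this:

    tolerations:
      - key: CriticalAddonsOnly
        operator: Exists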

    Results

    I would classify this as a success with an asterisk next to it. I need more time to determine if the cluster stability, particularly for the internal cluster, improves with these changes, so I am not willing to declare outright victory.

    It has, however, given me a much better view into how much processing I actually need in a cluster. For my non-production cluster, the two agents are averaging under 10% load, which means I could probably lose one agent and still be well under 50% load on that node. The production agents are averaging about 15% load. Sure, I could consolidate, but part of the desire is to have some redundancy, so I’ll stick with two agents in production.

    The internal cluster, however, is running pretty hot. I’m running a number of pods for Grafana Mimir/Loki/Tempo ingestion, as well as Prometheus on that cluster itself. So those three nodes are running at about 50-55% average load, with spikes above 100% on the one agent that is running both the Prometheus collector and a copy of the Mimir ingester. I’m going to keep an eye on that and see if the load creeps up. In the meantime, I’ll also be looking to see what, if anything, can be optimized or offloaded. If I find something to fix, you can be sure it’ll make a post.