In December, I migrated all of my home lab clusters from NGINX Ingress to Envoy Gateway. I wrote about the process in Modernizing the Gateway, covering the phased rollout, the HTTPRoute conversions, and the cleanup of years of Traefik and NGINX configuration.
Three months later, the most honest thing I can say about the migration is this: I forgot it happened.
No incidents. No late-night debugging sessions. No "why is this service unreachable" Slack messages to myself. The traffic flows, TLS terminates, and routes resolve. Envoy Gateway has been completely invisible — and if you've ever operated a reverse proxy, you know that invisible is the highest compliment you can pay one.
The Clock Ran Out
This month — March 2026 — NGINX Ingress Controller is officially retired. No more releases, no security patches, no bug fixes. The Kubernetes SIG Network announcement from November 2024 made the recommendation clear: migrate to the Gateway API.
If you're still running NGINX Ingress, you're now running unsupported software in your ingress path. That's the component handling every request into your cluster. It's not the place to carry technical debt.
I got ahead of the deadline by about three months. Not because I'm unusually disciplined — I just happened to be in the middle of a network rebuild and the timing lined up. But even if it hadn't, the migration turned out to be straightforward enough that I'd have been comfortable doing it under pressure.
Why This Migration Was Boring (In the Best Way)
Not every migration I've written about on this blog has gone smoothly. I've had my share of "what do you mean the cluster won't come back up" moments. So what made this one different?
The ecosystem was ready. Most of the Helm charts I use already supported Gateway API resources natively. I wasn't hand-rolling HTTPRoute manifests from scratch or writing custom templates — the charts had gateway configuration sections waiting to be enabled. When the tooling meets you where you are, migrations shrink from projects to tasks.
The Gateway API is genuinely better. HTTPRoute is more expressive than Ingress. The explicit parentRefs model makes it immediately clear which gateway handles which route. No more guessing which ingress class annotation you need, or whether the controller will actually pick up your resource. The separation between infrastructure operators (who manage Gateways) and application developers (who manage Routes) maps cleanly to how I think about my own deployments, even as a team of one.
Envoy is battle-tested. Envoy Gateway is the new part, but Envoy proxy has been handling production traffic at massive scale for years. I wasn't betting on an unproven proxy — I was adopting a new management layer for a proven one. The EnvoyProxy custom resource gives real control over proxy behavior without the annotation soup that NGINX Ingress required.
What I'm Not Using
Honesty requires mentioning the features I haven't touched. Envoy Gateway supports traffic splitting, request mirroring, and multi-cluster routing. I'm not using any of them.
Multi-cluster traffic management is handled by Linkerd's multicluster extension in my setup, and it works well. I didn't migrate to Envoy Gateway to replace that — I migrated because my ingress controller was being retired and the Gateway API is the clear path forward.
The advanced routing features are there if I need them. I don't yet. And I'd rather have capabilities on the shelf than be forced into a migration when I actually need them.
The Honest Recommendation
If you're still on NGINX Ingress: migrate now. Not next quarter, not when you "have time." The retirement is here, and the migration is genuinely not that bad.
Here's what worked for me:
Deploy Envoy Gateway alongside your existing ingress. Don't rip and replace. Run both, migrate routes one at a time.
Start with your least critical service. Get comfortable with HTTPRoute syntax on something that won't page you.
Check your Helm charts. You might be surprised how many already support Gateway API resources. The migration might be a values file change, not a manifest rewrite.
Clean up after yourself. Once everything is migrated, remove the old ingress controller entirely. Don't leave it running "just in case." Dead configuration is how you end up with 124 devices on your network and no idea what half of them do.
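To illustrate the values-file case: a chart that already supports both APIs often exposes toggles like these. The keys below are chart-specific and purely illustrative — check your chart's documentation for the actual names.

```yaml
# Hypothetical chart values — key names vary per chart
ingress:
  enabled: false        # retire the old Ingress resource
httpRoute:
  enabled: true         # let the chart render an HTTPRoute instead
  parentRefs:
    - name: main
      namespace: envoy-gateway-system
```

If your chart has a section like this, the "migration" for that service really is a values change and a sync.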
The whole process took me about a week across three clusters, and most of that was methodical testing rather than actual problem-solving.
The Takeaway
I've written nearly 200 posts on this blog, and a surprising number of them are migration stories. Some were painful. Some were educational. Some were both. This one was neither — and that's the best possible outcome.
The goal of infrastructure isn't to be interesting. It's to be invisible. Envoy Gateway got out of the way and let me focus on the things that actually matter. Three months in, I have nothing to report. And sometimes, that's the best review I can give.
Running multiple Kubernetes clusters is great until you realize your telemetry traffic is taking an unnecessarily complicated path. Each cluster had its own Grafana Alloy instance dutifully collecting metrics, logs, and traces—and each one was routing through an internal Nginx reverse proxy to reach the centralized observability platform (Loki, Mimir, and Tempo) running in my internal cluster.
This worked, but it had that distinct smell of “technically functional” rather than “actually good.” Traffic was staying on the internal network (thanks to a shortcut DNS entry that bypassed Cloudflare), but why route through an Nginx proxy when the clusters could talk directly to each other? Why maintain those external service URLs when all my clusters are part of the same infrastructure?
Linkerd multi-cluster seemed like the obvious answer for establishing direct cluster-to-cluster connections, but the documentation leaves a lot unsaid when you’re dealing with on-premises clusters without fancy load balancers. Here’s how I made it work.
The Problem: Telemetry Taking the Scenic Route
My setup looked like this:
– Internal cluster: Running Loki, Mimir, and Tempo behind an Nginx gateway
– Production cluster: Grafana Alloy sending telemetry to loki.mattgerega.net, mimir.mattgerega.net, etc.
– Nonproduction cluster: Same deal, different tenant ID
Every metric, log line, and trace span was leaving the cluster, hitting the Nginx reverse proxy, and finally making it to the monitoring services—which were running in a cluster on the same physical network. The inefficiency was bothering me more than it probably should have.
This meant:
– An unnecessary hop through the Nginx proxy layer
– Extra TLS handshakes that didn’t add security value between internal services
– DNS resolution for external service names when direct cluster DNS would suffice
– One more component in the path that could cause issues
The Solution: Hub-and-Spoke with Linkerd Multi-Cluster
Linkerd’s multi-cluster feature does exactly what I needed: it mirrors services from one cluster into another, making them accessible as if they were local. The service mesh handles all the mTLS authentication, routing, and connection management behind the scenes. From the application’s perspective, you’re just calling a local Kubernetes service.
For my setup, a hub-and-spoke topology made the most sense. The internal cluster acts as the hub—it runs the Linkerd gateway and hosts the actual observability services (Loki, Mimir, and Tempo). The production and nonproduction clusters are spokes—they link to the internal cluster and get mirror services that proxy requests back through the gateway.
The beauty of this approach is that only the hub needs to run a gateway. The spoke clusters just run the service mirror controller, which watches for exported services in the hub and automatically creates corresponding proxy services locally. No complex mesh federation, no VPN tunnels, just straightforward service-to-service communication over mTLS.
Gateway Mode vs. Flat Network
(Spoiler: Gateway Mode Won)
Linkerd offers two approaches for multi-cluster communication:
Flat Network Mode: Assumes pod networks are directly routable between clusters. Great if you have that. I don’t. My three clusters each have their own pod CIDR ranges with no interconnect.
Gateway Mode: Routes cross-cluster traffic through a gateway pod that handles the network translation. This is what I needed, but it comes with some quirks when you’re running on-premises without a cloud load balancer.
The documentation assumes you’ll use a LoadBalancer service type, which automatically provisions an external IP. On-premises? Not so much. I went with NodePort instead, exposing the gateway on port 30143.
The Configuration: Getting the Helm Values Right
Here’s what the internal cluster’s Linkerd multi-cluster configuration looks like:
```yaml
linkerd-multicluster:
  gateway:
    enabled: true
    port: 4143
    serviceType: NodePort
    nodePort: 30143
    probe:
      port: 4191
      nodePort: 30191
  # Grant access to service accounts from other clusters
  remoteMirrorServiceAccountName: linkerd-service-mirror-remote-access-production,linkerd-service-mirror-remote-access-nonproduction
```
And for the production/nonproduction clusters:
```yaml
linkerd-multicluster:
  gateway:
    enabled: false  # No gateway needed here
  remoteMirrorServiceAccountName: linkerd-service-mirror-remote-access-in-cluster-local
```
The Link: Connecting Clusters Without Auto-Discovery
Creating the cluster link was where things got interesting. The standard `linkerd multicluster link` command assumes it can auto-discover the gateway address from a LoadBalancer service. Separating `--gateway-addresses` and `--gateway-port` made all the difference.
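A sketch of the resulting link command — the context names and the link name (`in-cluster-local`) are illustrative:

```shell
# Generate the link credentials against the hub (internal) cluster,
# then apply them to the spoke (production) cluster.
linkerd multicluster link \
  --context internal \
  --cluster-name in-cluster-local \
  --gateway-addresses tfx-internal.gerega.net \
  --gateway-port 30143 \
  | kubectl --context production apply -f -
```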
I used DNS (tfx-internal.gerega.net) instead of hard-coded IPs for the gateway address. This is an internal DNS entry that round-robins across all agent node IPs in the internal cluster. The key advantage: when I cycle nodes (stand up new ones and destroy old ones), the DNS entry is maintained automatically. No manual updates to cluster links, no stale IP addresses, no coordination headaches—the round-robin DNS just picks up the new node IPs and drops the old ones.
Service Export: Making Services Visible Across Clusters
Linkerd doesn’t automatically mirror every service. You have to explicitly mark which services should be exported using the mirror.linkerd.io/exported: "true" label.
For the Loki gateway (and similarly for Mimir and Tempo):
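In my setup this amounts to adding the export label to the gateway Service; the service name and namespace below are illustrative (in practice it's set through the chart's service labels):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: loki-gateway          # illustrative name
  namespace: loki
  labels:
    # Tells the service mirror controller to mirror this service into linked clusters
    mirror.linkerd.io/exported: "true"
```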
The final piece was updating Grafana Alloy’s configuration to use the mirrored services instead of the external URLs. Here’s the before and after for Loki:
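A sketch of the change in Alloy's configuration — the mirrored service name (Linkerd appends the link name to mirrored services) and tenant ID are assumptions:

```river
// Before: external URL through the Nginx reverse proxy
loki.write "default" {
  endpoint {
    url       = "https://loki.mattgerega.net/loki/api/v1/push"
    tenant_id = "production"
  }
}

// After: direct to the mirrored service via the Linkerd gateway
loki.write "default" {
  endpoint {
    url       = "http://loki-gateway-in-cluster-local.loki.svc.cluster.local/loki/api/v1/push"
    tenant_id = "production"
  }
}
```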
No more TLS, no more public DNS, no more reverse proxy hops. Just a direct connection through the Linkerd gateway.
But wait—there’s one more step.
The Linkerd Injection Gotcha
Grafana Alloy pods need to be part of the Linkerd mesh to communicate with the mirrored services. Without the Linkerd proxy sidecar, the pods can’t authenticate with the gateway’s mTLS requirements.
This turned into a minor debugging adventure because I initially placed the `podAnnotations` at the wrong level in the Helm values. The Grafana Alloy chart is a wrapper around the official chart, which means the structure is:
```yaml
alloy:
  controller:  # Not alloy.alloy!
    podAnnotations:
      linkerd.io/inject: enabled
  alloy:
    # ... other config
```
Once that was fixed and the pods restarted, they came up with 3 containers instead of 2:
– `linkerd-proxy` (the magic sauce)
– `alloy` (the telemetry collector)
– `config-reloader` (for hot config reloads)
Checking the gateway logs confirmed traffic was flowing:
There’s one quirk worth mentioning: the multi-cluster probe health checks don’t work in NodePort mode. The service mirror controller tries to check the gateway’s health endpoint and reports it as unreachable, even though service mirroring works perfectly.
From what I can tell, this is because the health check endpoint expects to be accessed through the gateway service, but NodePort doesn’t provide the same service mesh integration as a LoadBalancer. The practical impact? None. Services mirror correctly, traffic routes successfully, mTLS works. The probe check just complains in the logs.
What I Learned
1. Gateway mode is essential for non-routable pod networks. If your clusters don’t have a CNI that supports cross-cluster routing, gateway mode is the way to go.
2. NodePort works fine for on-premises gateways. You don’t need a LoadBalancer if you’re willing to manage DNS.
3. DNS beats hard-coded IPs. Using `tfx-internal.gerega.net` means I can recreate nodes without updating cluster links.
4. Service injection is non-negotiable. Pods must be part of the Linkerd mesh to access mirrored services. No injection, no mTLS, no connection.
5. Helm values hierarchies are tricky. Always check the chart templates when podAnnotations aren’t applying. Wrapper charts add extra nesting.
The Result
Telemetry now flows directly from production and nonproduction clusters to the internal observability stack through Linkerd’s multi-cluster gateway—all authenticated via mTLS, bypassing the Nginx reverse proxy entirely.
I didn’t reduce the number of monitoring stacks (each cluster still runs Grafana Alloy for collection), but I simplified the routing by using direct cluster-to-cluster connections instead of going through the Nginx proxy layer. No more proxy hops. No more external service DNS. Just three Kubernetes clusters talking to each other the way they should have been all along.
The full configuration is in the ops-argo and ops-internal-cluster repositories, managed via ArgoCD ApplicationSets. Because if there’s one thing I’ve learned, it’s that GitOps beats manual kubectl every single time.
As with any good engineer, I cannot leave well enough alone. Over the past week, I’ve been working through a significant infrastructure modernization across my home lab clusters – migrating from NGINX Ingress to Envoy Gateway and implementing the Kubernetes Gateway API. This also involved some necessary housekeeping with chart updates and a shift to Server-Side Apply for all ArgoCD-managed resources.
Why Change?
The timing couldn’t have been better. In November 2024, the Kubernetes SIG Network and Security Response Committee announced that Ingress NGINX will be retired in March 2026. The project has struggled with insufficient maintainer support, security concerns around configuration snippets, and accumulated technical debt. After March 2026, there will be no further releases, security patches, or bug fixes.
The announcement strongly recommends migrating to the Gateway API, described as “the modern replacement for Ingress.” This validated what I’d already been considering – the Gateway API provides a more standardized, vendor-neutral approach with better separation of concerns between infrastructure operators and application developers.
Envoy Gateway, being a CNCF project built on the battle-tested Envoy proxy, seemed like a natural choice for this migration. Plus, it gave me an excuse to finally move off Traefik, which was… well, let’s just say it was time for a change.
The Migration Journey
The migration happened in phases across my ops-argo, ops-prod-cluster, and ops-nonprod-cluster repositories. Here’s what changed:
Phase 1: Adding Envoy Gateway
I started by adding Envoy Gateway as a cluster tool, complete with its own ApplicationSet that deploys to clusters labeled with spydersoft.io/envoy-gateway: "true". The deployment includes:
GatewayClass and Gateway resources: Defined a main gateway that handles traffic routing
EnvoyProxy configuration: Set up with a static NodePort service for consistent external access
ClientTrafficPolicy: Configured to properly handle forwarded headers – crucial for preserving client IP information through the proxy chain
The Envoy Gateway deployment lives in the envoy-gateway-system namespace and exposes services via NodePort 30080 and 30443, making it easy to integrate with my existing network setup.
Phase 2: Migrating Applications to HTTPRoute
This was the bulk of the work. Each application needed its Ingress resource replaced with an HTTPRoute. The new Gateway API resources are much cleaner. For example, my blog (www.mattgerega.com) went from an Ingress definition to this:
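A sketch of the resulting HTTPRoute — the route name, gateway name, and backend service details are illustrative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: www-mattgerega-com
spec:
  # Explicitly attach this route to the main gateway
  parentRefs:
    - name: main
      namespace: envoy-gateway-system
  hostnames:
    - www.mattgerega.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: blog          # illustrative backend Service
          port: 80
```

Compared to an Ingress, the gateway attachment is explicit in `parentRefs` rather than implied by an ingress class annotation.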
Phase 3: Server-Side Apply
Alongside the routing changes, I switched all ArgoCD-managed resources to Server-Side Apply. This is a better practice for ArgoCD as it allows Kubernetes to handle three-way merge patches instead of client-side strategic merge, reducing conflicts and improving sync reliability.
Lessons Learned
Gateway API is ready for production: The migration was surprisingly smooth. The Gateway API resources are well-documented and intuitive. With NGINX Ingress being retired, now’s the time to make the jump.
HTTPRoute vs. Ingress: HTTPRoute is more expressive and allows for more sophisticated routing rules. The explicit parentRefs concept makes it clear which gateway handles which routes.
Server-Side Apply everywhere: Should have done this sooner. The improved conflict handling makes ArgoCD much more reliable, especially when multiple controllers touch the same resources.
Envoy’s configurability: The EnvoyProxy custom resource gives incredible control over the proxy configuration without needing to edit ConfigMaps or deal with annotations.
Multi-cluster consistency: Making these changes across production and non-production environments simultaneously kept everything aligned and reduced cognitive overhead when switching between environments.
Current Status
All applications across all clusters are now running through Envoy Gateway with the Gateway API. Traffic is flowing correctly, TLS is terminating properly, and I’ve removed all the old ingress-related configuration from both production and non-production environments.
The clusters are more standardized, the configuration is cleaner, and I’m positioned to take advantage of future Gateway API features like traffic splitting and more advanced routing capabilities. More importantly, I’m ahead of the March 2026 retirement deadline with plenty of time to spare.
Now, the real question: what am I going to tinker with next?
I ran into an odd situation last week with ArgoCD, and it took a bit of digging to figure it out. Hopefully this helps someone else along the way.
Whatever you do, don’t panic!
Well, unless of course you are ArgoCD.
I have a small Azure DevOps job that runs nightly and attempts to upgrade some of the Helm charts that I use to deploy external tools. This includes things like Grafana, Loki, Mimir, Tempo, ArgoCD, External Secrets, and many more. This job deploys the changes to my GitOps repositories, and if there are changes, I can manually sync.
Why not auto-sync, you might ask? Visibility, mostly. I like to see what changes are being applied, in case there is something bigger in the changes that needs my attention. I also like to “be there” if something breaks, so I can rollback quickly.
Last week, while upgrading Grafana and Tempo, ArgoCD started throwing the following error on sync:
Recovered from panic: runtime error: invalid memory address or nil pointer dereference
A quick trip to Google produced a few different results, but nothing immediately apparent. One particular issue mentioned that they had a problem with out-of-date resources (old apiVersion). Let’s put a pin in that.
Nothing was jumping out, and my deployments were still working. I had a number of other things on my plate, so I let this slide for a few days.
Versioning….
When I finally got some time to dig into this, I figured I would pull at that apiVersion thread and see what shook loose. Unfortunately, since there is no good error indicating which resource is the culprit, it was luck of the draw whether I'd find the offender. This time, I was lucky.
My ExternalSecret resources were using some alpha versions, so my first thought was to update them to the v1 version. Lo and behold, that fixed the two charts which were failing.
This, however, leads to a bigger issue: if ArgoCD is not going to inform me when I have out-of-date apiVersion values for a resource, I am going to have to figure out how to validate these resources sometime before I commit the changes. I’ll put this on my ever-growing to-do list.
I have a small home lab running a few Kubernetes clusters, and a good bit of automation to deal with provisioning servers for the K8 clusters. All of my Linux VMs are based on Ubuntu 22.04. I prefer to stick with LTS for stability and compatibility.
As April turns into July (missed some time there), I figured Ubuntu’s latest LTS (24.04) has matured to the point that I could start the process of updating my VMs to the new version.
Easier than Expected
In my previous move from 20.04 to 22.04, there were some changes to the automated installers for 22.04 that forced me down the path of testing my packer provisioning with the 22.04 ISOs. I expected similar changes with 24.04. I was pleasantly surprised when I realized that my existing scripts should work well with the 24.04 ISOs.
I did spend a little time updating the Azure DevOps pipeline that builds a base image so that it supports building both a 22.04 and 24.04 image. I want to make sure I have the option to use the 22.04 images, should I find a problem with 24.04.
Migrating Cluster Nodes
With a base image provisioned, I followed my normal process for upgrading cluster nodes on my non-production cluster. There were a few hiccups, mostly around some of my automated scripts that needed to have the appropriate settings to set hostnames correctly.
What About the Build Agents?
Perhaps in a few months. I use the GitHub runner images as a base for my self-hosted agents, but there are some changes that need manual review. I destroy my Azure DevOps build agent weekly and generate a new one, and that’s a process that I need to make sure continues to work through any changes.
The issue is typically time: the build agents take a few hours to provision because of all the tools that are installed. Testing that takes time, so I have to plan ahead. Plus, well, it is summertime, and I’d much rather be in the pool than behind the desk.
After a few data loss events, I took the time to automate my Grafana backups.
A bit of instability
It has been almost a year since I moved to a MySQL backend for Grafana. In that year, I’ve gotten a corrupted MySQL database twice now, forcing me to restore from a backup. I’m not sure if it is due to my setup or bad luck, but twice in less than a year is too much.
In my previous post, I mentioned the Grafana backup utility as a way to preserve this data. My short-sightedness prevented me from automating those backups, however, so I suffered some data loss. After the most recent event, I revisited the backup tool.
Keep your friends close…
My first thought was to simply write a quick Azure DevOps pipeline to pull the tool down, run a backup, and copy it to my SAN. I would also have had to include some scripting to clean up old backups.
As I read through the grafana-backup-tool documents, though, I came across examples of running the tool as a Job in Kubernetes via a CronJob. This presented a unique opportunity: configure the backup job as part of the Helm chart.
What would that look like? Well, I do not install any external charts directly. They are configured as dependencies for charts of my own. Now, usually, that just means a simple values file that sets the properties on the dependency. In the case of Grafana, though, I’ve already used this functionality to add two dependent charts (Grafana and MySQL) to create one larger application.
This setup also allows me to add additional templates to the Helm chart to create my own resources. I added two new resources to this chart:
grafana-backup-cron – A definition for the cronjob, using the ysde/grafana-backup-tool image.
grafana-backup-secret-es – An ExternalSecret definition to pull secrets from Hashicorp Vault and create a Secret for the job.
Since this is all built as part of the Grafana application, the secrets for Grafana were already available. I went so far as to add a section in the values file for the backup. This allowed me to enable/disable the backup and update the image tag easily.
Where to store it?
The other enhancement I noticed in the backup tool was the ability to store files in S3 compatible storage. In fact, their example showed how to connect to a MinIO instance. As fate would have it, I have a MinIO instance running on my SAN already.
So I configured a new bucket in my MinIO instance, added a new access key, and configured those secrets in Vault. After committing those changes and synchronizing in ArgoCD, the new resources were there and ready.
Can I test it?
Yes I can. Google, once again, pointed me to a way to create a Job from a CronJob:
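The command in question — my CronJob name and namespace here are illustrative:

```shell
# Spawn a one-off Job from the CronJob's template to test the backup
kubectl create job grafana-backup-test \
  --from=cronjob/grafana-backup-cron \
  --namespace grafana
```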
I ran the above command to create a test job. And, voilà, I have backup files in MinIO!
Cleaning up
Unfortunately, there doesn’t seem to be a retention setting in the backup tool. It looks like I’m going to have to write some code to clean up my Grafana backups bucket, especially since I have daily backups scheduled. Either that, or look at this issue and see if I can add it to the tool. Maybe I’ll brush off my Python skills…
The Bitnami Redis Helm chart has thrown me a curve ball over the last week or so, and made me look at Kubernetes NetworkPolicy resources.
Redis Chart Woes
Bitnami seems to be updating their charts to include default NetworkPolicy resources. While I don’t mind this, a jaunt through their open issues suggests that it has not been a smooth transition.
The redis chart’s initial release of NetworkPolicy objects broke the metrics container, since the default NetworkPolicy didn’t add the metrics port to allowed ingress ports.
So I sat on the old chart until the new Redis chart was available.
And now, Connection Timeouts
Once the update was released, I rolled out the new version of Redis. The containers came up, and I didn’t really think twice about it. Until, that is, I decided to do some updates to both my applications and my Kubernetes nodes.
I upgraded some of my internal applications to .NET 8. This caused all of them to restart, and, in the process, get their linkerd-proxy sidecars running. I also started cycling the nodes on my internal cluster. When it came time to call my Unifi IP Manager API to delete an old assigned IP, I got an internal server error.
A quick check of the logs showed that the pod’s Redis connection was failing. Odd, I thought, since most other connections have been working fine, at least through last week.
After a few different Google searches, I came across this section in the Linkerd.io documentation. As it turns out, when you use NetworkPolicy resources with opaque ports (like Redis), you have to make sure that Linkerd’s inbound port (which defaults to 4143) is also allowed in the NetworkPolicy.
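A minimal sketch of what the policy needs to allow — the pod selector labels and metrics port are assumptions based on typical chart defaults:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redis
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: redis   # illustrative selector
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - port: 6379   # redis itself
        - port: 9121   # metrics exporter (the earlier breakage)
        - port: 4143   # linkerd-proxy inbound — meshed traffic arrives here
```

The key point: once a pod is meshed, traffic arrives at the linkerd-proxy's inbound port, not the application port, so a policy that only allows 6379 silently breaks meshed clients.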
Will I start writing my own NetworkPolicy resources? Maybe. This is my first exposure to them, so I would like to understand how they operate and what best practices are for such things. In the meantime, I’ll be a little more wary when I see NetworkPolicy resources pop up in external charts.
I am working on building a set of small reference applications to demonstrate some of the patterns and practices to help modernize cloud applications. In configuring all of this in my home lab, I spent at least 3 hours fighting a problem that turned out to be a configuration issue.
Backend-for-Frontend Pattern
I will get into more details when I post the full application, but I am trying to build out a SPA with a dedicated backend API that would host the SPA and take care of authentication. As is typically the case, I was able to get all of this working on my local machine, including the necessary proxying of calls via the SPA’s development server (again, more on this later).
At some point, I had two containers ready to go: a BFF container hosting the SPA and the dedicated backend, and an API container hosting a data service. I felt ready to deploy to the Kubernetes cluster in my lab.
Let the pain begin!
I have enough samples within Helm/Helmfile that getting the items deployed was fairly simple. After fiddling with the settings of the containers, things were running well in the non-authenticated mode.
However, when I clicked login, the following happened:
1. I was redirected to my OAuth 2.0/OIDC provider.
2. I entered my username/password.
3. I was redirected back to my application.
4. I got a 502 Bad Gateway screen.
502! But why? I consulted Google and found any number of articles indicating that, in the authentication flow, Nginx’s default header size limits are too small for what comes back from the redirect. So, consulting the Nginx configuration documents, I changed the Nginx configuration in my reverse proxy to allow for larger headers.
No luck. Weird. In the spirit of true experimentation (change one thing at a time), I backed those changes out and tried changing the configuration of my Nginx Ingress controller. No luck. So what’s going on?
Too Many Cooks
My current implementation looks like this:
```mermaid
flowchart TB
    A[UI] --UI Request--> B(Nginx Reverse Proxy)
    B --> C("Kubernetes Ingress (Nginx)")
    C --> D[UI Pod]
```
There are two Nginx instances between all of my traffic: an instance outside of the cluster that serves as my reverse proxy, and an Nginx ingress controller that serves as the reverse proxy within the cluster.
I tried changing both separately. Then I tried changing both at the same time. And I was still seeing this error. As it turns out, I was also working from some bad configuration values.
Be careful what you read on the Internet
As it turns out, the issue was the difference in configuration between the two Nginx instances and some bad configuration values that I got from old internet articles.
Reverse Proxy Configuration
For the Nginx instance running on Ubuntu, I added the following to my nginx.conf file under the http section:
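A sketch of the kind of additions involved — the buffer sizes below are illustrative, not the exact values from my config:

```nginx
http {
    # ...
    # Allow larger request headers from clients
    large_client_header_buffers 4 32k;

    # Allow larger response headers from upstreams
    # (the post-login OIDC redirect carries big cookies/headers)
    proxy_buffer_size       16k;
    proxy_buffers           4 16k;
    proxy_busy_buffers_size 32k;
}
```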
I am running RKE2 clusters, so configuring Nginx involves a HelmChartConfig resource being created in the kube-system namespace. My cluster configuration looks like this:
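A sketch of such a HelmChartConfig, assuming RKE2's bundled rke2-ingress-nginx chart and illustrative buffer sizes:

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-ingress-nginx
  namespace: kube-system
spec:
  valuesContent: |-
    controller:
      config:
        # ingress-nginx ConfigMap options for header/buffer sizing
        large-client-header-buffers: "4 32k"
        proxy-buffer-size: "16k"
        proxy-buffers-number: "4"
```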
The combination of both of these settings got my redirects to work without the 502 errors.
Better living through logging
One of the things I fought with on this was finding the appropriate logs to see where the errors were occurring. I’m exporting my reverse proxy logs into Loki using a Promtail instance that listens on a syslog port. So I was “getting” the logs into Loki, but I couldn’t FIND them.
I forgot about the facility setting in syslog: I have the access logs sending as local5, but I configured the error logs without pointing them to local5. I learned that, by default, they go to local7.
Once I found the logs I was able to diagnose the issue, but I spent a lot of time browsing in Loki looking for those logs.
I recently fixed some synchronization issues that had been silently plaguing some of the monitoring applications I had installed, including my Loki/Grafana/Tempo/Mimir stack. Now that the applications are being updated, I ran into an issue with the latest Helm chart’s handling of secrets.
Sync Error?
After I made the change to fix synchronization of the Helm charts, I went to sync my Grafana chart, but received a sync error:
Error: execution error at (grafana/charts/grafana/templates/deployment.yaml:36:28): Sensitive key 'database.password' should not be defined explicitly in values. Use variable expansion instead.
I certainly didn’t change anything in those files, and I am already using variable expansion in the values.yaml file anyway. What does that mean? Basically, in the values.yaml file, I used ${ENV_NAME} in areas where I had a secret value, and told Grafana to expand environment variables into the configuration.
I already had a Kubernetes secret being populated from Hashicorp Vault with my secret values. I also already had envFromSecret set in the values.yaml to instruct the chart to use my secret. And, through some dumb luck, two of the three values were already named using the standards in Grafana’s documentation.
So the “fix” was to simply remove the secret expansions from the values.yaml file, and rename one of the secretKey values so that it matched Grafana’s environment variable template. You can see the diff of the change in my GitHub repository.
With that change, the Helm chart generated correctly, and once Argo had the changes in place, everything was up and running.
Some of the charts in my Loki/Grafana/Tempo/Mimir stack have an odd habit of not updating correctly in ArgoCD. I finally got tired of it and fixed it… I’m just not 100% sure how.
Ignoring Differences
At some point in the past, I had customized a few of my Application objects with ignoreDifferences settings. It was meant to tell ArgoCD to ignore fields that are managed by other things, and could change from the chart definition.
Like what, you might ask? Well, the external-secrets chart generates its own caBundle and sets properties on a ValidatingWebhookConfiguration object. Obviously, that’s managed by the controller, and I don’t want to mess with it. However, I also don’t want ArgoCD to report that chart as Out of Sync all the time.
So, as an example, my external-secrets application looks like this:
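A sketch of that shape — the exact JSON pointer is illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: external-secrets
spec:
  # ... source, destination, syncPolicy ...
  ignoreDifferences:
    # The controller injects its own caBundle; don't flag it as drift
    - group: admissionregistration.k8s.io
      kind: ValidatingWebhookConfiguration
      jsonPointers:
        - /webhooks/0/clientConfig/caBundle
```

The important property: this ignores one specific controller-managed field on one kind, nothing more.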
And that worked just fine. But, with my monitor stack, well, I think I made a boo-boo.
Ignoring too much
When I looked at the application differences for some of my Grafana resources, I noticed that the live vs desired image was wrong. My live image was older than the desired one, and yet, the application wasn’t showing as out of sync.
At this point, I suspected ignoreDifferences was the issue, so I looked at the Application manifest. For some reason, my monitoring applications had an Application manifest that looked like this:
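A sketch of what such an over-broad entry looks like — the group/kind wildcards are an assumption, but the `managedFieldsManagers` part is the culprit described below:

```yaml
ignoreDifferences:
  # Ignore EVERY field that argocd-controller itself manages, on every resource —
  # which is essentially everything ArgoCD deploys
  - group: "*"
    kind: "*"
    managedFieldsManagers:
      - argocd-controller
```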
Notice the part where I am ignoring managed fields from argocd-controller. I have no idea why I added that, but it looks a little “all inclusive” for my tastes, and it was ONLY present in the ApplicationSet for my LGTM stack. So I commented it out.
Now We’re Cooking!
Lo and behold, ArgoCD looked at my monitoring stack and said “well, you have some updates, don’t you!” I spent the next few minutes syncing those applications individually. Why? There are a lot of hard working pods in those applications, I don’t like to cycle them all at once.
I searched through my posts and some of my notes, and I honestly have no idea why I decided I should ignore all fields managed by argocd-controller. Needless to say, I will not be doing that again.