Category: Software

  • Simplifying Internal Routing

    Centralizing Telemetry with Linkerd Multi-Cluster

    Running multiple Kubernetes clusters is great until you realize your telemetry traffic is taking an unnecessarily complicated path. Each cluster had its own Grafana Alloy instance dutifully collecting metrics, logs, and traces—and each one was routing through an internal Nginx reverse proxy to reach the centralized observability platform (Loki, Mimir, and Tempo) running in my internal cluster.

    This worked, but it had that distinct smell of “technically functional” rather than “actually good.” Traffic was staying on the internal network (thanks to a shortcut DNS entry that bypassed Cloudflare), but why route through an Nginx proxy when the clusters could talk directly to each other? Why maintain those external service URLs when all my clusters are part of the same infrastructure?

    Linkerd multi-cluster seemed like the obvious answer for establishing direct cluster-to-cluster connections, but the documentation leaves a lot unsaid when you’re dealing with on-premises clusters without fancy load balancers. Here’s how I made it work.

    The Problem: Telemetry Taking the Scenic Route

    My setup looked like this:

    Internal cluster: Running Loki, Mimir, and Tempo behind an Nginx gateway

    Production cluster: Grafana Alloy sending telemetry to loki.mattgerega.net, mimir.mattgerega.net, etc.

    Nonproduction cluster: Same deal, different tenant ID

    Every metric, log line, and trace span was leaving the cluster, hitting the Nginx reverse proxy, and finally making it to the monitoring services—which were running in a cluster on the same physical network. The inefficiency was bothering me more than it probably should have.

    This meant:

    – An unnecessary hop through the Nginx proxy layer

    – Extra TLS handshakes that didn’t add security value between internal services

    – DNS resolution for external service names when direct cluster DNS would suffice

    – One more component in the path that could cause issues

    The Solution: Hub-and-Spoke with Linkerd Multi-Cluster

    Linkerd’s multi-cluster feature does exactly what I needed: it mirrors services from one cluster into another, making them accessible as if they were local. The service mesh handles all the mTLS authentication, routing, and connection management behind the scenes. From the application’s perspective, you’re just calling a local Kubernetes service.

    For my setup, a hub-and-spoke topology made the most sense. The internal cluster acts as the hub—it runs the Linkerd gateway and hosts the actual observability services (Loki, Mimir, and Tempo). The production and nonproduction clusters are spokes—they link to the internal cluster and get mirror services that proxy requests back through the gateway.

    The beauty of this approach is that only the hub needs to run a gateway. The spoke clusters just run the service mirror controller, which watches for exported services in the hub and automatically creates corresponding proxy services locally. No complex mesh federation, no VPN tunnels, just straightforward service-to-service communication over mTLS.

    Gateway Mode vs. Flat Network

    (Spoiler: Gateway Mode Won)

    Linkerd offers two approaches for multi-cluster communication:

    Flat Network Mode: Assumes pod networks are directly routable between clusters. Great if you have that. I don’t. My three clusters each have their own pod CIDR ranges with no interconnect.

    Gateway Mode: Routes cross-cluster traffic through a gateway pod that handles the network translation. This is what I needed, but it comes with some quirks when you’re running on-premises without a cloud load balancer.

    The documentation assumes you’ll use a LoadBalancer service type, which automatically provisions an external IP. On-premises? Not so much. I went with NodePort instead, exposing the gateway on port 30143.

    The Configuration: Getting the Helm Values Right

    Here’s what the internal cluster’s Linkerd multi-cluster configuration looks like:

    linkerd-multicluster:
      gateway:
        enabled: true
        port: 4143
        serviceType: NodePort
        nodePort: 30143
        probe:
          port: 4191
          nodePort: 30191
    
      # Grant access to service accounts from other clusters
      remoteMirrorServiceAccountName: linkerd-service-mirror-remote-access-production,linkerd-service-mirror-remote-access-nonproduction

    And for the production/nonproduction clusters:

    linkerd-multicluster:
      gateway:
        enabled: false  # No gateway needed here
    
      remoteMirrorServiceAccountName: linkerd-service-mirror-remote-access-in-cluster-local

    The Link: Connecting Clusters Without Auto-Discovery

    Creating the cluster link was where things got interesting. The standard command assumes you want auto-discovery:

    linkerd multicluster link --cluster-name internal --gateway-addresses internal.example.com:30143

    But that command tries to do DNS lookups on the combined hostname+port string, which fails spectacularly. The fix was simple once I found it:

    linkerd multicluster link \
      --cluster-name internal \
      --gateway-addresses tfx-internal.gerega.net \
      --gateway-port 30143 \
      --gateway-probe-port 30191 \
      --api-server-address https://cp-internal.gerega.net:6443 \
      --context=internal | kubectl apply -f - --context=production

    Separating --gateway-addresses and --gateway-port made all the difference.

    I used DNS (tfx-internal.gerega.net) instead of hard-coded IPs for the gateway address. This is an internal DNS entry that round-robins across all agent node IPs in the internal cluster. The key advantage: when I cycle nodes (stand up new ones and destroy old ones), the DNS entry is maintained automatically. No manual updates to cluster links, no stale IP addresses, no coordination headaches—the round-robin DNS just picks up the new node IPs and drops the old ones.
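    Once the link is applied, a couple of quick checks from the spoke side confirm it took. This is a rough sketch; the exact output varies by Linkerd version, and it assumes the default linkerd-multicluster namespace:

    # Is the "internal" gateway alive, and how many services does it mirror?
    linkerd multicluster gateways --context=production

    # Inspect the Link resource created by the piped `kubectl apply` above
    kubectl get links -n linkerd-multicluster --context=production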

    Service Export: Making Services Visible Across Clusters

    Linkerd doesn’t automatically mirror every service. You have to explicitly mark which services should be exported using the mirror.linkerd.io/exported: "true" label.

    For the Loki gateway (and similarly for Mimir and Tempo):

    gateway:
      service:
        labels:
          mirror.linkerd.io/exported: "true"

    Once the services were exported, they appeared in the production and nonproduction clusters with an `-internal` suffix:

    loki-gateway-internal.monitoring.svc.cluster.local

    mimir-gateway-internal.monitoring.svc.cluster.local

    tempo-gateway-internal.monitoring.svc.cluster.local
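    A quick sanity check from a spoke cluster (a sketch, using the service names above): the mirror services should exist, and their endpoints should point back at the internal cluster's gateway address and NodePort rather than at local pods.

    kubectl get svc -n monitoring --context=production | grep -- '-internal'
    kubectl get endpoints loki-gateway-internal -n monitoring --context=production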

    Grafana Alloy: Switching to Mirrored Services

    The final piece was updating Grafana Alloy’s configuration to use the mirrored services instead of the external URLs. Here’s the before and after for Loki:

    Before:

    loki.write "default" {
      endpoint {
        url = "https://loki.mattgerega.net/loki/api/v1/push"
        tenant_id = "production"
      }
    }

    After:

    loki.write "default" {
      endpoint {
        url = "http://loki-gateway-internal.monitoring.svc.cluster.local/loki/api/v1/push"
        tenant_id = "production"
      }
    }

    No more TLS, no more public DNS, no more reverse proxy hops. Just a direct connection through the Linkerd gateway.

    But wait—there’s one more step.

    The Linkerd Injection Gotcha

    Grafana Alloy pods need to be part of the Linkerd mesh to communicate with the mirrored services. Without the Linkerd proxy sidecar, the pods can’t authenticate with the gateway’s mTLS requirements.

    This turned into a minor debugging adventure because I initially placed the `podAnnotations` at the wrong level in the Helm values. The Grafana Alloy chart is a wrapper around the official chart, which means the structure is:

    alloy:
      controller:  # Not alloy.alloy!
        podAnnotations:
          linkerd.io/inject: enabled
      alloy:
        # ... other config

    Once that was fixed and the pods restarted, they came up with 3 containers instead of 2:

    – `linkerd-proxy` (the magic sauce)

    – `alloy` (the telemetry collector)

    – `config-reloader` (for hot config reloads)

    Checking the gateway logs confirmed traffic was flowing:

    INFO inbound:server:gateway{dst=loki-gateway.monitoring.svc.cluster.local:80}: Adding endpoint addr=10.42.5.4:8080
    INFO inbound:server:gateway{dst=mimir-gateway.monitoring.svc.cluster.local:80}: Adding endpoint addr=10.42.9.18:8080
    INFO inbound:server:gateway{dst=tempo-gateway.monitoring.svc.cluster.local:4317}: Adding endpoint addr=10.42.10.13:4317

    Known Issues: Probe Health Checks

    There’s one quirk worth mentioning: the multi-cluster probe health checks don’t work in NodePort mode. The service mirror controller tries to check the gateway’s health endpoint and reports it as unreachable, even though service mirroring works perfectly.

    From what I can tell, this is because the health check endpoint expects to be accessed through the gateway service, but NodePort doesn’t provide the same service mesh integration as a LoadBalancer. The practical impact? None. Services mirror correctly, traffic routes successfully, mTLS works. The probe check just complains in the logs.

    What I Learned

    1. Gateway mode is essential for non-routable pod networks. If your clusters don’t have a CNI that supports cross-cluster routing, gateway mode is the way to go.

    2. NodePort works fine for on-premises gateways. You don’t need a LoadBalancer if you’re willing to manage DNS.

    3. DNS beats hard-coded IPs. Using `tfx-internal.gerega.net` means I can recreate nodes without updating cluster links.

    4. Service injection is non-negotiable. Pods must be part of the Linkerd mesh to access mirrored services. No injection, no mTLS, no connection.

    5. Helm values hierarchies are tricky. Always check the chart templates when podAnnotations aren’t applying. Wrapper charts add extra nesting.

    The Result

    Telemetry now flows directly from production and nonproduction clusters to the internal observability stack through Linkerd’s multi-cluster gateway—all authenticated via mTLS, bypassing the Nginx reverse proxy entirely.

    I didn’t reduce the number of monitoring stacks (each cluster still runs Grafana Alloy for collection), but I simplified the routing by using direct cluster-to-cluster connections instead of going through the Nginx proxy layer. No more proxy hops. No more external service DNS. Just three Kubernetes clusters talking to each other the way they should have been all along.

    The full configuration is in the ops-argo and ops-internal-cluster repositories, managed via ArgoCD ApplicationSets. Because if there’s one thing I’ve learned, it’s that GitOps beats manual kubectl every single time.

  • Migrating from MinIO to Garage

    When Open Source Isn’t So Open Anymore

    Sometimes migrations aren’t about chasing the newest technology—they’re about abandoning ship before it sinks. In December 2025, MinIO officially entered “maintenance mode” for its open-source edition, effectively ending active development. Combined with earlier moves like removing the admin UI, discontinuing Docker images, and pushing users toward their $96,000+ AIStor paid product, the writing was on the wall: MinIO’s open-source days were over.

    Time to find a replacement.

    Why I Had to Leave MinIO

    Let’s be clear: MinIO used to be excellent open-source software. Past tense. Over the course of 2025, the company systematically dismantled what made it valuable for home lab and small-scale deployments:

    June 2025: Removed the web admin console from the Community Edition. Features like bucket configuration, lifecycle policies, and account management became CLI-only—or you could pay for AIStor.

    October 2025: Stopped publishing Docker images to Docker Hub. Want to run MinIO? Build it from source yourself.

    December 2025: Placed the GitHub repository in “maintenance mode.” No new features, no enhancements, no pull request reviews. Only “critical security fixes…evaluated on a case-by-case basis.”

    The pattern was obvious: push users toward AIStor, a proprietary product starting at nearly $100k, by making the open-source version progressively less usable. The community called it what it was—a lock-in strategy disguised as “streamlining.”

    I’m not paying six figures for object storage in my home lab. Time to migrate.

    Enter Garage

    I needed S3-compatible storage that was:

    • Actually open source, not “open source until we change our minds”
    • Lightweight, suitable for single-node deployments
    • Actively maintained by a community that won’t pull the rug out

    Garage checked all the boxes. Built in Rust by the Deuxfleurs collective, it’s designed for geo-distributed deployments but scales down beautifully to single-node setups. More importantly, it’s genuinely open source—developed by a collective, not a company with a paid product to upsell.

    The Migration Process

    Vault: The Critical Path

    Vault was the highest-stakes piece of this migration. It’s the backbone of my secrets management, and getting this wrong meant potentially losing access to everything. I followed the proper migration path:

    1. Stopped the Vault pod in my Kubernetes cluster—no live migrations, no shortcuts
    2. Used vault operator migrate to transfer the storage backend from MinIO to Garage—this is the officially supported method that ensures data integrity
    3. Updated the vault-storage-config Kubernetes secret to point at the new Garage endpoint
    4. Restarted Vault and unsealed it with my existing keys

    The vault operator migrate command handled the heavy lifting, ensuring every key-value pair transferred correctly. While I could have theoretically just mirrored S3 buckets and updated configs, using the official migration tool gave me confidence nothing would break in subtle ways later.
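    For the curious, the migration config is a small HCL file with a source and destination stanza. This is a sanitized sketch rather than my actual file: the bucket name, credentials, and region values are placeholders; only the endpoints reflect the real MinIO and Garage ports.

    # migrate.hcl (placeholders for bucket, credentials, and regions)
    storage_source "s3" {
      bucket              = "vault-data"
      endpoint            = "http://cloud.gerega.net:39000"  # old MinIO
      access_key          = "OLD_ACCESS_KEY"
      secret_key          = "OLD_SECRET_KEY"
      region              = "us-east-1"
      s3_force_path_style = "true"
    }

    storage_destination "s3" {
      bucket              = "vault-data"
      endpoint            = "http://cloud.gerega.net:3900"   # new Garage
      access_key          = "NEW_ACCESS_KEY"
      secret_key          = "NEW_SECRET_KEY"
      region              = "garage"
      s3_force_path_style = "true"
    }

    With the Vault pod stopped, the actual transfer is just vault operator migrate -config=migrate.hcl.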

    Monitoring Stack: Configuration Updates

    With Vault successfully migrated, the rest was straightforward. I updated S3 endpoint configurations across my monitoring stack in ops-internal-cluster:

    Loki, Mimir, and Tempo all had their storage backends updated:

    • Old: cloud.gerega.net:39000 (MinIO)
    • New: cloud.gerega.net:3900 (Garage)

    I intentionally didn’t migrate historical metrics and logs. This is a lab environment—losing a few weeks of time-series data just means starting fresh with cleaner retention policies. In production, you’d migrate this data. Here? Not worth the effort.

    Monitoring Garage Itself

    I added a Grafana Alloy scrape job to collect Garage’s Prometheus metrics from its /metrics endpoint. No blind spots from day one—if Garage has issues, I’ll know immediately.
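    The scrape job itself is tiny. A rough sketch (it assumes Garage’s admin/metrics endpoint is on its default port 3903 and that a prometheus.remote_write component named “mimir” already exists in the Alloy config; if you set a metrics_token in Garage, the scrape also needs matching credentials):

    prometheus.scrape "garage" {
      targets    = [{"__address__" = "cloud.gerega.net:3903"}]
      forward_to = [prometheus.remote_write.mimir.receiver]
    }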

    Deployment Architecture

    One deliberate choice: Garage runs as a single Docker container on bare metal, not in Kubernetes. Object storage is foundational infrastructure. If my Kubernetes clusters have problems, I don’t want my storage backend tied to that failure domain.

    Running Garage outside the cluster means:

    • Vault stores data independently of cluster state
    • Monitoring storage (Loki, Mimir, Tempo) persists during cluster maintenance
    • One less workload competing for cluster resources
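    For reference, the container launch is a one-liner. This is an illustrative sketch, not my exact command: the image tag and host paths are whatever your garage.toml points at.

    docker run -d --name garage \
      -p 3900:3900 -p 3901:3901 -p 3903:3903 \   # 3900 = S3 API, 3901 = RPC, 3903 = admin/metrics
      -v /etc/garage.toml:/etc/garage.toml \
      -v /var/lib/garage/meta:/var/lib/garage/meta \
      -v /var/lib/garage/data:/var/lib/garage/data \
      dxflrs/garage:v1.0.1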

    Verification and Cleanup

    Before decommissioning MinIO, I verified nothing was still pointing at the old endpoints:

    # Searched across GitOps repos
    grep -r "39000" .        # Old MinIO port
    grep -r "192.168.1.30" . # Old MinIO IP
    grep -r "s3.mattgerega.net" .
    

    Clean sweep—everything migrated successfully.

    Current Status

    Garage has been running for about a week now. Resource usage is lower than MinIO ever was, and everything works:

    • Vault sealed/unsealed multiple times without issues
    • Loki ingesting logs from multiple clusters
    • Mimir storing metrics from Grafana Alloy
    • Tempo collecting distributed traces

    The old MinIO instance is still running but idle. I’ll give it another week before decommissioning entirely—old habits die hard, and having a fallback during initial burn-in feels prudent.

    Port 3900 is the new standard. Port 39000 is legacy. And my infrastructure is no longer dependent on a company actively sabotaging its open-source product.

    Lessons for the Homelab Community

    If you’re still running MinIO Community Edition, now’s the time to plan your exit strategy. The maintenance-mode announcement wasn’t a surprise—it was the inevitable conclusion of a year-long strategy to push users toward paid products.

    Alternatives worth considering:

    • Garage: What I chose. Lightweight, Rust-based, genuinely open source.
    • SeaweedFS: Go-based, active development, designed for large-scale deployments but works at small scale.
    • Ceph RGW: If you’re already running Ceph, the RADOS Gateway provides S3 compatibility.

    The MinIO I deployed years ago was a solid piece of open-source infrastructure. The MinIO of 2025 is a bait-and-switch. Learn from my migration—don’t wait until you’re forced to scramble.


    Technical Details:

    • Garage deployment: Single Docker container on bare metal
    • Migration window: ~30 minutes for Vault migration
    • Vault migration method: vault operator migrate CLI command
    • Affected services: Vault, Loki, Mimir, Tempo, Grafana Alloy
    • Data retained: All Vault secrets, new metrics/logs only
    • Repositories: ops-argo, ops-internal-cluster
    • Garage version: Latest stable release as of December 2025


  • Modernizing the Gateway

    From NGINX Ingress to Envoy Gateway

    As with any good engineer, I cannot leave well enough alone. Over the past week, I’ve been working through a significant infrastructure modernization across my home lab clusters – migrating from NGINX Ingress to Envoy Gateway and implementing the Kubernetes Gateway API. This also involved some necessary housekeeping with chart updates and a shift to Server-Side Apply for all ArgoCD-managed resources.

    Why Change?

    The timing couldn’t have been better. In November 2025, the Kubernetes SIG Network and Security Response Committee announced that Ingress NGINX will be retired in March 2026. The project has struggled with insufficient maintainer support, security concerns around configuration snippets, and accumulated technical debt. After March 2026, there will be no further releases, security patches, or bug fixes.

    The announcement strongly recommends migrating to the Gateway API, described as “the modern replacement for Ingress.” This validated what I’d already been considering – the Gateway API provides a more standardized, vendor-neutral approach with better separation of concerns between infrastructure operators and application developers.

    Envoy Gateway, being a CNCF project built on the battle-tested Envoy proxy, seemed like a natural choice for this migration. Plus, it gave me an excuse to finally move off Traefik, which was… well, let’s just say it was time for a change.

    The Migration Journey

    The migration happened in phases across my ops-argo, ops-prod-cluster, and ops-nonprod-cluster repositories. Here’s what changed:

    Phase 1: Adding Envoy Gateway

    I started by adding Envoy Gateway as a cluster tool, complete with its own ApplicationSet that deploys to clusters labeled with spydersoft.io/envoy-gateway: "true". The deployment includes:

    • GatewayClass and Gateway resources: Defined a main gateway that handles traffic routing
    • EnvoyProxy configuration: Set up with a static NodePort service for consistent external access
    • ClientTrafficPolicy: Configured to properly handle forwarded headers – crucial for preserving client IP information through the proxy chain

    The Envoy Gateway deployment lives in the envoy-gateway-system namespace and exposes services via NodePort 30080 and 30443, making it easy to integrate with my existing network setup.
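    For a sense of what that looks like, here is a trimmed sketch of the core resources. The certificate secret name is a placeholder, and the EnvoyProxy and ClientTrafficPolicy resources are omitted here.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: envoy-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: main
      namespace: envoy-gateway-system
    spec:
      gatewayClassName: envoy-gateway
      listeners:
        - name: https
          protocol: HTTPS
          port: 443
          tls:
            mode: Terminate
            certificateRefs:
              - name: wildcard-cert   # placeholder TLS secret
          allowedRoutes:
            namespaces:
              from: All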

    Phase 2: Migrating Applications to HTTPRoute

    This was the bulk of the work. Each application needed its Ingress resource replaced with an HTTPRoute. The new Gateway API resources are much cleaner. For example, my blog (www.mattgerega.com) went from an Ingress definition to this:

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: wp-mattgerega
      namespace: sites
    spec:
      parentRefs:
        - name: main
          namespace: envoy-gateway-system
      hostnames:
        - www.mattgerega.com
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /
          backendRefs:
            - name: wp-mattgerega-wordpress
              port: 80
    

    Much more declarative and expressive than the old Ingress syntax.

    I migrated several applications across both production and non-production clusters:

    • Gravitee API Management
    • ProGet (my package management system)
    • n8n and Node-RED instances
    • Linkerd-viz dashboard
    • ArgoCD (which also got a GRPCRoute for its gRPC services)
    • Identity Server (across test and stage environments)
    • Tech Radar
    • Home automation services (UniFi client and IP manager)

    Phase 3: Removing the Old Guard

    Once everything was migrated and tested, I removed the old ingress controller configurations. This cleanup happened across all three repositories:

    ops-prod-cluster:

    • Removed all Traefik configuration files
    • Cleaned up traefik-gateway.yaml and traefik-middlewares.yaml

    ops-nonprod-cluster:

    • Removed Traefik configurations
    • Deleted the RKE2 ingress NGINX HelmChartConfig (rke2-ingress-nginx-config.yaml)

    The cluster-resources directories got significantly cleaner with this cleanup. Good riddance to configuration sprawl.

    Phase 4: Chart Maintenance and Server-Side Apply

    While I was in there making changes, I also:

    • Bumped several Helm charts to their latest versions:
      • ArgoCD: 9.1.5 → 9.1.7
      • External Secrets: 1.1.0 → 1.1.1
      • Linkerd components: 2025.11.3 → 2025.12.1
      • Grafana Alloy: 1.4.0 → 1.5.0
      • Common chart dependency: 4.4.0 → 4.5.0
      • Redis deployments updated across production and non-production
    • Migrated all clusters to use Server-Side Apply (ServerSideApply=true in the syncOptions):
      • All cluster tools in ops-argo
      • Production application sets (external-apps, production-apps, cluster-resources)
      • Non-production application sets (external-apps, cluster-resources)

    This is a better practice for ArgoCD as it allows Kubernetes to handle three-way merge patches instead of client-side strategic merge, reducing conflicts and improving sync reliability.
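    In the Application specs, the change is a single line under syncPolicy (fragment shown, the rest of the spec omitted):

    spec:
      syncPolicy:
        syncOptions:
          - ServerSideApply=true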

    Lessons Learned

    Gateway API is ready for production: The migration was surprisingly smooth. The Gateway API resources are well-documented and intuitive. With NGINX Ingress being retired, now’s the time to make the jump.

    HTTPRoute vs. Ingress: HTTPRoute is more expressive and allows for more sophisticated routing rules. The explicit parentRefs concept makes it clear which gateway handles which routes.

    Server-Side Apply everywhere: Should have done this sooner. The improved conflict handling makes ArgoCD much more reliable, especially when multiple controllers touch the same resources.

    Envoy’s configurability: The EnvoyProxy custom resource gives incredible control over the proxy configuration without needing to edit ConfigMaps or deal with annotations.

    Multi-cluster consistency: Making these changes across production and non-production environments simultaneously kept everything aligned and reduced cognitive overhead when switching between environments.

    Current Status

    All applications across all clusters are now running through Envoy Gateway with the Gateway API. Traffic is flowing correctly, TLS is terminating properly, and I’ve removed all the old ingress-related configuration from both production and non-production environments.

    The clusters are more standardized, the configuration is cleaner, and I’m positioned to take advantage of future Gateway API features like traffic splitting and more advanced routing capabilities. More importantly, I’m ahead of the March 2026 retirement deadline with plenty of time to spare.

    Now, the real question: what am I going to tinker with next?

  • Summer Project – Home Lab Refactor

    As with any good engineer, I cannot leave well enough alone. My current rainy day project is reconfiguring my home lab for some much needed updates and simplification.

    What’s Wrong?

    My home lab is, well, still going strong. My automation scripts work well, and I don’t spend a ton of time doing what I need to do to keep things up to date, at least when it comes to my Kubernetes clusters.

    The other servers, however, are in a scary spot. Everything is running on top of the free version of Windows Hyper-V Server from 2019, so general updates are a concern. I would LOVE to move to Windows Server 2025, but I do not have the money for that kind of endeavor.

    The other issue with running Windows Server is that, well, it usually expects a Windows Domain (or, at least, my version does). This requirement has forced me to run my own domain controllers for a number of years now. Earlier iterations of my lab included a lot of Windows VMs, so the domain helped me manage authentication across them all. But with RKE2 and Kubernetes running the bulk of my workloads, the domain controllers are more hassle than they’re worth right now.

    The Plan

    My current plan is to migrate my home server to Proxmox. It seems a pretty solid replacement for Hyper-V, and has a few features in it that I may use in the future, like using cloud-init for creating new cluster nodes and better management of storage.

    Obviously, this is going to require some testing, and luckily, my old laptop is free for some experimentation. So I installed Proxmox there and messed around, and I came up with an interesting plan.

    • Migrate my VMs to my laptop instance of Proxmox, reducing the workload as much as I can.
    • Install Proxmox on my server
    • Create a Proxmox cluster with my laptop and server as the nodes.
    • Transfer my VMs from the laptop node to the server node.

    Cutting my Workload

    My laptop has a paltry 32GB of RAM, compared to 288GB in my server. While I need to get everything “over” to the laptop, it doesn’t all have to be running at the same time.

    For the Windows VMs, my current plan is as follows:

    • Move my primary domain controller to the laptop, but run at a reduced capacity (1 CPU/2GB).
    • Move my backup DC to the laptop, shut it down.
    • Move and shut down both SQL Server instances: they are only running lab DBs, nothing really vital.

    For my clusters, I’m not actually going to “move” the VMs. I’m going to create new nodes on the laptop Proxmox instance, add them to the clusters, and then deprovision the old ones. This gives me some control over what’s there.

    • Non-Production Cluster -> 1 control plane server, 2 agents, but shut them down.
    • Internal Cluster -> 1 control plane server (down from 3), 3 agents, all shut down.
    • Production Cluster -> 1 control plane (down from 3), 2 agents, running vital software. I may need to migrate my HC Vault instance to the production cluster just to ensure secrets stay up and running.

    With this setup, I should really only have 4 VMs running on my laptop, which it should be able to handle. Once that’s done, I’ll have time to install and configure Proxmox on the server, and then move VMs from the laptop to the server.

    Lots to do

    I have a lot of learning to do. Proxmox seems pretty simple to start, but I find I’m having to read a lot about the cloning and cloud-init pieces to really make use of the power of the tool.

    Once I feel comfortable with Proxmox, the actual move will need to be scheduled… So, maybe by Christmas I’ll actually have this done.

  • ArgoCD panicked a little…

    I ran into an odd situation last week with ArgoCD, and it took a bit of digging to figure it out. Hopefully this helps someone else along the way.

    Whatever you do, don’t panic!

    Well, unless of course you are ArgoCD.

    I have a small Azure DevOps job that runs nightly and attempts to upgrade some of the Helm charts that I use to deploy external tools. This includes things like Grafana, Loki, Mimir, Tempo, ArgoCD, External Secrets, and many more. This job deploys the changes to my GitOps repositories, and if there are changes, I can manually sync.

    Why not auto-sync, you might ask? Visibility, mostly. I like to see what changes are being applied, in case there is something bigger in the changes that needs my attention. I also like to “be there” if something breaks, so I can rollback quickly.

    Last week, while upgrading Grafana and Tempo, ArgoCD started throwing the following error on sync:

    Recovered from panic: runtime error: invalid memory address or nil pointer

    A quick trip to Google produced a few different results, but nothing immediately apparent. One particular issue mentioned a problem with out-of-date resources (an old apiVersion). Let’s put a pin in that.

    Nothing was jumping out, and my deployments were still working. I had a number of other things on my plate, so I let this slide for a few days.

    Versioning….

    When I finally got some time to dig into this, I figured I would pull on that apiVersion thread and see what shook loose. Unfortunately, since there is no good error indicating which resource is the problem, it was luck of the draw as to whether or not I found the offender. This time, I was lucky.

    My ExternalSecret resources were using some alpha versions, so my first thought was to update to the v1 version. Lo and behold, that fixed the two charts which were failing.
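    The change itself was a one-liner per resource, something like this (the resource name is a placeholder; the rest of the ExternalSecret spec stays the same):

    # before: apiVersion: external-secrets.io/v1alpha1
    apiVersion: external-secrets.io/v1
    kind: ExternalSecret
    metadata:
      name: example-secret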

    This, however, leads to a bigger issue: if ArgoCD is not going to tell me when a resource has an out-of-date apiVersion, I am going to have to figure out how to validate these resources before I commit the changes. I’ll put this on my ever-growing to-do list.

  • Platform Engineering

    As I continue to build out some reference architecture applications, I realized that there was a great deal of boilerplate code that I add to my APIs to get things running. Time for a library!

    Enter the “Platform”

    I am generally terrible at naming things, but Spydersoft.Platform seemed like a good base namespace for this one. The intent is to put the majority of my boilerplate code into a set of libraries that can be referenced to make adding stuff easier.

    But what kind of “stuff”? Well, for starters:

    • Support for OpenTelemetry trace, metrics, and logging
    • Serilog for console logging
    • Simple JWT identity authentication (for my APIs)
    • Default Health Check endpoints

    Going deep with Health Checks

    The first three were pretty easy: just some POCOs for options and then startup extensions to add the necessary items with the proper configuration. With health checks, however, I went a little overboard.

    My goal was to be able to implement IHealthCheck anywhere and decorate it in such a way that it would be added to the health check framework and could be tagged. Furthermore, I wanted to use tags to drive standard endpoints.

    In the end, I used a custom attribute and some reflection to add the checks that are found in the loaded AppDomain. I won’t bore you: the documentation should do that just fine.
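    To give a flavor of the approach, here is a simplified sketch (not the actual Spydersoft.Platform code; the attribute name and extension method are illustrative):

    using System;
    using System.Linq;
    using Microsoft.Extensions.DependencyInjection;
    using Microsoft.Extensions.Diagnostics.HealthChecks;

    // Hypothetical attribute: marks an IHealthCheck for discovery and carries its tags.
    [AttributeUsage(AttributeTargets.Class)]
    public sealed class TaggedHealthCheckAttribute : Attribute
    {
        public TaggedHealthCheckAttribute(string name, params string[] tags)
        {
            Name = name;
            Tags = tags;
        }

        public string Name { get; }
        public string[] Tags { get; }
    }

    public static class HealthCheckDiscoveryExtensions
    {
        public static IHealthChecksBuilder AddDiscoveredHealthChecks(this IHealthChecksBuilder builder)
        {
            // Scan loaded assemblies for IHealthCheck implementations carrying the attribute.
            var checkTypes = AppDomain.CurrentDomain.GetAssemblies()
                .SelectMany(a => a.GetTypes())
                .Where(t => !t.IsAbstract && typeof(IHealthCheck).IsAssignableFrom(t));

            foreach (var type in checkTypes)
            {
                var attribute = type.GetCustomAttributes(typeof(TaggedHealthCheckAttribute), false)
                    .Cast<TaggedHealthCheckAttribute>()
                    .FirstOrDefault();
                if (attribute is null)
                {
                    continue;
                }

                // Register via a factory so the check can take constructor dependencies.
                builder.Add(new HealthCheckRegistration(
                    attribute.Name,
                    sp => (IHealthCheck)ActivatorUtilities.CreateInstance(sp, type),
                    failureStatus: null,
                    tags: attribute.Tags));
            }

            return builder;
        }
    }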

    But can we test it?

    Testing startup extensions is, well, interesting. Technically, it is an integration test, but I did not want to set up Playwright tests to execute the API tests. Why? Well, usually API integration tests are run against a particular configuration, but in this case, I needed to run the reference application with a lot of different configurations in order to fully test the extensions. Enter WebApplicationFactory.

    With WebApplicationFactory, I was able to configure tests to stand up a copy of the reference application with different configurations. I could then verify the configuration using some custom health checks.
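    The pattern looks roughly like this (a sketch with illustrative names, not the real test suite; it assumes the reference app’s Program class is visible to the test project):

    using System.Collections.Generic;
    using System.Net;
    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Hosting;
    using Microsoft.AspNetCore.Mvc.Testing;
    using Microsoft.Extensions.Configuration;
    using Xunit;

    public class HealthEndpointTests : IClassFixture<WebApplicationFactory<Program>>
    {
        private readonly WebApplicationFactory<Program> _factory;

        public HealthEndpointTests(WebApplicationFactory<Program> factory) => _factory = factory;

        [Fact]
        public async Task HealthEndpoint_Returns_Ok_When_Telemetry_Disabled()
        {
            var client = _factory.WithWebHostBuilder(builder =>
            {
                // Override configuration for this test case only.
                builder.ConfigureAppConfiguration((_, config) =>
                    config.AddInMemoryCollection(new Dictionary<string, string?>
                    {
                        ["Telemetry:Enabled"] = "false"
                    }));
            }).CreateClient();

            var response = await client.GetAsync("/health");

            Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        }
    }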

    I am on the fence as to whether this is a “unit” test or an “integration” test. I’m not calling out to any other application, which is usually what defines an integration test. But I did have to configure a reference application in order to get things tested.

    Whatever you call it, I have coverage on my startup extensions, and even caught a few bugs while I was writing the tests.

    Make it truly public?

    Right now, the build publishes the Nuget package to my private Nuget feed. I am debating moving it to Nuget.org (or maybe GitHub’s package feeds). While the code is open source, I want to make the library openly available. But until I make the decision on where to put it, I will keep it in my private feed. If you have any interest in it, watch or star the repo in GitHub: it will help me gauge the level of interest.

  • Supporting a GitHub Release Flow with Azure DevOps Builds

    It has been a busy few months, and with the weather changing, I have a little more time in front of the computer for hobby work. Some of my public projects were in need of a few package updates, so I started down that road. Most of the updates were pretty simple: a few package updates and some Azure DevOps step template updates and I was ready to go. However, I had been delaying my upgrade to GitVersion 6, and in taking that leap, I changed my deployment process slightly.

    Original State

    My current development process supports three environments: test, stage, and production. Commits to feature/* branches are automatically deployed to the test environment, and any builds from main are first deployed to stage and then can be deployed to production.

    For me, this works: I am usually only working on one branch at a time, so publishing feature branches to the test environment works. When I am done with a branch, I merge it into main and get it deployed.

    New State

    As I have been working through some processes at work, it occurred to me that versions are about release, not necessarily commits. While commits can help us number releases, they shouldn’t be the driving force. GitVersion 6 and its new workflow defaults drive this home.

    So my new state would be pretty similar: feature/* branches get deployed to the test environment automatically. The difference lies in main: I no longer want to release with every commit to main. I want to be able to control releases through the use of tags (and GitHub releases, which generate tags).

    So I flipped over to GitVersion 6 and modified my GitVersion.yml file:

    workflow: GitHubFlow/v1
    merge-message-formats:
      pull-request: 'Merge pull request \#(?<PullRequestNumber>\d+) from'
    branches:
      feature:
        mode: ContinuousDelivery

    I modified my build pipeline to always build, but only trigger code release for feature/* branch builds and builds from a tag. I figured this would work fine, but Azure DevOps threw me a curve ball.
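    The gating itself is just a stage condition keyed off Build.SourceBranch. A simplified illustration (the template path is a placeholder for my actual release jobs):

    - stage: Release
      dependsOn: Build
      condition: and(succeeded(), or(startsWith(variables['Build.SourceBranch'], 'refs/heads/feature/'), startsWith(variables['Build.SourceBranch'], 'refs/tags/')))
      jobs:
        - template: templates/release-jobs.yml   # placeholder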

    Azure DevOps Checkouts

    When you build from a tag, Azure DevOps checks that tag out directly, using the /tags/<tagname> branch reference. When I tried to run GitVersion on this, I got a weird version number: a build on tag 1.3.0 resulted in 1.3.1-tags-1-3-0.1.

    I dug into GitVersion’s default configuration and noticed this corresponded with the unknown branch configuration. To get around this Azure DevOps behavior, I had to configure the tags/ branches:

    workflow: GitHubFlow/v1
    merge-message-formats:
      pull-request: 'Merge pull request \#(?<PullRequestNumber>\d+) from'
    branches:
      feature:
        mode: ContinuousDelivery
      tags:
        mode: ManualDeployment
        label: ''
        increment: Inherit
        prevent-increment:
          when-current-commit-tagged: true
        source-branches:
        - main
        track-merge-message: true
        regex: ^tags?[/-](?<BranchName>.+)
        is-main-branch: true

    This treats tags as main branches when calculating the version.

    Caveat Emptor

    This works if you ONLY tag your main branch. If you are in the habit of tagging other branches, this will not work for you. However, I only ever release from main branches, and I am in a fix-forward scenario, so this works for me. If you use release/* branches and need builds from there, you may need additional GitVersion configuration to get the correct version numbers to generate.

  • A Quick WSL Swap

    I have been using WSL and Ubuntu 22.04 a lot more in recent weeks. From virtual environments for Python development to the ability to use Podman to run container images, the tooling supports some of the work I do much better than Windows does.

    But Ubuntu 22.04 is old! I love the predictable LTS releases, but two years is an eternity in software, and I was looking forward to the 24.04 release.

    Upgrade or Fresh Start?

    I looked at a few options for upgrading my existing Ubuntu 22.04 WSL instance, but I really did not like what I read. The guidance basically suggested it was a “try at your own risk” scenario.

    I took a quick inventory of what was actually on my WSL image. As it turns out, not too much. Aside from some of my standard profile settings, I only have a few files that were not available in some of my Github repositories. Additionally, since you can have multiple instances of WSL running, the easiest solution I could find was to stand up a new 24.04 image and copy my settings and files over.

    Is that it?

    Shockingly, yes. Installing 24.04 is as simple as opening it in the Microsoft Store and downloading it. Once that was done, I ran through the quick provisioning to set up the basics, and then copied my profile and files over.

    I was able to utilize scp for most of the copying, although I also realized that I could copy files from Windows using the \\wsl.localhost paths. Either way, it didn’t take very long before I had Ubuntu 24.04 up and running.
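    If you prefer the command line to the Store, the whole dance is only a couple of commands. This is illustrative; distro names and paths will differ for your setup.

    # install the new distro alongside the old one
    wsl --install -d Ubuntu-24.04
    # then copy files between instances from Windows via the UNC paths, e.g.
    #   \\wsl.localhost\Ubuntu-22.04\home\matt  ->  \\wsl.localhost\Ubuntu-24.04\home\matt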

    I still have 22.04 installed, and I haven’t deleted that image just yet. I figure I’ll keep it around for another month and, if I don’t have to turn it back on, I probably don’t need anything on it.

  • Drop that zero…

    I ran into a very weird issue with Nuget packages and the old packages.config reference style.

    Nuget vs Semantic Versioning

    Nuget grew up in Windows, where assembly version numbers support four numbers: major.minor.build.revision. Therefore, NugetVersion supports all four version segments. Semantic versioning, on the other hand, supports three numbers plus additional labels.

    As part of Nuget’s version normalization, in an effort to better support semantic versioning, the fourth version segment is dropped if it is zero. So 1.2.3.0 becomes 1.2.3. In general, this does not present any problems, since the version numbers are retrieved from the feed by the package manager tools and the references are updated accordingly.

    Always use the tools provided

    When you ignore the tooling, well, stuff can get weird. This is particularly true in the old packages.config reference style.

    In that style, packages are listed in a packages.config file, and the .Net project file adds a reference to the DLL with a HintPath. That HintPath includes the folder where the package is installed, something like this:

     <ItemGroup>
        <Reference Include="MyCustomLibrary, Version=1.2.3.4, Culture=neutral, processorArchitecture=MSIL">
          <HintPath>..\packages\MyCustomLibrary.1.2.3.4\lib\net472\MyCustomLibrary.dll</HintPath>
        </Reference>
    </ItemGroup>

    But, for argument’s sake, let us assume we publish a new version of MyCustomLibrary, version 1.2.4. Even though the AssemblyVersion might be 1.2.4.0, the Nuget version will be normalized to 1.2.4. And, instead of upgrading the package using one of the package manager tools, you just update the project file reference manually, like this:

    <ItemGroup>
        <Reference Include="MyCustomLibrary, Version=1.2.4.0, Culture=neutral, processorArchitecture=MSIL">
          <HintPath>..\packages\MyCustomLibrary.1.2.4.0\lib\net472\MyCustomLibrary.dll</HintPath>
        </Reference>
    </ItemGroup>

    This can cause weird issues. It will most likely build with a warning about not being able to find the DLL. Depending on how the package is used or referenced, you may not get a build error (I didn’t get one). But the build did not include the required library.

    Moving on…

    The “fix” is easy: use the Nuget tools (either the CLI or the Visual Studio Package Manager) to update the packages. They will generate the appropriate HintPath for the package that is installed. An even better solution is to migrate to the PackageReference style, where the project file includes the Nuget package references directly and packages.config is not used. This presents immediate errors if an incorrect version is used.
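    For comparison, the PackageReference equivalent keeps only the package ID and version in the project file, so there is no HintPath to drift out of sync (reusing the hypothetical library from the examples above):

    <ItemGroup>
        <PackageReference Include="MyCustomLibrary" Version="1.2.4" />
    </ItemGroup>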

  • Isolating your Azure Functions

    I spent a good bit of time over the last two weeks converting our Azure functions from the in-process to the isolated worker process model. Overall the transition was fairly simple, but there were a few bumps in the proverbial road worth noting.

    Migration Process

    Microsoft Learn has a very detailed How To Guide for this migration. The guide includes steps for updating the project file and references, as well as additional packages that are required based on various trigger types.

    Since I had a number of functions to process, I followed the guide for the first one, and that worked swimmingly. However, then I got lazy and started the “copy-paste” conversion. In that laziness, I missed a particular section of the project file:

    <ItemGroup>
      <Using Include="System.Threading.ExecutionContext" Alias="ExecutionContext"/>
    </ItemGroup>

    Unfortunately, forgetting this does not break your local development environment. However, when you publish to a Function App in Azure, the function will not execute correctly.

    Fixing Dependency Injection

    When using the in-process model, there are some “freebies” that get added to the dependency injection system, as if by magic. ILogger, in particular, could be automatically injected into the function (as a function parameter). In the isolated model, however, you must get ILogger from either the FunctionContext or through dependency injection into the class.

    As part of our conversion, we removed the function parameters for ILogger and replaced them with service instances retrieved through dependency injection at the class level.

    What we did not realize until we got our functions into the test environments was that IHttpContextAccessor was not available in the isolated model. Apparently, that particular interface is available as part of the in-process model automatically, but is not added as part of the isolated model. So we had to add an instance of IHttpContextAccessor to our services collection in the Program.cs file.
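    Our Program.cs ended up looking roughly like this. A trimmed sketch: the real file registers quite a bit more, and ConfigureFunctionsWebApplication assumes the ASP.NET Core integration package for the isolated worker.

    using Microsoft.Extensions.DependencyInjection;
    using Microsoft.Extensions.Hosting;

    var host = new HostBuilder()
        .ConfigureFunctionsWebApplication()
        .ConfigureServices(services =>
        {
            // Not registered automatically in the isolated model, unlike in-process
            services.AddHttpContextAccessor();
        })
        .Build();

    host.Run();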

    It is never easy

    Upgrades or migrations are never just “change this and go.” as much as we try to make it easy, there always seems to be a little change here or there that end up being a fly in the ointment. In our case, we simply assumed that IHttpContextAccessor was there because in-process put it there, and the code which needed that was a few layers deep in the dependency tree. The only way to find it was to make the change and see what breaks. And that is what keeps quality engineers up at night.