Category: Open Source

  • Migrating from MinIO to Garage

    When Open Source Isn’t So Open Anymore

    Sometimes migrations aren’t about chasing the newest technology—they’re about abandoning ship before it sinks. In December 2025, MinIO officially entered “maintenance mode” for its open-source edition, effectively ending active development. Combined with earlier moves like removing the admin UI, discontinuing Docker images, and pushing users toward their $96,000+ AIStor paid product, the writing was on the wall: MinIO’s open-source days were over.

    Time to find a replacement.

    Why I Had to Leave MinIO

    Let’s be clear: MinIO used to be excellent open-source software. Past tense. Over the course of 2025, the company systematically dismantled what made it valuable for home lab and small-scale deployments:

    June 2025: Removed the web admin console from the Community Edition. Features like bucket configuration, lifecycle policies, and account management became CLI-only—or you could pay for AIStor.

    October 2025: Stopped publishing Docker images to Docker Hub. Want to run MinIO? Build it from source yourself.

    December 2025: Placed the GitHub repository in “maintenance mode.” No new features, no enhancements, no pull request reviews. Only “critical security fixes…evaluated on a case-by-case basis.”

    The pattern was obvious: push users toward AIStor, a proprietary product starting at nearly $100k, by making the open-source version progressively less usable. The community called it what it was—a lock-in strategy disguised as “streamlining.”

    I’m not paying six figures for object storage in my home lab. Time to migrate.

    Enter Garage

    I needed S3-compatible storage that was:

    • Actually open source, not “open source until we change our minds”
    • Lightweight, suitable for single-node deployments
    • Actively maintained by a community that won’t pull the rug out

    Garage checked all the boxes. Built in Rust by the Deuxfleurs collective, it’s designed for geo-distributed deployments but scales down beautifully to single-node setups. More importantly, it’s genuinely open source—developed by a collective, not a company with a paid product to upsell.

    The Migration Process

    Vault: The Critical Path

    Vault was the highest-stakes piece of this migration. It’s the backbone of my secrets management, and getting this wrong meant potentially losing access to everything. I followed the proper migration path:

    1. Stopped the Vault pod in my Kubernetes cluster—no live migrations, no shortcuts
    2. Used vault operator migrate to transfer the storage backend from MinIO to Garage—this is the officially supported method that ensures data integrity
    3. Updated the vault-storage-config Kubernetes secret to point at the new Garage endpoint
    4. Restarted Vault and unsealed it with my existing keys

    The vault operator migrate command handled the heavy lifting, ensuring every key-value pair transferred correctly. While I could have theoretically just mirrored S3 buckets and updated configs, using the official migration tool gave me confidence nothing would break in subtle ways later.
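
    For anyone following the same path, the migration config is just a small HCL file with a source and a destination storage stanza. Here is a minimal sketch assuming S3-compatible backends on both sides; the bucket names, URL schemes, and credentials are placeholders, not my actual values:

    # migrate.hcl (values are placeholders)
    storage_source "s3" {
      bucket              = "vault"
      endpoint            = "http://cloud.gerega.net:39000"  # old MinIO endpoint
      access_key          = "MINIO_ACCESS_KEY"
      secret_key          = "MINIO_SECRET_KEY"
      region              = "us-east-1"
      s3_force_path_style = "true"
    }

    storage_destination "s3" {
      bucket              = "vault"
      endpoint            = "http://cloud.gerega.net:3900"   # new Garage endpoint
      access_key          = "GARAGE_ACCESS_KEY"
      secret_key          = "GARAGE_SECRET_KEY"
      region              = "garage"                         # Garage's default region name
      s3_force_path_style = "true"
    }

    # with Vault stopped:
    # vault operator migrate -config=migrate.hcl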

    Monitoring Stack: Configuration Updates

    With Vault successfully migrated, the rest was straightforward. I updated S3 endpoint configurations across my monitoring stack in ops-internal-cluster:

    Loki, Mimir, and Tempo all had their storage backends updated:

    • Old: cloud.gerega.net:39000 (MinIO)
    • New: cloud.gerega.net:3900 (Garage)

    I intentionally didn’t migrate historical metrics and logs. This is a lab environment—losing a few weeks of time-series data just means starting fresh with cleaner retention policies. In production, you’d migrate this data. Here? Not worth the effort.

    Monitoring Garage Itself

    I added a Grafana Alloy scrape job to collect Garage’s Prometheus metrics from its /metrics endpoint. No blind spots from day one—if Garage has issues, I’ll know immediately.

    Deployment Architecture

    One deliberate choice: Garage runs as a single Docker container on bare metal, not in Kubernetes. Object storage is foundational infrastructure. If my Kubernetes clusters have problems, I don’t want my storage backend tied to that failure domain.

    Running Garage outside the cluster means:

    • Vault stores data independently of cluster state
    • Monitoring storage (Loki, Mimir, Tempo) persists during cluster maintenance
    • One less workload competing for cluster resources
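
    The deployment itself is about as minimal as it gets. A sketch of the kind of docker run invocation involved, assuming the official dxflrs/garage image and default ports; the tag, host paths, and config location are illustrative:

    # ports: 3900 = S3 API, 3901 = inter-node RPC, 3903 = admin API / metrics
    docker run -d --name garage --restart unless-stopped \
      -p 3900:3900 -p 3901:3901 -p 3903:3903 \
      -v /etc/garage.toml:/etc/garage.toml \
      -v /srv/garage/meta:/var/lib/garage/meta \
      -v /srv/garage/data:/var/lib/garage/data \
      dxflrs/garage:v1.0.1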

    Verification and Cleanup

    Before decommissioning MinIO, I verified nothing was still pointing at the old endpoints:

    # Searched across GitOps repos
    grep -r "39000" .        # Old MinIO port
    grep -r "192.168.1.30" . # Old MinIO IP
    grep -r "s3.mattgerega.net" .
    

    Clean sweep—everything migrated successfully.

    Current Status

    Garage has been running for about a week now. Resource usage is lower than MinIO ever was, and everything works:

    • Vault sealed/unsealed multiple times without issues
    • Loki ingesting logs from multiple clusters
    • Mimir storing metrics from Grafana Alloy
    • Tempo collecting distributed traces

    The old MinIO instance is still running but idle. I’ll give it another week before decommissioning entirely—old habits die hard, and having a fallback during initial burn-in feels prudent.

    Port 3900 is the new standard. Port 39000 is legacy. And my infrastructure is no longer dependent on a company actively sabotaging its open-source product.

    Lessons for the Homelab Community

    If you’re still running MinIO Community Edition, now’s the time to plan your exit strategy. The maintenance-mode announcement wasn’t a surprise—it was the inevitable conclusion of a year-long strategy to push users toward paid products.

    Alternatives worth considering:

    • Garage: What I chose. Lightweight, Rust-based, genuinely open source.
    • SeaweedFS: Go-based, active development, designed for large-scale deployments but works at small scale.
    • Ceph RGW: If you’re already running Ceph, the RADOS Gateway provides S3 compatibility.

    The MinIO I deployed years ago was a solid piece of open-source infrastructure. The MinIO of 2025 is a bait-and-switch. Learn from my migration—don’t wait until you’re forced to scramble.


    Technical Details:

    • Garage deployment: Single Docker container on bare metal
    • Migration window: ~30 minutes for Vault migration
    • Vault migration method: vault operator migrate CLI command
    • Affected services: Vault, Loki, Mimir, Tempo, Grafana Alloy
    • Data retained: All Vault secrets, new metrics/logs only
    • Repositories: ops-argo, ops-internal-cluster
    • Garage version: Latest stable release as of December 2025


  • Modernizing the Gateway

    From NGINX Ingress to Envoy Gateway

    As with any good engineer, I cannot leave well enough alone. Over the past week, I’ve been working through a significant infrastructure modernization across my home lab clusters – migrating from NGINX Ingress to Envoy Gateway and implementing the Kubernetes Gateway API. This also involved some necessary housekeeping with chart updates and a shift to Server-Side Apply for all ArgoCD-managed resources.

    Why Change?

    The timing couldn’t have been better. In November 2025, the Kubernetes SIG Network and Security Response Committee announced that Ingress NGINX will be retired in March 2026. The project has struggled with insufficient maintainer support, security concerns around configuration snippets, and accumulated technical debt. After March 2026, there will be no further releases, security patches, or bug fixes.

    The announcement strongly recommends migrating to the Gateway API, described as “the modern replacement for Ingress.” This validated what I’d already been considering – the Gateway API provides a more standardized, vendor-neutral approach with better separation of concerns between infrastructure operators and application developers.

    Envoy Gateway, being a CNCF project built on the battle-tested Envoy proxy, seemed like a natural choice for this migration. Plus, it gave me an excuse to finally move off Traefik, which was… well, let’s just say it was time for a change.

    The Migration Journey

    The migration happened in phases across my ops-argo, ops-prod-cluster, and ops-nonprod-cluster repositories. Here’s what changed:

    Phase 1: Adding Envoy Gateway

    I started by adding Envoy Gateway as a cluster tool, complete with its own ApplicationSet that deploys to clusters labeled with spydersoft.io/envoy-gateway: "true". The deployment includes:

    • GatewayClass and Gateway resources: Defined a main gateway that handles traffic routing
    • EnvoyProxy configuration: Set up with a static NodePort service for consistent external access
    • ClientTrafficPolicy: Configured to properly handle forwarded headers – crucial for preserving client IP information through the proxy chain

    The Envoy Gateway deployment lives in the envoy-gateway-system namespace and exposes services via NodePort 30080 and 30443, making it easy to integrate with my existing network setup.
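
    For context, the Gateway resources themselves are short. A trimmed-down sketch of what the GatewayClass and Gateway pair can look like, reduced to a single HTTPS listener; the certificate secret name is illustrative:

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: envoy-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: main
      namespace: envoy-gateway-system
    spec:
      gatewayClassName: envoy-gateway
      listeners:
        - name: https
          protocol: HTTPS
          port: 443
          tls:
            mode: Terminate
            certificateRefs:
              - name: wildcard-cert   # illustrative secret name
          allowedRoutes:
            namespaces:
              from: All   # let HTTPRoutes in app namespaces (like sites) attach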

    Phase 2: Migrating Applications to HTTPRoute

    This was the bulk of the work. Each application needed its Ingress resource replaced with an HTTPRoute. The new Gateway API resources are much cleaner. For example, my blog (www.mattgerega.com) went from an Ingress definition to this:

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: wp-mattgerega
      namespace: sites
    spec:
      parentRefs:
        - name: main
          namespace: envoy-gateway-system
      hostnames:
        - www.mattgerega.com
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /
          backendRefs:
            - name: wp-mattgerega-wordpress
              port: 80
    

    Much more declarative and expressive than the old Ingress syntax.

    I migrated several applications across both production and non-production clusters:

    • Gravitee API Management
    • ProGet (my package management system)
    • n8n and Node-RED instances
    • Linkerd-viz dashboard
    • ArgoCD (which also got a GRPCRoute for its gRPC services)
    • Identity Server (across test and stage environments)
    • Tech Radar
    • Home automation services (UniFi client and IP manager)

    Phase 3: Removing the Old Guard

    Once everything was migrated and tested, I removed the old ingress controller configurations. This cleanup happened across all three repositories:

    ops-prod-cluster:

    • Removed all Traefik configuration files
    • Cleaned up traefik-gateway.yaml and traefik-middlewares.yaml

    ops-nonprod-cluster:

    • Removed Traefik configurations
    • Deleted the RKE2 ingress NGINX HelmChartConfig (rke2-ingress-nginx-config.yaml)

    The cluster-resources directories got significantly cleaner with this cleanup. Good riddance to configuration sprawl.

    Phase 4: Chart Maintenance and Server-Side Apply

    While I was in there making changes, I also:

    • Bumped several Helm charts to their latest versions:
      • ArgoCD: 9.1.5 → 9.1.7
      • External Secrets: 1.1.0 → 1.1.1
      • Linkerd components: 2025.11.3 → 2025.12.1
      • Grafana Alloy: 1.4.0 → 1.5.0
      • Common chart dependency: 4.4.0 → 4.5.0
      • Redis deployments updated across production and non-production
    • Migrated all clusters to use Server-Side Apply (ServerSideApply=true in the syncOptions):
      • All cluster tools in ops-argo
      • Production application sets (external-apps, production-apps, cluster-resources)
      • Non-production application sets (external-apps, cluster-resources)

    This is a better practice for ArgoCD: with Server-Side Apply, the Kubernetes API server performs the merge and tracks field ownership, instead of ArgoCD doing a client-side strategic merge, which reduces conflicts and improves sync reliability.
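
    The change itself is tiny; in each Application or ApplicationSet template it amounts to something like this excerpt:

    # excerpt from an ArgoCD Application (or ApplicationSet template)
    spec:
      syncPolicy:
        syncOptions:
          - ServerSideApply=true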

    Lessons Learned

    Gateway API is ready for production: The migration was surprisingly smooth. The Gateway API resources are well-documented and intuitive. With NGINX Ingress being retired, now’s the time to make the jump.

    HTTPRoute vs. Ingress: HTTPRoute is more expressive and allows for more sophisticated routing rules. The explicit parentRefs concept makes it clear which gateway handles which routes.

    Server-Side Apply everywhere: Should have done this sooner. The improved conflict handling makes ArgoCD much more reliable, especially when multiple controllers touch the same resources.

    Envoy’s configurability: The EnvoyProxy custom resource gives incredible control over the proxy configuration without needing to edit ConfigMaps or deal with annotations.

    Multi-cluster consistency: Making these changes across production and non-production environments simultaneously kept everything aligned and reduced cognitive overhead when switching between environments.

    Current Status

    All applications across all clusters are now running through Envoy Gateway with the Gateway API. Traffic is flowing correctly, TLS is terminating properly, and I’ve removed all the old ingress-related configuration from both production and non-production environments.

    The clusters are more standardized, the configuration is cleaner, and I’m positioned to take advantage of future Gateway API features like traffic splitting and more advanced routing capabilities. More importantly, I’m ahead of the March 2026 retirement deadline with plenty of time to spare.

    Now, the real question: what am I going to tinker with next?

  • Summer Project – Home Lab Refactor

    As with any good engineer, I cannot leave well enough alone. My current rainy day project is reconfiguring my home lab for some much needed updates and simplification.

    What’s Wrong?

    My home lab is, well, still going strong. My automation scripts work well, and I don’t spend a ton of time doing what I need to do to keep things up to date, at least when it comes to my Kubernetes clusters.

    The other servers, however, are in a scary spot. Everything is running on top of the free version of Windows Hyper-V Server from 2019, so general updates are a concern. I would LOVE to move to Windows Server 2025, but I do not have the money for that kind of endeavor.

    The other issue with running Windows Server is that, well, it usually expects a Windows domain (or at least, my version does). This requirement has forced me to run my own domain controllers for a number of years now. Earlier iterations of my lab included a lot of Windows VMs, so the domain helped me manage authentication across them all. But with RKE2 and Kubernetes running the bulk of my workloads, the domain controllers are more hassle than anything right now.

    The Plan

    My current plan is to migrate my home server to Proxmox. It seems a pretty solid replacement for Hyper-V, and has a few features in it that I may use in the future, like using cloud-init for creating new cluster nodes and better management of storage.

    Obviously, this is going to require some testing, and luckily, my old laptop is free for some experimentation. So I installed Proxmox there and messed around, and I came up with an interesting plan.

    • Migrate my VMs to my laptop instance of Proxmox, reducing the workload as much as I can.
    • Install Proxmox on my server
    • Create a Proxmox cluster with my laptop and server as the nodes.
    • Transfer my VMs from the laptop node to the server node.

    Cutting my Workload

    My laptop has a paltry 32GB of RAM, compared to 288GB in my server. While I need to get everything “over” to the laptop, it doesn’t all have to be running at the same time.

    For the Windows VMs, my current plan is as follows:

    • Move my primary domain controller to the laptop, but run at a reduced capacity (1 CPU/2GB).
    • Move my backup DC to the laptop, shut it down.
    • Move and shut down both SQL Server instances: they are only running lab DBs, nothing really vital.

    For my clusters, I’m not actually going to “move” the VMs. I’m going to create new nodes on the laptop’s Proxmox instance, add them to the clusters, and then deprovision the old ones. This gives me some control over what’s there.

    • Non-Production Cluster -> 1 control plane server, 2 agents, but shut them down.
    • Internal Cluster -> 1 control plane server (down from 3), 3 agents, all shut down.
    • Production Cluster -> 1 control plane (down from 3), 2 agents, running vital software. I may need to migrate my HC Vault instance to the production cluster just to ensure secrets stay up and running.

    With this setup, I should really only have 4 VMs running on my laptop, which it should be able to handle. Once that’s done, I’ll have time to install and configure Proxmox on the server, and then move VMs from the laptop to the server.

    Lots to do

    I have a lot of learning to do. Proxmox seems pretty simple to start, but I find I’m having to read a lot about the cloning and cloud-init pieces to really make use of the power of the tool.
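
    From what I’ve read so far, the cloud-init workflow boils down to building a template once and cloning it per node. A rough sketch I’m working from, with illustrative VM IDs, storage names, and image file (I have not battle-tested this yet):

    # build a reusable cloud-init template from an Ubuntu cloud image
    qm create 9000 --name ubuntu-cloud --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0
    qm importdisk 9000 noble-server-cloudimg-amd64.img local-lvm
    qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0
    qm set 9000 --ide2 local-lvm:cloudinit --boot order=scsi0 --serial0 socket --vga serial0
    qm template 9000

    # clone it for a new cluster node and inject settings via cloud-init
    qm clone 9000 201 --name rke2-agent-01 --full
    qm set 201 --ipconfig0 ip=dhcp --sshkeys ~/.ssh/id_ed25519.pub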

    Once I feel comfortable with Proxmox, the actual move will need to be scheduled… So, maybe by Christmas I’ll actually have this done.

  • ArgoCD panicked a little…

    I ran into an odd situation last week with ArgoCD, and it took a bit of digging to figure it out. Hopefully this helps someone else along the way.

    Whatever you do, don’t panic!

    Well, unless of course you are ArgoCD.

    I have a small Azure DevOps job that runs nightly and attempts to upgrade some of the Helm charts that I use to deploy external tools. This includes things like Grafana, Loki, Mimir, Tempo, ArgoCD, External Secrets, and many more. This job deploys the changes to my GitOps repositories, and if there are changes, I can manually sync.

    Why not auto-sync, you might ask? Visibility, mostly. I like to see what changes are being applied, in case there is something bigger in the changes that needs my attention. I also like to “be there” if something breaks, so I can rollback quickly.

    Last week, while upgrading Grafana and Tempo, ArgoCD started throwing the following error on sync:

    Recovered from panic: runtime error: invalid memory address or nil pointer

    A quick trip to Google produced a few different results, but nothing immediately apparent. One particular issue mentioned a problem with out-of-date resources (an old apiVersion). Let’s put a pin in that.

    Nothing was jumping out, and my deployments were still working. I had a number of other things on my plate, so I let this slide for a few days.

    Versioning….

    When I finally got some time to dig into this, I figured I would pull at that apiVersion string and see what shook loose. Unfortunately, since there is no good error indicating which resource is causing the panic, it was luck of the draw as to whether or not I found the offender. This time, I was lucky.

    My ExternalSecret resources were using some alpha versions, so my first thought was to update them to the v1 version. Lo and behold, that fixed the two charts which were failing.
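
    For anyone hitting the same thing, the fix was nothing more than bumping the apiVersion on the ExternalSecret resources, shown here in isolation:

    # before
    apiVersion: external-secrets.io/v1alpha1
    kind: ExternalSecret

    # after
    apiVersion: external-secrets.io/v1
    kind: ExternalSecret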

    This, however, leads to a bigger issue: if ArgoCD is not going to tell me when I have out-of-date apiVersion values for a resource, I am going to have to figure out how to validate these resources before I commit the changes. I’ll put this on my ever-growing to-do list.
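
    One option I’m considering is rendering the charts locally and running the output through a schema validator like kubeconform before committing. A rough sketch (the chart path is illustrative, and CRD-backed kinds need extra schemas or get skipped):

    # render the chart and validate the resulting manifests
    helm template my-release ./charts/my-chart \
      | kubeconform -strict -summary -ignore-missing-schemas -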

  • Platform Engineering

    As I continue to build out some reference architecture applications, I realized that there was a great deal of boilerplate code that I add to my APIs to get things running. Time for a library!

    Enter the “Platform”

    I am generally terrible at naming things, but Spydersoft.Platform seemed like a good base namespace for this one. The intent is to put the majority of my boilerplate code into a set of libraries that can be referenced to make adding stuff easier.

    But, what kind of “stuff?” Well, for starters:

    • Support for OpenTelemetry trace, metrics, and logging
    • Serilog for console logging
    • Simple JWT identity authentication (for my APIs)
    • Default Health Check endpoints

    Going deep with Health Checks

    The first three were pretty easy: just some POCOs for options and then startup extensions to add the necessary items with the proper configuration. With health checks, however, I went a little overboard.

    My goal was to be able to implement IHealthCheck anywhere and decorate it in such a way that it would be added to the health check framework and could be tagged. Furthermore, I wanted to use tags to drive standard endpoints.

    In the end, I used a custom attribute and some reflection to add the checks that are found in the loaded AppDomain. I won’t bore you: the documentation should do that just fine.

    But can we test it?

    Testing startup extensions is, well, interesting. Technically, it is an integration test, but I did not want to set up Playwright tests to execute the API tests. Why? Well, usually API integration tests are run against a particular configuration, but in this case, I needed to run the reference application with a lot of different configurations in order to fully test the extensions. Enter WebApplicationFactory.

    With WebApplicationFactory, I was able to configure tests to stand up a copy of the reference application with different configurations. I could then verify the configuration using some custom health checks.

    I am on the fence as to whether or not this is a “unit” test or an “integration” test. I’m not calling out to any other application, which is usually the definition of an integration test. But I did have to configure a reference application in order to get things tested.

    Whatever you call it, I have coverage on my startup extensions, and even caught a few bugs while I was writing the tests.

    Make it truly public?

    Right now, the build publishes the NuGet package to my private NuGet feed. I am debating moving it to NuGet.org (or maybe GitHub’s package feeds). The code is already open source, and I want to make the library openly available as well. But until I make the decision on where to put it, I will keep it in my private feed. If you have any interest in it, watch or star the repo in GitHub: it will help me gauge the level of interest.

  • Tech Tip – Formatting External Secrets in Helm

    This has tripped me up a lot, so I figure it is worth a quick note.

    The Problem

    I use Helm charts to define the state of my cluster in a Git repository, and ArgoCD to deploy those charts. This allows a lot of flexibility in my deployments and configuration.

    For secrets management, I use External Secrets to populate secrets from Hashicorp Vault. In many of those cases, I need to use the templating functionality of External Secrets to build secrets that can be used from external charts. A great case of this is populating user secrets for the RabbitMQ chart.

    In the link above, you will notice the templates/default-user-secrets.yaml file. This file is meant to generate a Kubernetes Secret resource which is then referenced by the RabbitMqCluster resource (templates/cluster.yaml). This secret is mounted as a file and, therefore, needs some custom formatting. So I used the template property to format the secret:

    template:
      type: Opaque
      engineVersion: v2
      data:
        default_user.conf: |
            default_user={{ `{{ .username  }}` }}
            default_pass={{ `{{ .password  }}` }}
        host: {{ .Release.Name }}.rabbitmq.svc
        password: {{`"{{ .password }}"`}}
        port: "5672"
        provider: rabbitmq
        type: rabbitmq
        username: {{`"{{ .username }}"`}}

    Notice in the code above the duplicated {{ and }} around the username/password values. These are necessary to ensure that the template is properly set in the ExternalSecret resource.

    But, Why?

    It has to do with templating. Helm uses golang templates to process the templates and create resources. Similarly, the ExternalSecrets template engine uses golang templates. When you have a “template in a template”, you have to somehow tell the processor to put the literal value in.

    Let’s look at one part of this file.

      default_user={{ `{{ .username  }}` }}

    What we want to end up in the ExternalSecret template is this:

    default_user={{ .username  }}

    So, in order to do that, we have to tell the Helm template to write {{ .username }} as written, not process it as a golang template. In this case, we wrap the inner template in backticks (`), which golang treats as a raw string literal, so Helm emits the contents verbatim and the backticks themselves never appear in the output. Notice that other areas wrap the raw string in double quotes (") so the quotes do end up in the rendered template.

    password: {{`"{{ .password }}"`}}

    This will generate the quotes in the resulting template:

    password: "{{ .password }}"

    If you need a single quote, use the same pattern, but replace the double quote with a single quote (').

    username: {{`'{{ .username }}'`}}

    For whatever it is worth, VS Code’s YAML parser did not like that version at all. Since I have not run into a situation where I need a single quote, I use double quotes if quotes are required, and backticks if they are not.

  • Spoolman for Filament Management

    “You don’t know what you’ve got ’til it’s gone” is a great song line, but a terrible inventory management approach. As I started to stock up on filament for the 3D printer, it occurred to me that I need a way to track my inventory.

    The Community Comes Through

    I searched around for different filament management solutions and landed on Spoolman. It seemed a pretty solid fit for what I needed. The owner also configured builds for container images, so it was fairly easy to configure a custom chart to run an instance on my internal tools cluster.

    The client UI is pretty easy to use, and the ability to add extra fields to the different modules makes the solution very extensible. I was immediately impressed and started entering information about vendors, filaments, and spools.

    Enhancing the Solution

    Since I am using a Bambu Lab printer and Bambu Studio, I do not have the ability to integrate Bambu into Spoolman to report filament usage. I searched around, but it does not seem that the Bambu ecosystem reports such usage.

    My current plan for managing filament is to weigh the spool when I open it, and then weigh it again after each use. That difference is the amount of filament I have used. But, to calculate the amount remaining, I need to know the weight of an empty spool. Assuming most manufacturers use the same spools, that shouldn’t be too hard to figure out long term.

    Spoolman is not quite set up for that type of usage. Weight and spool weight are set at the filament level and cannot be overridden at the spool level. Most spools will not hold exactly 1000g of filament, so the ability to track initial weight at the spool level is critical. Additionally, I want to support partial spools, including re-spooling.

    So, using all the Python I have learned recently, I took a crack at updating the API and UI to support this very scenario. In a “do no harm” type of situation, I made sure that I had all the integration tests running correctly, then went about adding the new fields and some of the new default functionality. After I had the updated functionality in place, I added a few new integration tests to verify my work.

    Oddly, as I started working on it, I found four existing feature requests that were related to the changes I was suggesting. It took me a few nights, but I generated a pull request for the changes.

    And Now, We Wait…

    With my PR in place, I wait. The beauty of open source is that anyone can contribute, but the owners have the final say. This also means the owners need to respond, and most owners aren’t doing this as a full time job. So sometimes, there isn’t anything to do but wait.

    I’m hopeful that my changes will be accepted, but for now, I’m using Spoolman as-is, and just doing some of the “math” myself. It is definitely helping me keep track of my filament, and I’m keeping an eye on possible integrations with the Bambu ecosystem.

  • Updated Site Monitoring

    What seemed like forever ago, I put together a small project for simple site monitoring. My md-to-conf work enhanced my Python skills, and I thought it would be a good time to update the monitoring project.

    Housekeeping!

    First things first: I transferred the repository from my personal GitHub account to the spydersoft-consulting organization. Why? Separation of concerns, mostly. Since I fork open source repositories into my personal account, I do not want the open source projects I am publishing to be mixed in with those forks.

    After that, I went through the process of converting my source to a package with GitHub Actions to build and publish to PyPI.org. I also added testing, formatting, and linting, copying settings and actions from the md_to_conf project.

    Oh, SonarQube

    Adding the linting with SonarQube added a LOT of new warnings and errors. Everything from long lines to bad variable names. Since my build process does not succeed if those types of things are found, I went through the process of fixing all those warnings.

    The variable naming ones were a little difficult, as some of my classes mapped to the configuration file serialization. That meant that I had to change my configuration files as well as the code. I went through a few iterations, as I missed some.

    I also had to add a few tests, just so that the tests and coverage scripts get run. Could I have omitted the tests entirely? Sure. But a few tests to read some sample configuration files never hurt anyone.

    Complete!

    I got everything renamed and building pretty quickly, and added my PyPI.org API token to the repository for the actions. I quickly provisioned a new analysis project in SonarCloud and merged everything into main. I then created a new GitHub release, which triggered a new publish to PyPI.org.

    Setting up the Raspberry Pi

    The last step was to get rid of the code on the Raspberry Pi, and use pip to install the package. This was relatively easy, with a few caveats.

    1. Use pip3 install instead of pip – I forgot the old Pi has both Python 2 and 3 installed.
    2. Fix the config files – I had to change my configuration file to reflect the variable name changes.
    3. Change the cron job – This one needs a little more explanation.

    For the last one, when changing the cron job, I had to point specifically to /usr/local/bin/pi-monitor, since that’s where pip installed it. My new cron job looks like this:

    SHELL=/bin/bash
    
    */5 * * * * pi cd /home/pi && /usr/local/bin/pi-monitor -c monitor.config.json 2>&1 | /usr/bin/logger -t PIMONITOR

    That runs the application and logs everything to syslog with the PIMONITOR tag.

    Did this take longer than I expected? Yeah, a little. Is it nice to have another open source project in my portfolio? Absolutely. Check it out if you are interested!

  • Terraform Azure AD

    Over the last week or so, I realized that while I bang the drum of infrastructure as code very loudly, I have not been practicing it at home. I took some steps to remedy that over the weekend.

    The Goal

    I have a fairly meager home presence in Azure. Primarily, I use a free version of Azure Active Directory (now Entra ID) to allow for some single sign-on capabilities in external applications like Grafana, MinIO, and ArgoCD. The setup for this differs greatly among the applications, but common to all of these is the need to create applications in Azure AD.

    My goal is simple: automate provisioning of this Azure AD account so that I can manage these applications in code. My stretch goal was to get any secrets created as part of this process into my Hashicorp Vault instance.

    Getting Started

    The plan, in one word, is Terraform. Terraform has a number of providers, including both the azuread and vault providers. Additionally, since I have some experience in Terraform, I figured it would be a quick trip.

    I started by installing all the necessary tools (specifically, the Vault CLI, the Azure CLI, and the Terraform CLI) in my WSL instance of Ubuntu. Why there instead of PowerShell? Most of the tutorials and such lean towards bash syntax, so it was a bit easier to roll through them without having to convert bash into PowerShell.

    I used my ops-automation repository as the source for this, and started by creating a new folder structure to hold my projects. As I anticipated more Terraform projects to come up, I created a base terraform directory, and then an azuread directory under that.

    Picking a Backend

    Terraform relies on state storage, and it uses the term backend to describe this storage. By default, Terraform uses a local file backend. This is great for development, but knowing that I wanted to get things running in Azure DevOps immediately, I decided that I should configure a backend that I can use from my machine as well as from my pipelines.

    As I have been using MinIO pretty heavily for storage, it made the most sense to configure MinIO as the backend, using the S3 backend to do this. It was “fairly” straightforward, as soon as I turned off all the nonsense:

    terraform {
      backend "s3" {
        skip_requesting_account_id  = true
        skip_credentials_validation = true
        skip_metadata_api_check     = true
        skip_region_validation      = true
        use_path_style              = true
        bucket                      = "terraform"
        key                         = "azuread/terraform.tfstate"
        region                      = "us-east-1"
      }
    }

    There are some obvious things missing: I am setting environment variables for values I would like to treat as secret, or at least not public.

    • MinIO Endpoint -> AWS_ENDPOINT_URL_S3 environment variable instead of endpoints.s3
    • Access Key -> AWS_ACCESS_KEY_ID environment variable instead of access_key
    • Secret Key -> AWS_SECRET_ACCESS_KEY environment variable instead of secret_key

    These settings allow me to use the same storage for both my local machine and the Azure Pipeline.

    Configuring Azure AD

    Likewise, I needed to configure the azuread provider. I followed the steps in the documentation, choosing the environment variable route again. I configured a service principal in Azure and gave it the necessary access to manage my directory.

    Using environment variables allows me to set these from variables in Azure DevOps, meaning my secrets are stored in ADO (or Vault, or both…. more on that in another post).
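
    For the curious, the azuread provider picks its service principal credentials up from the standard ARM_* environment variables, so locally it’s just a matter of exporting them (values redacted and illustrative here):

    export ARM_TENANT_ID="00000000-0000-0000-0000-000000000000"
    export ARM_CLIENT_ID="00000000-0000-0000-0000-000000000000"
    export ARM_CLIENT_SECRET="<service principal secret>"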

    Importing Existing Resources

    I have a few resources that already exist in my Azure AD instance, enough that I didn’t want to re-create them and then re-configure everything that uses them. Luckily, most Terraform providers allow for importing existing resources, and most of the resources I have support this feature.

    Importing is fairly simple: you create the simplest definition of a resource that you can, and then run a terraform import variant to import that resource into your project’s state. Importing an Azure AD Application, for example, looks like this:

    terraform import azuread_application.myapp /applications/<object-id>

    It is worth noting that the provider is looking for the object-id, not the client ID. The provider documentation has information as to which ID each resource uses for import.

    More importantly, Applications and Service Principals are different resources in Azure AD, even though they map pretty much one-to-one. To import a Service Principal, you run a similar command:

    terraform import azuread_service_principal.myprincipal <sp-id>

    But where is the service principal’s ID? I had to go to the Azure CLI to get that info:

    az ad sp list --display-name myappname

    From this JSON, I grabbed the id value and used that to import.

    From here, I ran a terraform plan to see what was going to be changed. I took a look at the differences, and even added some properties to the terraform files to maintain consistency between the app and the existing state. I ended up with a solid project full of Terraform files that reflected my current state.

    Automating with Azure DevOps

    There are a few extensions available to add Terraform tasks to Azure DevOps. Sadly, most rely on “standard” configurations for authentication against the backends. Since I’m using an S3 compatible backend, but not S3, I had difficulty getting those extensions to function correctly.

    As the Terraform CLI is installed on my build agent, though, I only needed to run my commands from a script. I created an ADO template pipeline (planning for expansion) and extended it to create the pipeline.

    All of the environment variables in the template are reflected in the variable groups defined in the extension. If a variable is not defined, it’s simply blank. That’s why you will see the AZDO_ environment variables in the template, but not in the variable groups for the Azure AD provisioning.

    Stretch: Adding Hashicorp Vault

    Adding HC Vault support was somewhat trivial, but another exercise in authentication. I wanted to use AppRole authentication for this, so I followed the vault provider’s instructions and added additional configuration to my provider. Note that this setup requires additional variables that now need to be set whenever I do a plan or import.
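
    The provider block ended up looking roughly like this; the address and variable names are illustrative, and the role/secret IDs come in as variables (or environment variables) declared elsewhere:

    provider "vault" {
      address = "https://vault.example.net"   # illustrative

      auth_login {
        path = "auth/approle/login"

        parameters = {
          role_id   = var.approle_role_id
          secret_id = var.approle_secret_id
        }
      }
    }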

    Once that was done, I had access to read and write values in Vault. I started by storing my application passwords in a new key vault. This allows me to have application passwords that rotate weekly, which is a nice security feature. Unfortunately, the rest of my infrastructure isn’t quite set up to handle that kind of change. At least, not yet.

  • Automating Grafana Backups

    After a few data loss events, I took the time to automate my Grafana backups.

    A bit of instability

    It has been almost a year since I moved to a MySQL backend for Grafana. In that year, I’ve gotten a corrupted MySQL database twice now, forcing me to restore from a backup. I’m not sure if it is due to my setup or bad luck, but twice in less than a year is too much.

    In my previous post, I mentioned the Grafana backup utility as a way to preserve this data. My short-sightedness prevented me from automating those backups, however, so I suffered some data loss. After the most recent event, I revisited the backup tool.

    Keep your friends close…

    My first thought was to simply write a quick Azure DevOps pipeline to pull the tool down, run a backup, and copy it to my SAN. I would also have had to include some scripting to clean up old backups.

    As I read through the grafana-backup-tool documents, though, I came across examples of running the tool as a Job in Kubernetes via a CronJob. This presented a very unique opportunity: configure the backup job as part of the Helm chart.

    What would that look like? Well, I do not install any external charts directly. They are configured as dependencies for charts of my own. Now, usually, that just means a simple values file that sets the properties on the dependency. In the case of Grafana, though, I’ve already used this functionality to add two dependent charts (Grafana and MySQL) to create one larger application.

    This setup also allows me to add additional templates to the Helm chart to create my own resources. I added two new resources to this chart:

    1. grafana-backup-cron – A definition for the cronjob, using the ysde/grafana-backup-tool image.
    2. grafana-backup-secret-es – An ExternalSecret definition to pull secrets from Hashicorp Vault and create a Secret for the job.

    Since this is all built as part of the Grafana application, the secrets for Grafana were already available. I went so far as to add a section in the values file for the backup. This allowed me to enable/disable the backup and update the image tag easily.
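
    Trimmed down, the CronJob template looks something like this; the schedule, image tag, and secret name are illustrative, and the real thing pulls those from the values file:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: grafana-backup-cron
    spec:
      schedule: "0 3 * * *"          # daily backup
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: grafana-backup
                  image: ysde/docker-grafana-backup-tool:latest   # pin a real tag in practice
                  envFrom:
                    - secretRef:
                        name: grafana-backup-secret   # created by the ExternalSecret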

    Where to store it?

    The other enhancement I noticed in the backup tool was the ability to store files in S3 compatible storage. In fact, their example showed how to connect to a MinIO instance. As fate would have it, I have a MinIO instance running on my SAN already.

    So I configured a new bucket in my MinIO instance, added a new access key, and configured those secrets in Vault. After committing those changes and synchronizing in ArgoCD, the new resources were there and ready.

    Can I test it?

    Yes I can. Google, once again, pointed me to a way to create a Job from a CronJob:

    kubectl create job --from=cronjob/<cronjob-name> <job-name> -n <namespace-name>

    I ran the above command to create a test job. And, voilà, I have backup files in MinIO!

    Cleaning up

    Unfortunately, there doesn’t seem to be a retention setting in the backup tool. It looks like I’m going to have to write some code to clean up my Grafana backups bucket, especially since I have daily backups scheduled. Either that, or look at this issue and see if I can add it to the tool. Maybe I’ll dust off my Python skills…