Discovering 124 devices in my “simple” home network
I thought I knew my home network. I had a router, some switches, a few VLANs that made sense at the time, and everything just… worked. Until the day I decided to actually document what I had.
Turns out, I didn’t know my network at all.
The Discovery
I fired up the UniFi controller expecting to see maybe 40-50 devices. You know, the usual suspects: phones, laptops, smart home devices, maybe a few Raspberry Pis. The controller reported 124 active devices.
*One hundred and twenty-four.*
I immediately had questions. Important questions like “what the hell is ubuntu-server-17?” and “why do I have *seventeen* devices all named ubuntu-server?”
The Forensics Begin
Armed with an AI agent and a growing sense of dread, I started the archaeological dig. The results were… enlightening:
The Good:
5 security cameras actually recording to my NAS
A functioning Kubernetes cluster (three of them, actually)
Two Proxmox hosts quietly doing their job
The Bad:
17 identical ubuntu-server instances (spoiler: they were old SQL Server experiments)
Devices with names like Unknown-b0:8b:a8:40:16:b6 (which turned out to be my Levoit air purifier)
Four SSIDs serving the same flat network because… reasons?
The Ugly:
Everything on VLAN 1
No segmentation whatsoever
My security cameras had full access to my file server
My IoT devices could theoretically SSH into my Proxmox hosts
The Uncomfortable Truths
I had built this network over years, making pragmatic decisions that made sense *at the time*. Need another VM? Spin it up on VLAN 1. New smart device? Connect it to the existing SSID. Another Raspberry Pi project? You guessed it—VLAN 1.
The result was a flat network that looked like a child had organized my sock drawer: functional, but deeply concerning to anyone who knew what they were looking at.
The Breaking Point
Two things finally pushed me to action:
1. The Device Census: After identifying and cleaning up the obvious cruft, I still had 77 active devices with zero network segmentation.
2. The “What If” Scenario: What if one of my IoT devices got compromised? It would have unfettered access to everything. My NAS. My Proxmox hosts. My Kubernetes clusters. Everything.
I couldn’t just clean up the device list and call it done. I needed actual network segmentation. Zone-based firewalls. The works. The plan shook out into four phases:
First: The device audit and cleanup you just read about
Second: VLAN infrastructure and zone-based firewall policies
Third: Device-by-device migration with minimal disruption
Fourth: The scary part—migrating my Kubernetes clusters without breaking everything
I’ll be documenting the journey here, including the inevitable mistakes, late-night troubleshooting sessions, and that special moment when you realize you’ve locked yourself out of your own network.
Because if there’s one thing I’ve learned from this experience, it’s that home networks are never as simple as you think they are.
This is Part 1 of a series on rebuilding my home network from the ground up. Next up: Why “G-Unit” became my SSID naming scheme, and how zone-based firewalls changed everything.
Running multiple Kubernetes clusters is great until you realize your telemetry traffic is taking an unnecessarily complicated path. Each cluster had its own Grafana Alloy instance dutifully collecting metrics, logs, and traces—and each one was routing through an internal Nginx reverse proxy to reach the centralized observability platform (Loki, Mimir, and Tempo) running in my internal cluster.
This worked, but it had that distinct smell of “technically functional” rather than “actually good.” Traffic was staying on the internal network (thanks to a shortcut DNS entry that bypassed Cloudflare), but why route through an Nginx proxy when the clusters could talk directly to each other? Why maintain those external service URLs when all my clusters are part of the same infrastructure?
Linkerd multi-cluster seemed like the obvious answer for establishing direct cluster-to-cluster connections, but the documentation leaves a lot unsaid when you’re dealing with on-premises clusters without fancy load balancers. Here’s how I made it work.
The Problem: Telemetry Taking the Scenic Route
My setup looked like this:
– Internal cluster: Running Loki, Mimir, and Tempo behind an Nginx gateway
– Production cluster: Grafana Alloy sending telemetry to loki.mattgerega.net, mimir.mattgerega.net, etc.
– Nonproduction cluster: Same deal, different tenant ID
Every metric, log line, and trace span was leaving the cluster, hitting the Nginx reverse proxy, and finally making it to the monitoring services—which were running in a cluster on the same physical network. The inefficiency was bothering me more than it probably should have.
This meant:
– An unnecessary hop through the Nginx proxy layer
– Extra TLS handshakes that didn’t add security value between internal services
– DNS resolution for external service names when direct cluster DNS would suffice
– One more component in the path that could cause issues
The Solution: Hub-and-Spoke with Linkerd Multi-Cluster
Linkerd’s multi-cluster feature does exactly what I needed: it mirrors services from one cluster into another, making them accessible as if they were local. The service mesh handles all the mTLS authentication, routing, and connection management behind the scenes. From the application’s perspective, you’re just calling a local Kubernetes service.
For my setup, a hub-and-spoke topology made the most sense. The internal cluster acts as the hub—it runs the Linkerd gateway and hosts the actual observability services (Loki, Mimir, and Tempo). The production and nonproduction clusters are spokes—they link to the internal cluster and get mirror services that proxy requests back through the gateway.
The beauty of this approach is that only the hub needs to run a gateway. The spoke clusters just run the service mirror controller, which watches for exported services in the hub and automatically creates corresponding proxy services locally. No complex mesh federation, no VPN tunnels, just straightforward service-to-service communication over mTLS.
Gateway Mode vs. Flat Network
(Spoiler: Gateway Mode Won)
Linkerd offers two approaches for multi-cluster communication:
Flat Network Mode: Assumes pod networks are directly routable between clusters. Great if you have that. I don’t. My three clusters each have their own pod CIDR ranges with no interconnect.
Gateway Mode: Routes cross-cluster traffic through a gateway pod that handles the network translation. This is what I needed, but it comes with some quirks when you’re running on-premises without a cloud load balancer.
The documentation assumes you’ll use a LoadBalancer service type, which automatically provisions an external IP. On-premises? Not so much. I went with NodePort instead, exposing the gateway on port 30143.
The Configuration: Getting the Helm Values Right
Here’s what the internal cluster’s Linkerd multi-cluster configuration looks like:
```yaml
linkerd-multicluster:
  gateway:
    enabled: true
    port: 4143
    serviceType: NodePort
    nodePort: 30143
    probe:
      port: 4191
      nodePort: 30191
  # Grant access to service accounts from other clusters
  remoteMirrorServiceAccountName: linkerd-service-mirror-remote-access-production,linkerd-service-mirror-remote-access-nonproduction
```
And for the production/nonproduction clusters:
```yaml
linkerd-multicluster:
  gateway:
    enabled: false  # No gateway needed here
  remoteMirrorServiceAccountName: linkerd-service-mirror-remote-access-in-cluster-local
```
The Link: Connecting Clusters Without Auto-Discovery
Creating the cluster link was where things got interesting. The standard `linkerd multicluster link` command assumes it can auto-discover the gateway address from a LoadBalancer service, which doesn’t exist in my NodePort setup. Separating `--gateway-addresses` and `--gateway-port` made all the difference.
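Roughly, the link command ends up looking like the sketch below. The cluster name, kubectl context, and DNS name are the ones from this post, but treat the exact invocation as an approximation rather than a copy of my scripts:

```bash
# Generate link resources against the hub (internal) cluster, pointing explicitly at the
# NodePort gateway, then apply them to a spoke cluster (context name is illustrative)
linkerd multicluster link \
  --cluster-name internal \
  --gateway-addresses tfx-internal.gerega.net \
  --gateway-port 30143 \
  | kubectl --context=production apply -f -
```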
I used DNS (tfx-internal.gerega.net) instead of hard-coded IPs for the gateway address. This is an internal DNS entry that round-robins across all agent node IPs in the internal cluster. The key advantage: when I cycle nodes (stand up new ones and destroy old ones), the DNS entry is maintained automatically. No manual updates to cluster links, no stale IP addresses, no coordination headaches—the round-robin DNS just picks up the new node IPs and drops the old ones.
Service Export: Making Services Visible Across Clusters
Linkerd doesn’t automatically mirror every service. You have to explicitly mark which services should be exported using the mirror.linkerd.io/exported: "true" label.
For the Loki gateway (and similarly for Mimir and Tempo):
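As an illustration, here’s roughly what the exported service looks like on the hub cluster. In my case the label is applied through Helm values rather than a raw manifest, and the service name, namespace, selector, and ports below are assumptions:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: loki-gateway
  namespace: loki
  labels:
    mirror.linkerd.io/exported: "true"   # this label is what triggers mirroring
spec:
  selector:
    app.kubernetes.io/name: loki
    app.kubernetes.io/component: gateway
  ports:
    - name: http
      port: 80
      targetPort: 8080
```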
The final piece was updating Grafana Alloy’s configuration to use the mirrored services instead of the external URLs. Here’s the before and after for Loki:
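Sketched in Alloy’s configuration syntax, and assuming the link is named internal so the mirrored service shows up as loki-gateway-internal in the loki namespace (the tenant ID is illustrative):

```alloy
// Before: external URL via the Nginx reverse proxy
loki.write "default" {
  endpoint {
    url       = "https://loki.mattgerega.net/loki/api/v1/push"
    tenant_id = "production"
  }
}

// After: the Linkerd mirror service, resolved by cluster DNS
loki.write "default" {
  endpoint {
    url       = "http://loki-gateway-internal.loki.svc.cluster.local/loki/api/v1/push"
    tenant_id = "production"
  }
}
```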
No more TLS, no more public DNS, no more reverse proxy hops. Just a direct connection through the Linkerd gateway.
But wait—there’s one more step.
The Linkerd Injection Gotcha
Grafana Alloy pods need to be part of the Linkerd mesh to communicate with the mirrored services. Without the Linkerd proxy sidecar, the pods can’t authenticate with the gateway’s mTLS requirements.
This turned into a minor debugging adventure because I initially placed the `podAnnotations` at the wrong level in the Helm values. The Grafana Alloy chart is a wrapper around the official chart, which means the structure is:
```yaml
alloy:
  controller:   # Not alloy.alloy!
    podAnnotations:
      linkerd.io/inject: enabled
  alloy:
    # ... other config
```
Once that was fixed and the pods restarted, they came up with 3 containers instead of 2:
– `linkerd-proxy` (the magic sauce)
– `alloy` (the telemetry collector)
– `config-reloader` (for hot config reloads)
Checking the gateway logs in the `linkerd-multicluster` namespace confirmed traffic was flowing.
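If you want to verify the same thing, these are the checks I’d reach for (the deployment and namespace names below are what a default multicluster install creates, so adjust if yours differ):

```bash
# Tail the gateway's proxy logs on the hub cluster
kubectl -n linkerd-multicluster logs deploy/linkerd-gateway -c linkerd-proxy --tail=50

# List gateways as seen from a spoke cluster; in NodePort mode the probe status shows
# as unreachable (see the quirk below) even though traffic flows fine
linkerd multicluster gateways
```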
There’s one quirk worth mentioning: the multi-cluster probe health checks don’t work in NodePort mode. The service mirror controller tries to check the gateway’s health endpoint and reports it as unreachable, even though service mirroring works perfectly.
From what I can tell, this is because the health check endpoint expects to be accessed through the gateway service, but NodePort doesn’t provide the same service mesh integration as a LoadBalancer. The practical impact? None. Services mirror correctly, traffic routes successfully, mTLS works. The probe check just complains in the logs.
What I Learned
1. Gateway mode is essential for non-routable pod networks. If your clusters don’t have a CNI that supports cross-cluster routing, gateway mode is the way to go.
2. NodePort works fine for on-premises gateways. You don’t need a LoadBalancer if you’re willing to manage DNS.
3. DNS beats hard-coded IPs. Using `tfx-internal.gerega.net` means I can recreate nodes without updating cluster links.
4. Service injection is non-negotiable. Pods must be part of the Linkerd mesh to access mirrored services. No injection, no mTLS, no connection.
5. Helm values hierarchies are tricky. Always check the chart templates when podAnnotations aren’t applying. Wrapper charts add extra nesting.
The Result
Telemetry now flows directly from production and nonproduction clusters to the internal observability stack through Linkerd’s multi-cluster gateway—all authenticated via mTLS, bypassing the Nginx reverse proxy entirely.
I didn’t reduce the number of monitoring stacks (each cluster still runs Grafana Alloy for collection), but I simplified the routing by using direct cluster-to-cluster connections instead of going through the Nginx proxy layer. No more proxy hops. No more external service DNS. Just three Kubernetes clusters talking to each other the way they should have been all along.
The full configuration is in the ops-argo and ops-internal-cluster repositories, managed via ArgoCD ApplicationSets. Because if there’s one thing I’ve learned, it’s that GitOps beats manual kubectl every single time.
Sometimes migrations aren’t about chasing the newest technology—they’re about abandoning ship before it sinks. In December 2025, MinIO officially entered “maintenance mode” for its open-source edition, effectively ending active development. Combined with earlier moves like removing the admin UI, discontinuing Docker images, and pushing users toward their $96,000+ AIStor paid product, the writing was on the wall: MinIO’s open-source days were over.
Time to find a replacement.
Why I Had to Leave MinIO
Let’s be clear: MinIO used to be excellent open-source software. Past tense. Over the course of 2025, the company systematically dismantled what made it valuable for home lab and small-scale deployments:
June 2025: Removed the web admin console from the Community Edition. Features like bucket configuration, lifecycle policies, and account management became CLI-only—or you could pay for AIStor.
The pattern was obvious: push users toward AIStor, a proprietary product starting at nearly $100k, by making the open-source version progressively less usable. The community called it what it was—a lock-in strategy disguised as “streamlining.”
I’m not paying six figures for object storage in my home lab. Time to migrate.
Enter Garage
I needed S3-compatible storage that was:
Actually open source, not “open source until we change our minds”
Lightweight, suitable for single-node deployments
Actively maintained by a community that won’t pull the rug out
Garage checked all the boxes. Built in Rust by the Deuxfleurs collective, it’s designed for geo-distributed deployments but scales down beautifully to single-node setups. More importantly, it’s genuinely open source—developed by a collective, not a company with a paid product to upsell.
The Migration Process
Vault: The Critical Path
Vault was the highest-stakes piece of this migration. It’s the backbone of my secrets management, and getting this wrong meant potentially losing access to everything. I followed the proper migration path:
Stopped the Vault pod in my Kubernetes cluster—no live migrations, no shortcuts
Used vault operator migrate to transfer the storage backend from MinIO to Garage—this is the officially supported method that ensures data integrity
Updated the vault-storage-config Kubernetes secret to point at the new Garage endpoint
Restarted Vault and unsealed it with my existing keys
The vault operator migrate command handled the heavy lifting, ensuring every key-value pair transferred correctly. While I could have theoretically just mirrored S3 buckets and updated configs, using the official migration tool gave me confidence nothing would break in subtle ways later.
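For anyone following along, the shape of that migration looks something like the sketch below. The StatefulSet and namespace names, bucket names, credentials, and region are placeholders rather than my actual values; the endpoints use the old and new ports described later in this post:

```bash
# Stop Vault first: no live migrations, no shortcuts
kubectl -n vault scale statefulset/vault --replicas=0

# Migration config for `vault operator migrate`
cat > migrate.hcl <<'EOF'
storage_source "s3" {
  bucket              = "vault"
  endpoint            = "http://cloud.gerega.net:39000"   # old MinIO
  region              = "us-east-1"
  access_key          = "MINIO_ACCESS_KEY"
  secret_key          = "MINIO_SECRET_KEY"
  s3_force_path_style = "true"
}

storage_destination "s3" {
  bucket              = "vault"
  endpoint            = "http://cloud.gerega.net:3900"    # new Garage
  region              = "us-east-1"
  access_key          = "GARAGE_ACCESS_KEY"
  secret_key          = "GARAGE_SECRET_KEY"
  s3_force_path_style = "true"
}
EOF

vault operator migrate -config=migrate.hcl

# Bring Vault back and unseal with the existing keys (repeat for each key share)
kubectl -n vault scale statefulset/vault --replicas=1
vault operator unseal
```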
Monitoring Stack: Configuration Updates
With Vault successfully migrated, the rest was straightforward. I updated S3 endpoint configurations across my monitoring stack in ops-internal-cluster:
Loki, Mimir, and Tempo all had their storage backends updated:
Old: cloud.gerega.net:39000 (MinIO)
New: cloud.gerega.net:3900 (Garage)
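As an illustration, the relevant piece of Loki’s Helm values ended up looking roughly like this (exact keys vary by chart version; Mimir and Tempo were analogous):

```yaml
loki:
  storage:
    type: s3
    s3:
      endpoint: cloud.gerega.net:3900   # was cloud.gerega.net:39000 (MinIO)
      s3ForcePathStyle: true
      insecure: true                    # plain HTTP on the internal network
```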
I intentionally didn’t migrate historical metrics and logs. This is a lab environment—losing a few weeks of time-series data just means starting fresh with cleaner retention policies. In production, you’d migrate this data. Here? Not worth the effort.
Monitoring Garage Itself
I added a Grafana Alloy scrape job to collect Garage’s Prometheus metrics from its /metrics endpoint. No blind spots from day one—if Garage has issues, I’ll know immediately.
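A minimal sketch of that scrape job, assuming Garage’s admin API (which serves the /metrics endpoint) is exposed on its default port 3903 and that an existing prometheus.remote_write component named mimir is the destination:

```alloy
prometheus.scrape "garage" {
  // Address and port are assumptions based on Garage's defaults
  targets      = [{ "__address__" = "cloud.gerega.net:3903" }]
  metrics_path = "/metrics"
  forward_to   = [prometheus.remote_write.mimir.receiver]
}
```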
Deployment Architecture
One deliberate choice: Garage runs as a single Docker container on bare metal, not in Kubernetes. Object storage is foundational infrastructure. If my Kubernetes clusters have problems, I don’t want my storage backend tied to that failure domain.
Running Garage outside the cluster means:
Vault stores data independently of cluster state
Monitoring storage (Loki, Mimir, Tempo) persists during cluster maintenance
One less workload competing for cluster resources
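For reference, running it this way is about as simple as a single docker run; the image tag, volume paths, and config file below are illustrative rather than my exact setup:

```bash
# Ports: 3900 = S3 API, 3903 = admin API / metrics
docker run -d --name garage \
  --restart unless-stopped \
  -v /etc/garage.toml:/etc/garage.toml \
  -v /var/lib/garage/meta:/var/lib/garage/meta \
  -v /var/lib/garage/data:/var/lib/garage/data \
  -p 3900:3900 -p 3903:3903 \
  dxflrs/garage:v1.0.1
```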
Verification and Cleanup
Before decommissioning MinIO, I verified nothing was still pointing at the old endpoints:
```bash
# Searched across GitOps repos
grep -r "39000" .              # Old MinIO port
grep -r "192.168.1.30" .       # Old MinIO IP
grep -r "s3.mattgerega.net" .
```
Clean sweep—everything migrated successfully.
Current Status
Garage has been running for about a week now. Resource usage is lower than MinIO ever was, and everything works:
Vault sealed/unsealed multiple times without issues
Loki ingesting logs from multiple clusters
Mimir storing metrics from Grafana Alloy
Tempo collecting distributed traces
The old MinIO instance is still running but idle. I’ll give it another week before decommissioning entirely—old habits die hard, and having a fallback during initial burn-in feels prudent.
Port 3900 is the new standard. Port 39000 is legacy. And my infrastructure is no longer dependent on a company actively sabotaging its open-source product.
Lessons for the Homelab Community
If you’re still running MinIO Community Edition, now’s the time to plan your exit strategy. The maintenance-mode announcement wasn’t a surprise—it was the inevitable conclusion of a year-long strategy to push users toward paid products.
Alternatives worth considering:
Garage: What I chose. Lightweight, Rust-based, genuinely open source.
SeaweedFS: Go-based, active development, designed for large-scale deployments but works at small scale.
Ceph RGW: If you’re already running Ceph, the RADOS Gateway provides S3 compatibility.
The MinIO I deployed years ago was a solid piece of open-source infrastructure. The MinIO of 2025 is a bait-and-switch. Learn from my migration—don’t wait until you’re forced to scramble.
Technical Details:
Garage deployment: Single Docker container on bare metal
As with any good engineer, I cannot leave well enough alone. Over the past week, I’ve been working through a significant infrastructure modernization across my home lab clusters – migrating from NGINX Ingress to Envoy Gateway and implementing the Kubernetes Gateway API. This also involved some necessary housekeeping with chart updates and a shift to Server-Side Apply for all ArgoCD-managed resources.
Why Change?
The timing couldn’t have been better. In November 2025, Kubernetes SIG Network and the Security Response Committee announced that Ingress NGINX will be retired in March 2026. The project has struggled with insufficient maintainer support, security concerns around configuration snippets, and accumulated technical debt. After March 2026, there will be no further releases, security patches, or bug fixes.
The announcement strongly recommends migrating to the Gateway API, described as “the modern replacement for Ingress.” This validated what I’d already been considering – the Gateway API provides a more standardized, vendor-neutral approach with better separation of concerns between infrastructure operators and application developers.
Envoy Gateway, being a CNCF project built on the battle-tested Envoy proxy, seemed like a natural choice for this migration. Plus, it gave me an excuse to finally move off Traefik, which was… well, let’s just say it was time for a change.
The Migration Journey
The migration happened in phases across my ops-argo, ops-prod-cluster, and ops-nonprod-cluster repositories. Here’s what changed:
Phase 1: Adding Envoy Gateway
I started by adding Envoy Gateway as a cluster tool, complete with its own ApplicationSet that deploys to clusters labeled with spydersoft.io/envoy-gateway: "true". The deployment includes:
GatewayClass and Gateway resources: Defined a main gateway that handles traffic routing
EnvoyProxy configuration: Set up with a static NodePort service for consistent external access
ClientTrafficPolicy: Configured to properly handle forwarded headers – crucial for preserving client IP information through the proxy chain
The Envoy Gateway deployment lives in the envoy-gateway-system namespace and exposes services via NodePort 30080 and 30443, making it easy to integrate with my existing network setup.
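A trimmed-down sketch of those resources is below. The GatewayClass controller name is Envoy Gateway’s standard one, but the gateway name, certificate secret, and the EnvoyProxy/ClientTrafficPolicy attachment details are simplified stand-ins for what’s in my repos:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: envoy-gateway-system
spec:
  gatewayClassName: envoy-gateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
    - name: https
      protocol: HTTPS
      port: 443
      allowedRoutes:
        namespaces:
          from: All
      tls:
        mode: Terminate
        certificateRefs:
          - name: wildcard-tls   # placeholder certificate secret
```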
Phase 2: Migrating Applications to HTTPRoute
This was the bulk of the work. Each application needed its Ingress resource replaced with an HTTPRoute. The new Gateway API resources are much cleaner. For example, my blog (www.mattgerega.com) went from an Ingress definition to this:
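Here’s a hedged sketch of that HTTPRoute; the hostname is real, but the route name, namespace, backend service, and gateway reference are illustrative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: www-mattgerega-com
  namespace: blog
spec:
  parentRefs:
    - name: main-gateway
      namespace: envoy-gateway-system
  hostnames:
    - www.mattgerega.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: blog
          port: 80
```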
Phase 3: Server-Side Apply Everywhere
Alongside the routing work, I switched all ArgoCD-managed resources to Server-Side Apply. This is a better practice for ArgoCD because it lets the Kubernetes API server perform the merge instead of relying on client-side strategic merge patches, reducing conflicts and improving sync reliability.
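Enabling it is a one-line sync option per Application (ServerSideApply=true is a standard ArgoCD sync option; the rest of the spec is omitted here):

```yaml
spec:
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
```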
Lessons Learned
Gateway API is ready for production: The migration was surprisingly smooth. The Gateway API resources are well-documented and intuitive. With NGINX Ingress being retired, now’s the time to make the jump.
HTTPRoute vs. Ingress: HTTPRoute is more expressive and allows for more sophisticated routing rules. The explicit parentRefs concept makes it clear which gateway handles which routes.
Server-Side Apply everywhere: Should have done this sooner. The improved conflict handling makes ArgoCD much more reliable, especially when multiple controllers touch the same resources.
Envoy’s configurability: The EnvoyProxy custom resource gives incredible control over the proxy configuration without needing to edit ConfigMaps or deal with annotations.
Multi-cluster consistency: Making these changes across production and non-production environments simultaneously kept everything aligned and reduced cognitive overhead when switching between environments.
Current Status
All applications across all clusters are now running through Envoy Gateway with the Gateway API. Traffic is flowing correctly, TLS is terminating properly, and I’ve removed all the old ingress-related configuration from both production and non-production environments.
The clusters are more standardized, the configuration is cleaner, and I’m positioned to take advantage of future Gateway API features like traffic splitting and more advanced routing capabilities. More importantly, I’m ahead of the March 2026 retirement deadline with plenty of time to spare.
Now, the real question: what am I going to tinker with next?
Over the past few weeks, I’ve been on both a physical and digital cleaning spree. It was long overdue, and honestly, it feels like a weight has been lifted from my shoulders.
Winterizing everything
Technically, fall just started a week ago. But in the Northeast, “fall” can mean anything from 80-degree afternoons to an early snowstorm. With school and sports in full swing, the pool had seen its last swim of the season, which meant it was time to close things up. Along with that came the annual migration of tropical plants into the house for the winter.
Before I could even get there, though, my storage shed and garage were in desperate need of a purge. Ten contractor bags later, I finally had the space to neatly store the things that actually matter.
With that newfound space came the itch to reorganize. I moved a few items from the garage to the shed, built out some shelves and lofts, and—of course—came up with a dozen new project ideas, like adding a ramp to the shed. Luckily, I reined in the scope creep and wrapped things up neatly for winter.
Digital Destruction
On the digital front, I’d been putting off a project for a while: decommissioning my local Active Directory domain. The only reason I had one in the first place was to make managing Windows servers easier. But as I’ve shifted to Proxmox and Kubernetes clusters, the need for Active Directory dropped off pretty quickly.
Most of my DNS had already moved to my Unifi Gateway. The only holdup was that a few personal machines were still joined to the domain, meaning I had to migrate user profiles. Not difficult—just tedious.
In full cleanup mode, I finally bit the bullet. After an hour or so per machine, everything was running on local profiles, disconnected from the domain. With that, I shut down the AD servers and haven’t looked back.
Streamlining
I’m happy with where things landed. While I don’t have a centralized user directory anymore, I’ve gained flexibility—and peace of mind. My AD domain was running on “lab-grade” hardware, so losing it would’ve been a headache. Now, I don’t have to worry.
Nearly everything powering my home automation has already been moved off the lab gear, except for a single NodeRed instance. I haven’t decided where to run it yet, but it’ll be migrated soon.
With this cleanup, I’ve officially decommissioned my last two Windows servers—the domain controllers. My home lab is now fully containerized, and my garage and shed are finally ready for winter.
In short: a fall purge on both fronts—physical and digital—left me with more space, less clutter, and a lot more breathing room.
A lot has gone on this summer. Work efforts have kept me busy, and I have spent a lot of “off” time researching ways to improve our services at work. That said, I have had some time to get a few things done at home.
Proxmox Move
I was able to get through my move to Proxmox servers. It was executed, roughly, as follows:
Created new scripts to provision RKE2 nodes in Proxmox.
Provisioned new RKE2 node VMs on my temporary Proxmox node, effectively migrating the clusters from Hyper-V to Proxmox.
Wiped my server and installed Proxmox on it.
Provisioned new RKE2 node VMs on the new server, effectively migrating the clusters (again) onto it.
I have noticed that, when provisioning new machines via a VM clone, my IO delay gets a bit high, and some of the other VMs don’t like that. For now, it’s manageable, as I don’t provision often, but as I plan out a new cluster, disk IO is something to keep in mind.
Moving my DNS
I moved my DNS server to my Unifi Cloud Gateway Max. The Unifi controller running there has been very stable, and I am already using its API to provision fixed IPs based on MAC addresses, so adding local DNS records was the next step.
Thankfully, I had rebuilt my Windows domain on a different DNS name than my normal routing domain, so I was able to move the routing domain to the UCG and add a forwarding record for the Windows domain. At that point, the only machines still relying on the domain were the domain-joined ones.
Getting rid of the domain
At this point, I am considering decommissioning my Windows domain. However, I have a few more moves to make before that happens. As luck would have it, I have some ideas as to how to make it work. Unfortunately for my readers, that will come in a later post.
Oh, and another teaser… I printed a new server rack. More show and tell later!
As with any good engineer, I cannot leave well enough alone. My current rainy day project is reconfiguring my home lab for some much needed updates and simplification.
What’s Wrong?
My home lab is, well, still going strong. My automation scripts work well, and I don’t spend a ton of time doing what I need to do to keep things up to date, at least when it comes to my Kubernetes clusters.
The other servers, however, are in a scary spot. Everything is running on top of the free version of Windows Hyper-V Server from 2019, so general updates are a concern. I would LOVE to move to Windows Server 2025, but I do not have the money for that kind of endeavor.
The other issue with running Windows servers is that, well, they usually expect a Windows Domain (or, at least, my version does). This requirement has forced me to run my own domain controllers for a number of years now. Earlier iterations of my lab included a lot of Windows VMs, so the domain helped me manage authentication across them all. But, with RKE2 and Kubernetes running the bulk of my workloads, the domain controllers are more hassle than anything right now.
The Plan
My current plan is to migrate my home server to Proxmox. It seems a pretty solid replacement for Hyper-V, and has a few features in it that I may use in the future, like using cloud-init for creating new cluster nodes and better management of storage.
Obviously, this is going to require some testing, and luckily, my old laptop is free for some experimentation. So I installed Proxmox there and messed around, and I came up with an interesting plan.
Migrate my VMs to my laptop instance of Proxmox, reducing the workload as much as I can.
Install Proxmox on my server
Create a Proxmox cluster with my laptop and server as the nodes.
Transfer my VMs from the laptop node to the server node.
Cutting my Workload
My laptop has a paltry 32GB of RAM, compared to the 288GB in my server. While I need to get everything “over” to the laptop, it doesn’t all have to be running at the same time.
For the Windows VMs, my current plan is as follows:
Move my primary domain controller to the laptop, but run at a reduced capacity (1 CPU/2GB).
Move my backup DC to the laptop, shut it down.
Move and shut down both SQL Server instances: they are only running lab DBs, nothing really vital.
For my clusters, I’m not actually going to “move” the VMs. I’m going to create new nodes on the laptop’s Proxmox instance, add them to the clusters, and then deprovision the old ones. This gives me some control over what’s there.
Non-Production Cluster -> 1 control plane server, 2 agents, but shut them down.
Internal Cluster -> 1 control plane server (down from 3), 3 agents, all shut down.
Production Cluster -> 1 control plane (down from 3), 2 agents, running vital software. I may need to migrate my HC Vault instance to the production cluster just to ensure secrets stay up and running.
With this setup, I should really only have 4 VMs running on my laptop, which it should be able to handle. Once that’s done, I’ll have time to install and configure Proxmox on the server, and then move VMs from the laptop to the server.
Lots to do
I have a lot of learning to do. Proxmox seems pretty simple to start, but I find I’m having to read a lot about the cloning and cloud-init pieces to really make use of the power of the tool.
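As a concrete example of what I’m reading about, the template-plus-clone workflow from the Proxmox docs looks roughly like this (VM IDs, storage name, and image name are placeholders; I haven’t run this against my own hardware yet):

```bash
# Build a cloud-init-ready template from an Ubuntu cloud image
qm create 9000 --name ubuntu-cloud-template --memory 2048 --net0 virtio,bridge=vmbr0
qm importdisk 9000 noble-server-cloudimg-amd64.img local-lvm
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0
qm set 9000 --ide2 local-lvm:cloudinit --boot order=scsi0 --serial0 socket --vga serial0
qm template 9000

# New cluster node = full clone of the template, configured via cloud-init
qm clone 9000 201 --name rke2-agent-01 --full
qm set 201 --ciuser ubuntu --ipconfig0 ip=dhcp --sshkeys ~/.ssh/id_ed25519.pub
qm start 201
```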
Once I feel comfortable with Proxmox, the actual move will need to be scheduled… So, maybe by Christmas I’ll actually have this done.
I ran into an odd situation last week with ArgoCD, and it took a bit of digging to figure it out. Hopefully this helps someone else along the way.
Whatever you do, don’t panic!
Well, unless of course you are ArgoCD.
I have a small Azure DevOps job that runs nightly and attempts to upgrade some of the Helm charts that I use to deploy external tools. This includes things like Grafana, Loki, Mimir, Tempo, ArgoCD, External Secrets, and many more. This job deploys the changes to my GitOps repositories, and if there are changes, I can manually sync.
Why not auto-sync, you might ask? Visibility, mostly. I like to see what changes are being applied, in case there is something bigger in the changes that needs my attention. I also like to “be there” if something breaks, so I can rollback quickly.
Last week, while upgrading Grafana and Tempo, ArgoCD started throwing the following error on sync:
Recovered from panic: runtime error: invalid memory address or nil pointer
A quick trip to Google produced a few different results, but nothing immediately apparent. One particular issue mentioned a problem with out-of-date resources (an old apiVersion). Let’s put a pin in that.
Nothing was jumping out, and my deployments were still working. I had a number of other things on my plate, so I let this slide for a few days.
Versioning….
When I finally got some time to dig into this, I figured I would pull on that apiVersion thread and see what shook loose. Unfortunately, since the error gives no real indication of which resource is causing it, finding the offender was luck of the draw. This time, I was lucky.
My ExternalSecret resources were still using an alpha apiVersion, so my first thought was to update them to v1. Lo and behold, that fixed the two charts that were failing.
This, however, points to a bigger issue: if ArgoCD is not going to tell me when a resource has an out-of-date apiVersion, I need to figure out how to validate these resources before I commit the changes. I’ll put this on my ever-growing to-do list.
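One option I’m eyeing (nothing settled yet) is having the nightly job render the charts and run a server-side dry run against the cluster, which would flag any apiVersion the API server no longer serves. A rough sketch, with the chart name and path as placeholders:

```bash
# Render the chart the way ArgoCD would, then let the API server validate it without
# applying anything; unknown or no-longer-served apiVersions fail here instead of at sync
helm template external-secrets ./charts/external-secrets \
  | kubectl apply --dry-run=server -f -
```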
I have run a Unifi Security Gateway (USG) for a while now. In conjunction with three wireless access points, the setup has been pretty robust. The only area I have had some trouble in is the controller software.
I run the controller software on one of my K8s clusters. The deployment is fairly simple, but if the pod dies unexpectedly, the underlying MongoDB can become corrupted. It’s happened enough that I religiously back up the controller, and restoring isn’t too terribly painful.
Additionally, the server and cluster are part of my home lab. If they die, well, I will be inconvenienced, but not down and out. Except, of course, for the Unifi controller software.
Enter the Unifi Cloud Gateways
Unifi has had a number of different entries into the cloud gateways, including the Dream Machine. The price point was a barrier to entry, especially since I do not really need everything that the Dream Machine line has to offer.
Recently, they released gateways in a compact form factor. The Cloud Gateway Ultra and Cloud Gateway Max are more reasonably priced, and the Gateway Max allows for the full Unifi application suite in that package. I have been stashing away some cash for network upgrades, and the Cloud Gateway Max seemed like a good first step.
Network Downtime
It has become a disturbing fact that I have to schedule network downtime in my own home. With about 85 network-connected devices, if someone is home, they are probably on the network. Luckily, I found some time to squeeze it in while no one was home.
The process took longer than expected. The short version: I was not able to successfully restore a backup of my old controller onto the new gateway. My network configuration is not that complex, though, so I just recreated the necessary networks and WiFi SSIDs, and things were back up.
I did face the long and arduous process of making sure all of my static IP assignments were moved from the old system to the new one. I had all the information; it was just tedious copy and paste.
All in all, it took me about 90 minutes to get everything set up… Thankfully, no one complained.
Unexpected Bonus
The UCG-Max has 4 ports plus a WAN Port, whereas the USG only had 2 ports plus a WAN port. I never utilized the extra port on the USG: everything went through my switch.
However, with 3 open ports on the UCG-Max, I can move my APs onto their own port, effectively splitting wireless traffic from wired traffic until it hits the gateway. I don’t know how much of a performance effect it will have, but it will be nice to see the difference between wireless and wired internet traffic.
More To Come…. but not soon
I have longer-term plans for upgrades to my switch and wireless APs, but I am back to zero when it comes to “money saved for network upgrades.” I’ll have to be deliberate about my next upgrades, but hopefully the time won’t be measured in years.
I have been running a free version of Proget locally for years now. It served as a home for Nuget packages, Docker images, and Helm charts for my home lab projects. But, in an effort to slim down the apps that are running in my home lab, I took a look at some alternatives.
Where can I put my stuff?
When I logged in to my Proget instance and looked around, it occurred to me that I only had 3 types of feeds: Nuget packages, Docker images, and Helm charts. So to move off of Proget, I need to find replacements for all of these.
Helm Charts
Back in the heady days of using Octopus Deploy for my home lab, I used published Helm charts to deploy my applications. However, since I switched to a Gitops workflow with ArgoCD, I haven’t published a Helm chart in a few years. I deleted that feed in Proget. One down, two to go.
Nuget Packages
I have made a few different attempts to create Nuget packages for public consumption. A number of years ago, I tried publishing a data layer that was designed to be used across platforms (think APIs and mobile applications), but even I stopped using that in favor of Entity Framework Core and good old fashioned data models. More recently, I created some “platform” libraries to encapsulate some of the common code that I use in my APIs and other projects. They serve as utility libraries as well as a reference architecture for my professional work.
There are a number of options for hosting Nuget feeds, with varying costs depending on structure. I considered the following options:
Azure DevOps Artifacts
Github Packages
Nuget.org
I use Azure DevOps for my builds, and briefly considered using the artifacts feeds. However, none of my libraries are private. Everything I am writing is a public repository in Github. With that in mind, it seemed that the free offerings from Github and Nuget were more appropriate.
I published the data layer packages to Nuget previously, so I have some experience with that. However, with these platform libraries, while they are public, I do not expect them to be heavily used. For that reason, I decided that publishing the packages to Github Packages made a little more sense. If these platform libraries get to the point where they are heavily used, I can always publish stable packages to Nuget.org.
Container Images
In terms of storage percentage, container images take up the bulk of my Proget storage. Now, I only have 5 container images, but I never clean anything up, so those 5 containers are taking up about 7 GB of data. When I was investigating alternatives, I wanted to make sure I had some way to clean up old pre-release tags and manifests to keep my usage down.
I considered two alternatives:
Azure Container Registry
Github Container Registry
An Azure Container Registry instance would cost me about $5 a month and provide me with 10 GB of storage. Github Container Registry provides 500MB of storage and 1GB of data transfer per month, but those limits only apply to private repositories.
As with my Nuget packages, nothing that I have is private, and Github packages are free for public packages. Additionally, I found a Github task that will clean up old images. As this was one of my “new” requirements, I decided to take a run at Github Packages.
Making the switch
With my current setup, the switch was fairly simple. Nuget publishing is controlled by my Azure DevOps service connections, so I created a new service connection for my Github feed. The biggest change was some housekeeping to add appropriate information to the Nuget package itself. This included adding the RepositoryUrl property to the .csproj files, which tells Github which repository to associate the package with.
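For reference, the equivalent from a local shell looks roughly like this; in my pipelines the authentication comes from the service connection rather than a token, and the owner name and project path are placeholders:

```bash
# Register GitHub Packages as a NuGet source (OWNER and token are placeholders)
dotnet nuget add source "https://nuget.pkg.github.com/OWNER/index.json" \
  --name github --username OWNER --password "$GITHUB_TOKEN" --store-password-in-clear-text

# Pack and push; RepositoryUrl in the .csproj ties the package to its repository
dotnet pack ./src/MyLibrary/MyLibrary.csproj -c Release -o ./artifacts
dotnet nuget push ./artifacts/*.nupkg --source github --api-key "$GITHUB_TOKEN"
```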
The container registry switch wasn’t much different: again, some housekeeping to add the appropriate labels to the images. From there, a few template changes and the images were in the Github container registry.
Overall, the changes were pretty minimal. I have a few projects left to convert, and once that is done, I can decommission my Proget instance.
Next on the chopping block…
I am in the beginning stages of evaluating Azure Key Vault as a replacement for my Hashicorp Vault instance. Although it comes at a cost, for my usage it is most likely under $3 a month, and getting away from self-hosted secrets management would make me a whole lot happier.