Tag: unifi

  • Proxmox, VLANs, and the Bridge Configuration That Almost Broke Me

    After successfully migrating 44 wireless devices to proper VLANs, I felt pretty good about myself.

    • Wireless segmentation: ✅
    • Zone-based firewalls: ✅

    Time to tackle the infrastructure, right? Well, Proxmox had other plans.

    The Plan

    I have two Proxmox hosts running my homelab:

    • pmxdell: A Dell laptop with one VM (Azure DevOps agent)
    • pxhp: An HP ProLiant with 17 VMs (three Kubernetes clusters)

    The goal was simple:

    1. Move Proxmox management interfaces to VLAN 60 (Services)
    2. Move VMs to VLAN 50 (Lab)
    3. Celebrate victory

    The execution? Well, let’s just say I learned some things about Linux bridge VLANs that the documentation doesn’t emphasize enough.

    Day 1: pmxdell and False Confidence

    I started with pmxdell because it was the simpler host—just one VM to worry about. I configured a VLAN-aware bridge, added the management IP on VLAN 60, and restarted networking.

    Everything worked. pmxdell came back up on 192.168.60.11. SSH worked. The Proxmox web interface was accessible. I was a networking wizard.
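
    For the curious, the management side of that kind of setup is typically a VLAN 60 sub-interface on top of the VLAN-aware bridge. A minimal sketch (the host address is the one above; the gateway is an assumption):

    auto vmbr0.60

    iface vmbr0.60 inet static
        # Proxmox management IP on VLAN 60; gateway address assumed to be .1
        address 192.168.60.11/24
        gateway 192.168.60.1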

    Then I tried to migrate the VM to VLAN 50.

    qm set 30000 --net0 virtio,bridge=vmbr0,tag=50
    
    qm start 30000

    The VM started. It got… no IP address. DHCP requests disappeared into the void. The VM had no network connectivity whatsoever.

    The Investigation

    My first thought: firewall issue. But the firewall rules were correct—LAB zone could access WAN for DHCP.

    Second thought: DHCP server problem. But other devices on VLAN 50 worked fine.

    Third thought: Maybe I need to restart the VM differently? I stopped it, started it, rebooted it, sacrificed it to the networking gods. Nothing.
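
    A debugging trick I should have reached for sooner: watch the DHCP traffic on the VM’s tap interface and on the physical uplink, and see where the packets stop (a sketch; Proxmox names tap interfaces tap<vmid>i<index>):

    # on the VM side of the bridge
    tcpdump -ni tap30000i0 port 67 or port 68

    # on the uplink, where the frames should leave tagged with VLAN 50
    tcpdump -ni enp0s31f6 'vlan 50 and (port 67 or port 68)'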

    Then I ran bridge vlan show:

    port              vlan-id
    enp0s31f6         1 PVID Egress Untagged
                      50
                      60
    vmbr0             1 PVID Egress Untagged
                      60

    See the problem? VLAN 50 is on the physical interface (enp0s31f6), but not on the bridge device itself (vmbr0). The tap interface for the VM had nowhere to attach to.

    The “bridge-vids” Revelation

    My /etc/network/interfaces configuration looked like this:

    auto vmbr0
    
    iface vmbr0 inet manual
        bridge-ports enp0s31f6
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 1 50 60

    I had assumed—like a reasonable person who reads documentation—that `bridge-vids 1 50 60` would add those VLANs to the entire bridge configuration.

    Wrong.

    bridge-vids only applies VLANs to the bridge ports (the physical interface). It doesn’t touch the bridge device itself. The bridge device needs VLANs added explicitly.

    Why does this matter? Because when Proxmox creates a tap interface for a VM with a VLAN tag, it needs to add that tap interface as a member of that VLAN *on the bridge device*. If the bridge device doesn’t have that VLAN, the tap interface can’t join it.

    VLAN 1 works automatically because it’s the default PVID on bridge devices. Any other VLAN? You have to add it manually.

    The Fix

    The solution was adding explicit post-up commands:

    auto vmbr0
    
    iface vmbr0 inet manual
        bridge-ports enp0s31f6
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 1 50 60
        post-up bridge vlan add dev vmbr0 vid 50 self
        post-up bridge vlan add dev vmbr0 vid 60 self

    Applied the changes, stopped the VM, started it again (not restart—stop then start), and suddenly: DHCP lease acquired. VM online. Victory.
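
    A quick sanity check after a fix like this: bridge vlan show should now list the VLANs on the bridge device itself, something along these lines (trimmed to the relevant entry):

    bridge vlan show dev vmbr0

    port              vlan-id
    vmbr0             1 PVID Egress Untagged
                      50
                      60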

    Day 2: pxhp and the Networking Service Trap

    Armed with my new knowledge, I confidently configured pxhp. Four NICs bonded in LACP, VLAN-aware bridge, proper `post-up` commands. I even remembered to configure the bridge with VLAN 50 support from the start.
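
    For reference, that sort of setup looks roughly like this in /etc/network/interfaces (a sketch; the NIC names are placeholders, not my actual ports):

    auto bond0

    iface bond0 inet manual
        # LACP bond across the four physical NICs (placeholder names)
        bond-slaves eno1 eno2 eno3 eno4
        bond-mode 802.3ad
        bond-miimon 100

    auto vmbr0

    iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 1 50 60
        post-up bridge vlan add dev vmbr0 vid 50 self
        post-up bridge vlan add dev vmbr0 vid 60 self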

    Then I made a critical mistake: I ran systemctl restart networking.

    All 17 VMs instantly lost network connectivity.

    Why Restarting Networking is Evil

    When you run systemctl restart networking on a Proxmox host:

    1. The bridge goes down
    2. All tap interfaces are removed
    3. All VMs lose their network connection
    4. The bridge comes back up
    5. The tap interfaces… don’t automatically recreate

    Your VMs are now running but completely isolated from the network. You have to stop and start each VM to recreate its tap interface.

    The Better Approach: Shutdown VMs first, then restart networking. Or just reboot the entire host and let the VMs come back up automatically with proper tap interfaces.
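
    Worth noting: newer Proxmox releases ship ifupdown2, whose ifreload applies interface changes in place instead of bouncing everything, which generally leaves the tap interfaces alone. I would still test it on the small host first:

    # apply /etc/network/interfaces changes without a full networking restart
    ifreload -a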

    I learned this the hard way when I had to stop and start 17 VMs manually. In the middle of the migration. With production workloads running.
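
    If you ever end up in the same spot, a loop beats clicking through the UI seventeen times. A sketch (it bounces every running VM on the host, so mind the order of anything production-ish):

    # stop/start (not restart) each running VM so its tap interface gets recreated
    for vmid in $(qm list | awk 'NR>1 && $3=="running" {print $1}'); do
        qm stop "$vmid" && qm start "$vmid"
    done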

    Day 3: Kubernetes and the Blue-Green Migration

    With both Proxmox hosts properly configured, it was time to migrate the Kubernetes clusters. I had three:

    • Non-production (3 VMs)
    • Internal (8 VMs)
    • Production (5 VMs)

    The problem: Kubernetes nodes can’t easily change IP addresses. The IP is baked into etcd configuration, SSL certificates, and about seventeen other places. Changing IPs means major surgery with significant downtime risk.

    The Solution: Blue-green deployment, Kubernetes-style.

    1. Provision new nodes on VLAN 50
    2. Join them to the existing cluster (now you have old + new nodes)
    3. Drain workloads from old nodes to new nodes
    4. Remove old nodes from the cluster
    5. Delete old VMs

    No IP changes. No etcd reconfiguration. No downtime. Just gradual migration while workloads stay running.
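
    The per-node mechanics are just the standard drain-and-remove dance (the node name is a placeholder):

    # move workloads off an old node, then remove it from the cluster
    kubectl drain old-node-01 --ignore-daemonsets --delete-emptydir-data
    kubectl delete node old-node-01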

    I started with the non-production cluster as a test. The entire migration took maybe 30 minutes, and the cluster never went offline. Workloads migrated seamlessly from old nodes to new nodes.

    As of today, I’m 1 cluster down, 2 to go. The non-production cluster has been running happily on VLAN 50 for a few hours with zero issues.

    What I Learned

    1. bridge-vids is a lie. Or rather, it’s not a lie—it just doesn’t do what you think it does. It configures bridge ports, not the bridge device. Always add explicit post-up commands for VLAN membership.
    2. Never restart networking on Proxmox with running VMs. Just don’t. Either shutdown VMs first, or reboot the whole host. Future you will thank past you.
    3. Blue-green migrations work brilliantly for Kubernetes. Provision new nodes, migrate workloads, remove old nodes. No downtime, no drama.
    4. Stop and start, never restart. When you change VM VLAN configuration, you need to stop the VM then start it. Restart doesn’t recreate the tap interface with new VLAN membership.
    5. Test on simple hosts first. I started with pmxdell (1 VM) before tackling pxhp (17 VMs). That saved me from debugging VLAN issues with production workloads running.

    The Current State

    Infrastructure migration progress:

    • ✅ Proxmox hosts: Both on VLAN 60 (Services)
    • ✅ Kubernetes (non-prod): 3 VMs on VLAN 50
    • Kubernetes (internal): 7 VMs, still to migrate
    • Kubernetes (production): 5 VMs, still to migrate

    Next steps: Monitor the clusters for 24-48 hours, then migrate internal cluster. Production cluster goes last because I’m not completely reckless.

    You’re missing an agent…

    The astute among you may notice that my internal cluster went from 8 nodes to 7. As I was cycling nodes, I took the time to check the resources on that cluster, and realized that some unrelated work to consolidate observability tools let me scale down to 4 agents. My clusters have started the year off right by losing a little weight.

    Part 3 of the home network rebuild series. Read Part 2: From “HideYoWifi” to “G-Unit”

  • From HideYoWifi to G-Unit

    A Story of SSID Consolidation and Zone-Based Security

    You know that moment when you’re explaining your home network to someone and you realize how ridiculous it sounds out loud? I had that moment when describing my SSID situation.

    “So I have HideYoWifi, SafetyInNumbers, StopLookingAtMeSwan, and DoIKnowYou…”

    The look on their face said it all.

    The SSID Situation

    After cleaning up my device inventory (goodbye, 17 identical ubuntu-server instances), I turned my attention to the wireless side of things. I had four SSIDs, all serving the same flat VLAN 1 network. The only difference between them was… well, there wasn’t really a difference. They were functionally identical.

    It was peak home network evolution: each SSID represented a moment in time when I thought “I’ll just create a new one for this use case” without ever deprecating the old ones.

    The Upgrade That Changed Everything

    My UCG Max supported zone-based firewalls, but I’d never enabled them. Why? Because zone-based firewalls are serious networking infrastructure, and I wasn’t sure I needed that level of complexity.

    Then I looked at my flat network with its 77 devices and zero segmentation, and I realized: I absolutely needed that level of complexity.

    On December 17th, I flipped the switch. The UCG Max upgraded to zone-based firewall mode, and suddenly I had the foundation for proper network segmentation. No more flat network. No more “everything can talk to everything” architecture. Just clean, policy-based isolation.

    The SSID Consolidation

    With zone-based firewalls enabled, having four identical SSIDs made even less sense. So I started the consolidation:

    • StopLookingAtMeSwan → Disabled (it had one device: a Blink connection module)
    • SafetyInNumbers → Merged into HideYoWifi (10 devices moved)
    • DoIKnowYou → Kept as guest network (zero devices, but useful for visitors)
    • HideYoWifi → Primary network (for now)

    With my new VLAN architecture, I didn’t want a single “primary” network anymore. I wanted purpose-built SSIDs for different device classes. That meant new SSIDs with actual meaningful names.

    Enter “G-Unit”

    I needed a naming scheme. Something memorable, professional enough for guests, but with personality. I considered:

    • “HomeNet-X” (too boring)
    • “TheSkynet” (too obvious)
    • “NetworkNotFound” (too clever by half)

    For obvious reasons, my family’s group chat name is “G-Unit.” Why not continue with that name?

    And you know what? It actually *worked* as a naming scheme.

    The New SSID Structure:

    • G-Unit → VLAN 10 (Trusted): Phones, laptops, work devices
    • G-Unit-IoT → VLAN 20 (IoT): Smart home devices, sensors, automation
    • G-Unit-Media → VLAN 40 (Media): Chromecasts, streaming devices, smart TVs
    • G-Unit-Guest → VLAN 99 (Guest): Isolated network for visitors

    Clean. Purposeful. Each SSID maps to a specific VLAN with specific firewall rules. No more “everything on VLAN 1” architecture.

    The Migration

    Between December 19th and 26th, I migrated 44 wireless devices across these new SSIDs. It was actually… smooth? Here’s why:

    I kept the old SSIDs running during the migration. Devices could join the new SSIDs at their convenience. No forced cutover. No mass outage. Just gradual, steady progress.

    The results:

    • December 19th: 24 of 41 devices migrated (59%)
    • December 19th evening: 36 of 41 devices migrated (88%)
    • December 26th: 44 of 44 devices migrated (100%)

    That last device? An iPhone that had been forgotten on the old SSID. Once it reconnected to G-Unit, I disabled HideYoWifi for good.

    The Zone-Based Firewall Magic

    With devices properly segmented, I could finally implement the security policies I’d been planning:

    IoT Zone (VLAN 20):

    • Can access Home Assistant (VLAN 60)
    • Can access internet
    • Cannot access file servers
    • Cannot access Proxmox infrastructure
    • Cannot access anything in Lab zone

    Media Zone (VLAN 40):

    • Can access NAS for media streaming (VLAN 60)
    • Can access internet
    • Cannot access IoT devices
    • Cannot access infrastructure

    Trusted Zone (VLAN 10):

    • Admin access to all zones (with logging)
    • Can manage infrastructure
    • Can access all services

    It’s beautiful. My Chromecast can stream from my NAS, but it can’t SSH into my Proxmox hosts. My smart plugs can talk to Home Assistant, but they can’t access my file server. Security through actual network isolation, not just hoping nothing bad happens.
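
    A quick way to spot-check those policies from something in the IoT zone (the addresses below are placeholders, not my actual hosts):

    # run from a device on VLAN 20
    nc -vz -w 3 <home-assistant-ip> 8123   # should connect
    nc -vz -w 3 <proxmox-host-ip> 22       # should time out or be refused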

    The Aftermath

    As of December 26th:

    • 100% of wireless devices migrated to zone-based VLANs
    • Zero devices on legacy SSIDs
    • 204 firewall policies actively enforcing isolation
    • Security score: 9.8/10 (up from 4/10 at the start)

    The flat network is dead. Long live the segmented network.

    What I Learned

    1. SSID consolidation is easier than you think. Keep old SSIDs running during migration. Let devices move at their own pace.
    2. Zone-based firewalls change everything. Once you have proper segmentation, you can actually enforce security policies instead of just hoping for the best.
    3. Naming matters. “G-Unit” is objectively ridiculous, but it’s memorable and tells a story. Sometimes that’s more important than being “professional.”
    4. Patience pays off. I could have forced a cutover in one evening. Instead, I spent a week doing gradual migration, and I had zero issues.

    Next up: The infrastructure migration. Proxmox hosts, Kubernetes clusters, and the moment I discovered that bridge-vids doesn’t do what I thought it did.

    Part 2 of the home network rebuild series. Read Part 1: The Accidental Network Archaeologist

  • The Accidental Network Archaeologist

    Discovering 124 devices in my “simple” home network

    I thought I knew my home network. I had a router, some switches, a few VLANs that made sense at the time, and everything just… worked. Until the day I decided to actually document what I had.

    Turns out, I didn’t know my network at all.

    The Discovery

    I fired up the UniFi controller expecting to see maybe 40-50 devices. You know, the usual suspects: phones, laptops, smart home devices, maybe a few Raspberry Pis. The controller reported 124 active devices.

    *One hundred and twenty-four.*

    I immediately had questions. Important questions like “what the hell is ubuntu-server-17?” and “why do I have *seventeen* devices all named ubuntu-server?”

    The Forensics Begin

    Armed with an AI agent and a growing sense of dread, I started the archaeological dig. The results were… enlightening:

    The Good:

    • 5 security cameras actually recording to my NAS
    • A functioning Kubernetes cluster (three of them, actually)
    • Two Proxmox hosts quietly doing their job

    The Bad:

    • 17 identical ubuntu-server instances (spoiler: they were old SQL Server experiments)
    • Devices with names like Unknown-b0:8b:a8:40:16:b6 (which turned out to be my Levoit air purifier)
    • Four SSIDs serving the same flat network because… reasons?

    The Ugly:

    • Everything on VLAN 1
    • No segmentation whatsoever
    • My security cameras had full access to my file server
    • My IoT devices could theoretically SSH into my Proxmox hosts

    The Uncomfortable Truths

    I had built this network over years, making pragmatic decisions that made sense *at the time*. Need another VM? Spin it up on VLAN 1. New smart device? Connect it to the existing SSID. Another Raspberry Pi project? You guessed it—VLAN 1.

    The result was a flat network that looked like a child had organized my sock drawer: functional, but deeply concerning to anyone who knew what they were looking at.

    The Breaking Point

    Two things finally pushed me to action:

    1. The Device Census: After identifying and cleaning up the obvious cruft, I still had 77 active devices with zero network segmentation.

    2. The “What If” Scenario: What if one of my IoT devices got compromised? It would have unfettered access to everything. My NAS. My Proxmox hosts. My Kubernetes clusters. Everything.

    I couldn’t just clean up the device list and call it done. I needed actual network segmentation. Zone-based firewalls. The works.

    The Plan

    I decided on an 8-VLAN architecture:

    • VLAN 1: Management/Infrastructure (ProCurve, UCG Max, core gear)
    • VLAN 10: Trusted (my actual devices)
    • VLAN 20: IoT (smart home stuff that definitely shouldn’t access my files)
    • VLAN 30: Surveillance (cameras recording to NAS)
    • VLAN 40: Media (streaming devices, Chromecast, etc.)
    • VLAN 50: Lab (Kubernetes and experimental infrastructure)
    • VLAN 60: Services (NAS, Home Assistant, critical services)
    • VLAN 99: Guest (for when people visit and I don’t trust their devices)

    Conservative? Maybe. But after discovering 124 devices in what I thought was a “simple” network, I was ready to embrace some architectural paranoia.

    What’s Next

    The past few weeks have been interesting, and the plan is to document my migration over a few posts.

    • First: Immediate security wins (guest network isolation, device cleanup)
    • Second: VLAN infrastructure and zone-based firewall policies
    • Third: Device-by-device migration with minimal disruption
    • Fourth: The scary part—migrating my Kubernetes clusters without breaking everything

    I’ll be documenting the journey here, including the inevitable mistakes, late-night troubleshooting sessions, and that special moment when you realize you’ve locked yourself out of your own network.

    Because if there’s one thing I’ve learned from this experience, it’s that home networks are never as simple as you think they are.

    This is Part 1 of a series on rebuilding my home network from the ground up. Next up: Why “G-Unit” became my SSID naming scheme, and how zone-based firewalls changed everything.

  • Upgrading the Home Network – A New Gateway

    I have run a Unifi Security Gateway (USG) for a while now. In conjunction with three wireless access points, the setup has been pretty robust. The only area I have had some trouble in is the controller software.

    I run the controller software on one of my K8s clusters. The deployment is fairly simple, but if the pod dies unexpectedly, the MongoDB database can become corrupted. It’s happened enough that I religiously back up the controller, and restoring isn’t too terribly painful.

    Additionally, the server and cluster are part of my home lab. If they die, well, I will be inconvenienced, but not down and out. Except, of course, for the Unifi controller software.

    Enter the Unifi Cloud Gateways

    Unifi has had a number of different entries in the cloud gateway space, including the Dream Machine. The price point was a barrier to entry, especially since I do not really need everything that the Dream Machine line has to offer.

    Recently, they released gateways in a compact form factor. The Cloud Gateway Ultra and Cloud Gateway Max are more reasonably priced, and the Gateway Max allows for the full Unifi application suite in that package. I have been stashing away some cash for network upgrades, and the Cloud Gateway Max seemed like a good first step.

    Network Downtime

    It has become a disturbing fact that I have to schedule network downtime in my own home. With about 85 network-connected devices, if someone is home, they are probably on the network. Luckily, I found some time to squeeze the work in while people were not home.

    The process was longer than expected: the short version is, I was not able to successfully restore a backup of my old controller on the new gateway. My network configuration is not that complex, though, so I just recreated the necessary networks and WiFi SSIDs, and things were back up.

    I did face the long and arduous process of making sure all of my static IP assignments were moved from the old system to the new one. I had all the information; it was just a tedious copy-and-paste job.

    All in all, it took me about 90 minutes to get everything set up… Thankfully, no one complained.

    Unexpected Bonus

    The UCG-Max has 4 ports plus a WAN port, whereas the USG only had 2 ports plus a WAN port. I never utilized the extra port on the USG: everything went through my switch.

    However, with 3 open ports on the UCG-Max, I can move my APs onto their own port, effectively splitting wireless traffic from wired traffic until it hits the gateway. I don’t know how much of a performance effect it will have, but it will be nice to see the difference between wireless and wired internet traffic.

    More To Come… but not soon

    I have longer term plans for upgrades to my switch and wireless APs, but I am back to zero when it comes to “money saved for network upgrades.” I’ll have to be deliberate in my next upgrades, but hopefully the time won’t be measured in years.

  • Cleaning out the cupboard

    I have been spending a little time in my server cabinet downstairs, trying to organize some things. I took what I thought would be a quick step in consolidation. It was not as quick as I had hoped.

    PoE Troubles

    When I got into the cabinet, I realized I had 3 PoE injectors in there, powering my three Unifi Access Points. Two of them are the UAP-AC-LR, and the third is a UAP-AC-M. My desire was simple: replace three PoE injectors with a 5-port PoE switch.

    So, I did what I thought would be a pretty simple process:

    1. Order the switch
    2. Using the MAC, assign it a static IP in my current Unifi Gateway DHCP.
    3. Plug in the switch.
    4. Take the cable coming out of the PoE injector and plug it into the switch.

    And that SHOULD be it: devices boot up and I remove the PoE injectors. And, for two of the three devices, it worked fine.

    There’s always one

    One of the UAP-AC-LR endpoints simply would not turn on. I thought maybe it was the cable, so I swapped cables around, but nothing changed: one UAP-AC-LR and the UAP-AC-M worked, while the other UAP-AC-LR stayed dark.

    I consulted the Oracle and came to realize that I had an old UAP-AC-LR, which only supports 24V passive PoE, not the 48V standard that my switch provides. The newer UAP-AC-LR and the UAP-AC-M support 802.3at (or at least a legacy 48V protocol), but my oldest UAP-AC-LR simply doesn’t turn on.

    The Choice

    There are two solutions, one more expensive than the other:

    1. Find an indoor PoE converter (INS-3AF-I-G) that can convert the 48V coming from my new switch to the 24V that the device needs.
    2. Upgrade! Buy a U6 Pro to replace my old long range access point.

    I like the latter, as it would give me WiFi 6 support and start my upgrade in that area. However, I’m not ready for the price tag at the moment. I was able to find the converter for about $25, and that includes shipping and tax. So I opted for the more economical route in order to get rid of that last PoE injector.

  • Speed. I.. am.. Speed.

    I have heard the opening to the Cars movie more times than I can count, and Owen Wilson’s little monologue always sticks in my head. Tangentially related: I recently logged in to Xfinity to check my data usage, which sent me down a path toward tracking data usage inside my network. I learned a lot about what to do, and what not to do.

    We are using how much data?

    Our internet went out on Sunday. This is not a normal occurrence, so I turned off the wifi on my phone and logged in to Xfinity’s website to report the problem. Out of sheer curiosity, after reporting the downtime, I clicked on the link to see our usage.

    22 TB… That’s right, 22 terabytes of data in February, and approaching 30TB for March. And there were still 12 days left in March! Clearly, something was going on.

    Asking the Unifi Controller

    I logged in to the Unifi controller software in the hopes of identifying the source of this traffic. I jumped into the security insights, traffic identification, and looked at the data for the last month. Not one client showed more than 25 GB of traffic in the last month. That does not match what Xfinity is showing at all.

    A quick Google search led me to a few posts suggesting that Unifi’s automated speed test can inflate your data usage without it showing up in the Unifi reporting. Mind you, these posts were 4+ years old, but I figured it was worth a shot. So I disabled the speed test in the Unifi controller, but I would have to wait a day to see if the Xfinity numbers changed.

    Fast forward a day – No change. According to Xfinity I was using something like 500GB of data per day, which is nonsense. My previous months were never higher than 2TB, so using that much data in 4 days means something is wrong.

    Am I being hacked?

    Thanks to “security first” being beaten into me by some of my previous security-focused peers, the first thought in my head was “Am I being hacked?” I looked through logs on the various machines and clusters, trying to find where this data was coming from and why. But nothing looked odd or out of the ordinary: no extra pods running calculations, no servers consuming huge amounts of memory or CPU. So where in the world was 50TB of data coming from?

    Unifi Poller to the Rescue

    At this point, I remembered that I have Unifi Poller running. The poller grabs data from my Unifi controller and puts it into Mimir. I started poking around the unpoller_ metrics until I found unpoller_device_bytes_total. Looking at that value for my Unifi Security Gateway, well, I saw this graph for the last 30 days:

    unpoller_device_bytes_total – Last 30 Days

    The scale on the right is bytes, so, 50,000,000,000,000 bytes, or roughly 50TB. Since I am not yet collecting the client DPI information with Unifi Poller, I just traced this data back to the start of this ramp up. It turned out to be February 14th at around 12:20 pm.
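
    If you want to keep an eye on this yourself, the same metric makes a decent dashboard panel. Something along these lines shows per-device volume over the last day (a sketch; the exact label names depend on your unpoller setup):

    # bytes moved per device over the last 24 hours, converted to terabytes
    sum by (name) (increase(unpoller_device_bytes_total[24h])) / 1e12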

    GitOps for the win

    Since my cluster states are stored in Git repos, any changes to the state of things are logged as commits to the repository, making it very easy to track back. Combing through my commits for 2/14 around noon, I found the offending commit in the speedtest-exporter (now you see the reference to Lightning McQueen).

    In an effort to move off of the k8s-at-home charts, which are no longer being maintained, I have switched over to creating charts using Bernd Schorgers’ library chart to manage some of the simple installs. The new chart configured the ServiceMonitor to scrape every minute, which meant, well, that I was running a speed test every minute. Of every day. For a month.

    To test my theory, I shut down the speedtest-exporter pod. Before the change, I was seeing 5 to 6 GB of traffic every 30 seconds; with the speed test executing hourly, I am seeing 90-150 MB of traffic every 30 seconds. Additionally, the graph is much more sporadic, which makes more sense: I would expect traffic to increase when my kids are home and watching TV, and decrease at night. What I was seeing before was a constant increase over time, which is what pointed me to the speed test. So I fixed the ServiceMonitor to only scrape once an hour, and I will check my data usage in Xfinity tomorrow to see how I did.
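
    For anyone making the same chart migration, the knob in question is the ServiceMonitor scrape interval. The fix amounts to something like this (a sketch; the port name and selector labels are placeholders for however the chart names things):

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: speedtest-exporter
    spec:
      endpoints:
        - port: http            # placeholder port name
          interval: 1h          # was 1m, i.e. a speed test every minute
          scrapeTimeout: 2m     # speed tests take a while to finish
      selector:
        matchLabels:
          app.kubernetes.io/name: speedtest-exporter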

    My apologies to my neighbors and Xfinity

    Breaking this down, I have been using something around 1TB of bandwidth per day over the last month. So, I apologize to my neighbors for potentially slowing everything down, and to Xfinity for running speed tests once a minute. That is not to say that I will stop running speed tests, but I will go back to testing once an hour rather than once a minute.

    Additionally, I’m using my newfound knowledge of the Unifi Poller metrics to write some alerts so that I can determine if too much data is coming in and out of the network.