Tag: Proxmox

  • Proxmox, VLANs, and the Bridge Configuration That Almost Broke Me

    After successfully migrating 44 wireless devices to proper VLANs, I felt pretty good about myself.

    • Wireless segmentation: ✅
    • Zone-based firewalls: ✅

    Time to tackle the infrastructure, right? Well, Proxmox had other plans.

    The Plan

    I have two Proxmox hosts running my homelab:

    • pmxdell: A Dell laptop with one VM (Azure DevOps agent)
    • pxhp: An HP ProLiant with 17 VMs (three Kubernetes clusters)

    The goal was simple:

    1. Move Proxmox management interfaces to VLAN 60 (Services)
    2. Move VMs to VLAN 50 (Lab)
    3. Celebrate victory

    The execution? Well, let’s just say I learned some things about Linux bridge VLANs that the documentation doesn’t emphasize enough.

    Day 1: pmxdell and False Confidence

    I started with pmxdell because it was the simpler host—just one VM to worry about. I configured a VLAN-aware bridge, added the management IP on VLAN 60, and restarted networking.
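
    For reference, the management address ended up on a VLAN 60 sub-interface of the bridge, roughly like this (a sketch; the /24 prefix and gateway are assumptions about my VLAN 60 subnet, not copied from the host):

    auto vmbr0.60
    iface vmbr0.60 inet static
        # prefix and gateway assumed for illustration
        address 192.168.60.11/24
        gateway 192.168.60.1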

    Everything worked. pmxdell came back up on 192.168.60.11. SSH worked. The Proxmox web interface was accessible. I was a networking wizard.

    Then I tried to migrate the VM to VLAN 50.

    qm set 30000 --net0 virtio,bridge=vmbr0,tag=50
    
    qm start 30000

    The VM started. It got… no IP address. DHCP requests disappeared into the void. The VM had no network connectivity whatsoever.

    The Investigation

    My first thought: firewall issue. But the firewall rules were correct—LAB zone could access WAN for DHCP.

    Second thought: DHCP server problem. But other devices on VLAN 50 worked fine.

    Third thought: Maybe I need to restart the VM differently? I stopped it, started it, rebooted it, sacrificed it to the networking gods. Nothing.

    Then I ran bridge vlan show:

    port              vlan-id
    enp0s31f6         1 PVID Egress Untagged
                      50
                      60
    vmbr0             1 PVID Egress Untagged
                      60

    See the problem? VLAN 50 is on the physical interface (enp0s31f6), but not on the bridge device itself (vmbr0). The tap interface for the VM had nowhere to attach to.

    The “bridge-vids” Revelation

    My /etc/network/interfaces configuration looked like this:

    auto vmbr0
    
    iface vmbr0 inet manual
        bridge-ports enp0s31f6
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 1 50 60

    I had assumed—like a reasonable person who reads documentation—that `bridge-vids 1 50 60` would add those VLANs to the entire bridge configuration.

    Wrong.

    bridge-vids only applies VLANs to the bridge ports (the physical interface). It doesn’t touch the bridge device itself. The bridge device needs VLANs added explicitly.

    Why does this matter? Because when Proxmox creates a tap interface for a VM with a VLAN tag, it needs to add that tap interface as a member of that VLAN *on the bridge device*. If the bridge device doesn’t have that VLAN, the tap interface can’t join it.

    VLAN 1 works automatically because it’s the default PVID on bridge devices. Any other VLAN? You have to add it manually.
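
    You can prove that to yourself at runtime before touching any config files; this is just the non-persistent version of the fix below:

    bridge vlan add dev vmbr0 vid 50 self
    bridge vlan show dev vmbr0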

    The Fix

    The solution was adding explicit post-up commands:

    auto vmbr0
    
    iface vmbr0 inet manual
        bridge-ports enp0s31f6
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 1 50 60
        post-up bridge vlan add dev vmbr0 vid 50 self
        post-up bridge vlan add dev vmbr0 vid 60 self

    Applied the changes, stopped the VM, started it again (not restart—stop then start), and suddenly: DHCP lease acquired. VM online. Victory.
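
    For the record, that was a full power cycle of the guest, not qm reboot:

    qm stop 30000
    qm start 30000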

    Day 2: pxhp and the Networking Service Trap

    Armed with my new knowledge, I confidently configured pxhp. Four NICs bonded in LACP, VLAN-aware bridge, proper `post-up` commands. I even remembered to configure the bridge with VLAN 50 support from the start.
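
    The bridge side of that config looked roughly like this (a sketch; the NIC names and bond options below are placeholders rather than the exact values from pxhp):

    auto bond0
    iface bond0 inet manual
        # slave names and bond tuning are illustrative
        bond-slaves eno1 eno2 eno3 eno4
        bond-mode 802.3ad
        bond-miimon 100

    auto vmbr0
    iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 1 50 60
        post-up bridge vlan add dev vmbr0 vid 50 self
        post-up bridge vlan add dev vmbr0 vid 60 self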

    Then I made a critical mistake: I ran systemctl restart networking.

    All 17 VMs instantly lost network connectivity.

    Why Restarting Networking is Evil

    When you run systemctl restart networking on a Proxmox host:

    1. The bridge goes down
    2. All tap interfaces are removed
    3. All VMs lose their network connection
    4. The bridge comes back up
    5. The tap interfaces… don’t automatically recreate

    Your VMs are now running but completely isolated from the network. You have to stop and start each VM to recreate its tap interface.

    The Better Approach: Shut down the VMs first, then restart networking. Or just reboot the entire host and let the VMs come back up automatically with proper tap interfaces.

    I learned this the hard way when I had to stop and start 17 VMs manually. In the middle of the migration. With production workloads running.
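
    If you ever end up in the same spot, a quick loop beats doing it by hand seventeen times (a sketch that assumes every running VM on the host needs cycling):

    # power-cycle each running VM to recreate its tap interface
    for vmid in $(qm list | awk '$3 == "running" {print $1}'); do
        qm stop "$vmid"
        qm start "$vmid"
    done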

    Day 3: Kubernetes and the Blue-Green Migration

    With both Proxmox hosts properly configured, it was time to migrate the Kubernetes clusters. I had three:

    • Non-production (3 VMs)
    • Internal (8 VMs)
    • Production (5 VMs)

    The problem: Kubernetes nodes can’t easily change IP addresses. The IP is baked into etcd configuration, SSL certificates, and about seventeen other places. Changing IPs means major surgery with significant downtime risk.

    The Solution: Blue-green deployment, Kubernetes-style.

    1. Provision new nodes on VLAN 50
    2. Join them to the existing cluster (now you have old + new nodes)
    3. Drain workloads from old nodes to new nodes
    4. Remove old nodes from the cluster
    5. Delete old VMs

    No IP changes. No etcd reconfiguration. No downtime. Just gradual migration while workloads stay running.
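
    Steps 3 and 4 are plain kubectl; the node name below is illustrative, and the right drain flags depend on what you run:

    kubectl cordon old-agent-1
    kubectl drain old-agent-1 --ignore-daemonsets --delete-emptydir-data
    kubectl delete node old-agent-1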

    I started with the non-production cluster as a test. The entire migration took maybe 30 minutes, and the cluster never went offline. Workloads migrated seamlessly from old nodes to new nodes.

    As of today, I’m 1 cluster down, 2 to go. The non-production cluster has been running happily on VLAN 50 for a few hours with zero issues.

    What I Learned

    1. bridge-vids is a lie. Or rather, it’s not a lie—it just doesn’t do what you think it does. It configures bridge ports, not the bridge device. Always add explicit post-up commands for VLAN membership.
    2. Never restart networking on Proxmox with running VMs. Just don’t. Either shut down the VMs first, or reboot the whole host. Future you will thank past you.
    3. Blue-green migrations work brilliantly for Kubernetes. Provision new nodes, migrate workloads, remove old nodes. No downtime, no drama.
    4. Stop and start, never restart. When you change VM VLAN configuration, you need to stop the VM then start it. Restart doesn’t recreate the tap interface with new VLAN membership.
    5. Test on simple hosts first. I started with pmxdell (1 VM) before tackling pxhp (17 VMs). That saved me from debugging VLAN issues with production workloads running.

    The Current State

    Infrastructure migration progress:

    • ✅ Proxmox hosts: Both on VLAN 60 (Management)
    • ✅ Kubernetes (non-prod): 3 VMs on VLAN 50
    • ⏳ Kubernetes (internal): 7 VMs, still to be migrated to VLAN 50
    • ⏳ Kubernetes (production): 5 VMs, still to be migrated to VLAN 50

    Next steps: Monitor the clusters for 24-48 hours, then migrate internal cluster. Production cluster goes last because I’m not completely reckless.

    You’re missing an agent…

    The astute among you may notice that my internal cluster went from 8 nodes to 7. As I was cycling nodes, I took the time to check the resources on that cluster, and realized that some unrelated work to consolidate observability tools let me scale down to 4 agents. My clusters have started the year off right by losing a little weight.

    Part 3 of the home network rebuild series. Read Part 2: From “HideYoWifi” to “G-Unit”

  • Busy Summer…

    A lot has gone on this summer. Work efforts have kept me busy, and I have spent a lot of “off” time researching ways to improve our services at work. That said, I have had some time to get a few things done at home.

    Proxmox Move

    I was able to get through my move to Proxmox servers. It was executed, roughly, as follows:

    • Moved my Windows VMs, using this post as a guide.
    • Created new scripts to provision RKE2 nodes in Proxmox.
    • Provisioned new RKE2 node VMs on my temporary Proxmox node, effectively migrating the clusters from Hyper-V to Proxmox.
    • Wiped and installed Proxmox on my server.
    • Provisioned new RKE2 node VMs on my new server, effectively migrating the clusters (again) to the new server.

    I have noticed that, when provisioning new machines via a VM clone, my IO delay gets a bit high, and some of the other VMs don’t like that. For now, it’s manageable, as I don’t provision often, but as I plan out a new cluster, disk IO is something to keep in mind.
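
    My provisioning is a VM clone, so the command in play is roughly this (the IDs and name are illustrative); a full clone copies the entire disk, which would explain the IO delay:

    qm clone 9000 30101 --full --name rke2-agent-01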

    Moving my DNS

    I moved my DNS server to my Unifi Cloud Gateway Max. The Unifi Controller running there has been very stable, and I am already communicating with its API to provision fixed IPs based on MAC addresses, so adding local DNS records was the next step.

    Thankfully, I had rebuilt my Windows domain to use a different domain name than the one I use for normal DNS routing. So I was able to move my routing domain to the UCG and add a forwarding record for my Windows domain. At that point, the only machines still relying on the Windows domain’s DNS were the domain-joined ones.

    Getting rid of the domain

    At this point, I am considering decommissioning my Windows Domain. However, I have a few more moves to make before that happens. As luck would have it, I have some ideas as to how to make it work. Unfortunately for my readers, that will come in a later post.

    Oh, and, another teaser…. I printed a new server rack. More show and tell later!

  • Summer Project – Home Lab Refactor

    Like any good engineer, I cannot leave well enough alone. My current rainy day project is reconfiguring my home lab for some much-needed updates and simplification.

    What’s Wrong?

    My home lab is, well, still going strong. My automation scripts work well, and I don’t spend a ton of time doing what I need to do to keep things up to date, at least when it comes to my Kubernetes clusters.

    The other servers, however, are in a scary spot. Everything is running on top of the free version of Windows Hyper-V Server from 2019, so general updates are a concern. I would LOVE to move to Windows Server 2025, but I do not have the money for that kind of endeavor.

    The other issue with running a Windows Server is that, well, it usually expects a Windows Domain (or, at least, my version does). This requirement has forced me to run my own domain controllers for a number of years now. Earlier iterations of my lab included a lot of Windows VMs, so the domain helped me manage authentication across them all. But, with RKE2 and Kubernetes running the bulk of my workloads, the domain controllers are more hassle than anything right now.

    The Plan

    My current plan is to migrate my home server to Proxmox. It seems a pretty solid replacement for Hyper-V, and has a few features in it that I may use in the future, like using cloud-init for creating new cluster nodes and better management of storage.

    Obviously, this is going to require some testing, and luckily, my old laptop is free for some experimentation. So I installed Proxmox there and messed around, and I came up with an interesting plan.

    • Migrate my VMs to my laptop instance of Proxmox, reducing the workload as much as I can.
    • Install Proxmox on my server.
    • Create a Proxmox cluster with my laptop and server as the nodes.
    • Transfer my VMs from the laptop node to the server node.

    Cutting my Workload

    My laptop has a paltry 32GB of RAM, compared to 288GB in my server. While I need to get everything “over” to the laptop, it doesn’t all have to be running at the same time.

    For the Windows VMs, my current plan is as follows:

    • Move my primary domain controller to the laptop, but run at a reduced capacity (1 CPU/2GB).
    • Move my backup DC to the laptop, shut it down.
    • Move and shut down both SQL Server instances: they are only running lab DBs, nothing really vital.

    For my clusters, I’m not actually going to “move” the VMs. I’m going to create new nodes on the laptop’s Proxmox instance, add them to the clusters, and then deprovision the old ones. This gives me some control over what’s there.

    • Non-Production Cluster -> 1 control plane server, 2 agents, but shut them down.
    • Internal Cluster -> 1 control plane server (down from 3), 3 agents, all shut down.
    • Production Cluster -> 1 control plane (down from 3), 2 agents, running vital software. I may need to migrate my HC Vault instance to the production cluster just to ensure secrets stay up and running.

    With this setup, I should really only have 4 VMs running on my laptop, which it should be able to handle. Once that’s done, I’ll have time to install and configure Proxmox on the server, and then move VMs from the laptop to the server.

    Lots to do

    I have a lot of learning to do. Proxmox seems pretty simple to start, but I find I’m having to read a lot about the cloning and cloud-init pieces to really make use of the power of the tool.
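
    The flow I keep reading about is “clone a template, then let cloud-init stamp the per-node identity,” roughly like this (the IDs, address, and key path are made up for illustration):

    qm clone 9000 401 --full --name rke2-test
    qm set 401 --ipconfig0 ip=192.168.1.50/24,gw=192.168.1.1
    qm set 401 --sshkeys ~/.ssh/id_rsa.pub
    qm start 401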

    Once I feel comfortable with Proxmox, the actual move will need to be scheduled… So, maybe by Christmas I’ll actually have this done.