I spent a considerable amount of time working through the provisioning scripts for my RKE2 nodes. Each node took between 25-30 minutes to provision. I felt like I could do better.
Check the tires
A quick evaluation of the process quickly made me realize that most of the time is spent in the full install of Ubuntu. Using the
hyperv-iso builder plugin from Packer, the machine would be provisioned from scratch. The installer took about 18-20 minutes to provision the VM fully. After that, the RKE2 install took about 1-2 minutes.
Speaking with my colleague Justin, it occurred to me that I could probably get away with building out a base image using the ISO provisioner and then using the
hyperv-vmcx provisioner to copy that base and create a new machine. In theory, that would cut the 18-20 minutes down to a copy job.
Test Flight Alpha: Initial Cluster Provisioning
A quick copy of my existing full provisioner and some judicious editing got me to the point where the
hyperv-vmcx provisioner was running great and producing a VM. I had successfully cut my provisioning time down to under 5 minutes!
I started editing my
Rke2-Provisioning Powershell module to utilize the quick provisioning rather than the full provisioning. So I spun up a test cluster with 4 nodes (3 servers and one agent) to make sure everything came up correctly. And within about 30 minutes, that four node cluster was humming along in a quarter of the time it had taken me before.
Test Flight Beta: Node Replacement
The next bit of testing was to ensure that as I ran the replacement script, new machines were provisioned correctly and old machines were torn down. This is where I ran into a snag, but it was a bit difficult to detect at first.
During the replacement, the first new node would come up fine, and the old node was properly removed and deleted. So, after the first cycle, I had one new node and one old node removed. However, I was getting somewhat random problems with the second, third, and fourth cycles. Most of the time, it was that the ETCD server, during Rancher provisioning, was picking up an IP address from the DHCP range, instead of using the fixed range tied to the MAC address.
I use the Unifi Controller to run my home network (Unifi Security Gateway and several access points). Through the Unifi APIs, and a wrapper API I wrote, I am able to generate a valid Hyper-V MAC address and associate it with a fixed IP on the Unifi before the Hyper-V is ever configured. When I create a new machine, I assign it the MAC address that was generated, and my DHCP server always assigns it the same address. This IP is outside of the allocated DHCP range for normal clients. I am working on publishing the Unifi IP Wrapper in a public repository for consumption.
Back to it..
As I was saying, even though I was assigning a MAC address that had an associated fixed IP, VMs provisioned after the first one seemed to be failing to pick up that IP. What was different?
Well, deleting a node returns its IP to the pool, so the process looks something like this:
- First new node provisioned (IP .45 assigned)
- First old node deleted (return IP .25 to the pool)
- Second new node provisioned (IP .25 assigned)
My assumption is that the Unifi does not like such a quick reassignment of a static IP to a new MAC Address. To test this, I modified the provisioner to first create ALL the new nodes before deleting nodes.
In that instance, the nodes provisioned correctly using their newly assigned IP. However, from a resource perspective, I hate the though of having to run 2n nodes during provisioning, when really all I need is n + 1.
Test Flight Charlie: Changing IP assignments
I modified my Unifi Wrapper API to cycle through the IP block I have assigned to my VMs instead of simply always using the lowest available IP. This allows me to go back to replacement one by one, without worrying about IP/MAC Address conflicts on the Unifi.
With this improvement, I have fewer qualms about configuring provisioning to run in the evenings. Most likely, I will build the base Ubuntu image weekly or bi-weekly to ensure I have the latest updates. From there, I can use the replacement scripts to replace old nodes with new nodes in the cluster.
I have not decided if I’m going to use a simple task scheduler in Windows, or use an Azure DevOps build agent on my provisioner… Given my recent miscue when installing the Azure DevOps Build Agent, I may opt for the former.