Tag: packer.io

  • Environment Woes

    No, this is not a post on global warming. As it turns out, I have been provisioning my Azure DevOps build agents somewhat incorrectly, at least for certain toolsets.

    Sonar kicks it off

    It started with this error in my build pipeline:

    ERROR: 
    
    The version of Java (11.0.21) used to run this analysis is deprecated, and SonarCloud no longer supports it. Please upgrade to Java 17 or later.
    As a temporary measure, you can set the property 'sonar.scanner.force-deprecated-java-version' to 'true' to continue using Java 11.0.21
    This workaround will only be effective until January 28, 2024. After this date, all scans using the deprecated Java 11 will fail.

    I provision my build agents using the Azure DevOps/GitHub Actions runner images repository, so I know it has multiple versions of Java. I logged in to the agent, and the necessary environment variables (including JAVA_HOME_17_X64) were present. However, adding the jdkversion input made no difference.

    - task: SonarCloudAnalyze@1
      inputs:
        jdkversion: 'JAVA_HOME_17_X64'

    I also tried using the JavaToolInstaller step to install Java 17 beforehand, and I got this error:

    ##[error]Java 17 is not preinstalled on this agent
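    For the record, the installer step I tried looked roughly like this (a sketch from memory; as I understand it, the PreInstalled option is what goes looking for JAVA_HOME_17_X64 on the agent):

    - task: JavaToolInstaller@0
      inputs:
        versionSpec: '17'
        jdkArchitectureOption: 'x64'
        jdkSourceOption: 'PreInstalled'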

    Now, as I said, I KNOW it’s installed. So, what’s going on?

    All about environment

    The build agent has the proper environment variables set. As it turns out, however, the build agent needs some special setup. Some research on my end led me to Microsoft’s page on Azure DevOps Linux agents, specifically, the section on environment variables.

    I checked my .env file in my agent directory, and it had a scrawny 5-6 entries. As a test, I added JAVA_HOME_17_X64 with a proper path as an entry in that file and restarted the agent. Presto! No more errors, and Sonar Analysis ran fine.
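    For anyone hitting the same wall, the entry I added looked something like this (the path is whatever JAVA_HOME_17_X64 already points to on your image; the Temurin path below is just an example):

    JAVA_HOME_17_X64=/usr/lib/jvm/temurin-17-jdk-amd64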

    Scripting for future agents

    With this in mind, I updated the script that installs my ADO build agent to include steps to copy environment variables from the machine to the agent’s .env file, so that the agent knows what is on the machine. After a couple of tests (I forgot a necessary sudo), I have a working provisioning script.
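    The gist of it, as a minimal PowerShell sketch (the runner image ships pwsh, so it runs there too; the agent path, the variable filter, and the service name are placeholders for whatever your install uses):

    # Mirror selected machine environment variables into the agent's .env file
    $agentEnvFile = '/opt/ado-agent/.env'   # placeholder agent directory

    # Grab the toolset variables the pipelines care about (just JAVA_HOME_* here)
    Get-ChildItem Env: |
        Where-Object { $_.Name -like 'JAVA_HOME_*' } |
        ForEach-Object { "$($_.Name)=$($_.Value)" } |
        Add-Content -Path $agentEnvFile

    # Restart the agent so it re-reads .env (this is where my missing sudo bit me)
    & sudo systemctl restart 'vsts.agent.myorg.mypool.myagent.service'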

  • Pack, pack, pack, they call him the Packer….

    Through sheer happenstance, I came across a posting for The Jaggerz playing near me and was taken back to my first time hearing “The Rapper.” I happened to go to school with one of the members’ kids, which made it all the more fun to reminisce.

    But I digress. I spent time a while back getting Packer running at home to take care of some of my machine provisioning. At work, I have been looking for an automated mechanism to keep some of our build agents up to date, so I revisited this and came up with a plan involving Packer and Terraform.

    The Problem

    My current problem centers on the need to update our machine images weekly while still using Terraform to manage our infrastructure. In the case of Azure DevOps, we can provision VM Scale Sets and assign those Scale Sets to an Azure DevOps agent pool. But when I want to update that image, I can do it in two different ways:

    1. Using Azure CLI, I can update the Scale Set directly.
    2. I can modify the Terraform repository to update the image and then re-run Terraform.

    Now, #1 sounds easy, right? Run a command and I’m done. But it defeats the purpose of Terraform, which is to maintain infrastructure as code. So, I started down path #2.

    Packer Revisit

    I previously used Packer to provision Hyper-V VMs, but the azure-arm builder is pretty similar. I was able to configure a simple Windows-based VM and get the only application I needed installed with a PowerShell script.

    One app? On a build agent? Yes, this is a very particular agent, and I didn’t want to install it everywhere, so I created a single agent image with the necessary software.

    Mind you, I have been using the runner-images Packer projects to build my Ubuntu agent at home, and we use them to build both Windows and Ubuntu images at work, so, by comparison, my project is wee tiny. But it gives me a good platform to test on. So I put a small repository together with a basic template and a PowerShell script to install my application, and it was time to build.
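    A trimmed-down sketch of the shape of that template is below. The location, SKUs, resource group, and script name are stand-ins rather than my actual values, and the credential variables are covered in the pipeline section:

    source "azure-arm" "windows2022" {
      # Service principal credentials come in through variables (see the env() trick below)
      client_id       = var.client_id
      client_secret   = var.client_secret
      subscription_id = var.subscription_id
      tenant_id       = var.tenant_id

      # Base marketplace image to build on
      os_type         = "Windows"
      image_publisher = "MicrosoftWindowsServer"
      image_offer     = "WindowsServer"
      image_sku       = "2022-datacenter"

      location = "eastus"
      vm_size  = "Standard_D4s_v4"

      # Where the finished managed image lands; the name comes from the pipeline
      managed_image_name                = var.vm_name
      managed_image_resource_group_name = "rg-agent-images"

      communicator   = "winrm"
      winrm_use_ssl  = true
      winrm_insecure = true
      winrm_timeout  = "5m"
      winrm_username = "packer"
    }

    build {
      sources = ["source.azure-arm.windows2022"]

      # The one application this agent needs, installed by a PowerShell script
      provisioner "powershell" {
        script = "./install-my-app.ps1"
      }
    }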

    Creating the Build Pipeline

    My build process should be, for all intents and purposes, one step that runs the packer build command, which will create the image in Azure. I found the PackerBuild@1 task and thought my job was done. It would seem that the Azure DevOps task hasn’t kept up with the times; either that, or Packer’s CLI needs help.

    I wanted to use the PackerBuild@1 task to take advantage of the service connection. I figured that if I could run the task with a service connection, I wouldn’t have to store service principal credentials in a variable library. As it turns out… well, I would have to do that anyway.

    When I tried to run the task, I got an error that “packer fix only supports json.” My template is in HCL format, and everything I have seen suggests that Packer would rather move to HCL. Not to be beaten, I looked at the code for the task to see if I could skip the fix step.

    Not only could I not skip that step, but when I dug into the task, I noticed that I wouldn’t be able to use the service connection parameter with a custom template. So with that, my dreams of using a fancy task went out the door.

    Plan B? Use Packer’s ability to grab environment variables as default values and set those environment variables in a PowerShell script before I run the Packer build. It is not super pretty, but it works.

    - pwsh: | 
        $env:ARM_CLIENT_ID = "$(azure-client-id)"
        $env:ARM_CLIENT_SECRET = "$(azure-client-secret)"
        $env:ARM_SUBSCRIPTION_ID = "$(azure-subscription-id)"
        $env:ARM_TENANT_ID = "$(azure-tenant-id)"
        Invoke-Expression "& packer build --var-file values.pkrvars.hcl -var vm_name=vm-image-$(Build.BuildNumber) windows2022.pkr.hcl"
      displayName: Build Packer
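    For reference, the other half of that trick lives in the template. Packer’s env() function is only allowed in variable defaults, so the credentials get wired up roughly like this (subscription and tenant follow the same pattern):

    variable "client_id" {
      type      = string
      sensitive = true
      default   = env("ARM_CLIENT_ID")
    }

    variable "client_secret" {
      type      = string
      sensitive = true
      default   = env("ARM_CLIENT_SECRET")
    }

    # subscription_id and tenant_id look the same, and the azure-arm source
    # consumes them as var.client_id, var.client_secret, and so on.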

    On To Terraform!

    The next step was terraforming the VM Scale Set. If you are familiar with Terraform, the VM Scale Set resource in the AzureRM provider is pretty easy to use. I used the Windows VM Scale Set resource, as my agents will be Windows-based. The only “trick” is finding the image you created, but, thankfully, that can be done by name using a data block.

    data "azurerm_image" "image" {
      name                = var.image_name
      resource_group_name = data.azurerm_resource_group.vmss_group.name
    }

    From there, just set source_image_id to data.azurerm_image.image.id, and you’re good. Why look this up by name? It makes automation very easy.
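    Wiring that into the scale set looks roughly like the sketch below. The SKU, names, and the subnet data source are placeholders; the part that matters is source_image_id pointing at the data block, plus a couple of settings recommended for Azure DevOps scale set pools (overprovisioning off, manual upgrades):

    resource "azurerm_windows_virtual_machine_scale_set" "agents" {
      name                = "vmss-build-agents"
      resource_group_name = data.azurerm_resource_group.vmss_group.name
      location            = data.azurerm_resource_group.vmss_group.location
      sku                 = "Standard_D4s_v4"
      instances           = 0   # Azure DevOps manages the instance count for the pool

      admin_username = var.admin_username
      admin_password = var.admin_password

      # The image Packer just built, looked up by name via the data block above
      source_image_id = data.azurerm_image.image.id

      overprovision          = false
      single_placement_group = false
      upgrade_mode           = "Manual"

      os_disk {
        caching              = "ReadWrite"
        storage_account_type = "Standard_LRS"
      }

      network_interface {
        name    = "nic-build-agents"
        primary = true

        ip_configuration {
          name      = "internal"
          primary   = true
          subnet_id = data.azurerm_subnet.agents.id   # placeholder subnet lookup
        }
      }
    }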

    Gluing the two together

    So I have a pipeline that builds an image, and I have another pipeline that executes the Terraform plan/apply steps. The latter is triggered on a commit to main in the Terraform repository, so how can I trigger a new build?

    All I really need to do is “reach in” to the Terraform repository, update the variable file with the new image name, and commit it. This can be automated, and I spent a lot of time doing just that as part of implementing our GitOps workflow. In fact, as I type this, I realize that I probably owe a post or two on how exactly we have done that. But, using some scripted git commands, it is pretty easy.

    So, my Packer build pipeline will check out the Terraform repository, change the image name in the variable file, and commit. This is where the image name is important: Packer doesn’t spit out the Azure image ID (at least, not that I saw), so having a known name makes it easy for me to just tell Terraform to use the new image name, and it uses that to look up the image ID.
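    A rough sketch of that stage of the Packer pipeline is below; the repository URL, the PAT variable, and the tfvars file name are illustrative rather than the real thing:

    - pwsh: |
        # Clone the Terraform repo (a PAT with push rights is assumed here)
        git clone "https://$(terraform-repo-pat)@dev.azure.com/myorg/infra/_git/terraform-agents" repo
        Set-Location repo

        # Point the image_name variable at the image this pipeline just built
        (Get-Content values.auto.tfvars) -replace 'image_name\s*=.*', 'image_name = "vm-image-$(Build.BuildNumber)"' |
          Set-Content values.auto.tfvars

        git config user.email "build@example.com"
        git config user.name "Packer Pipeline"
        git commit -am "Use image vm-image-$(Build.BuildNumber)"
        git push origin HEAD:main
      displayName: Update Terraform image name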

    What’s next?

    This was admittedly pretty easy, but only because I have been using Packer and Terraform for some time now. The learning curve is steep, but as I look across our portfolio, I can see areas where these types of practices can help us by allowing us to build fresh machine images on a regular cadence, and stop treating our servers as pets. I hope to document some of this for our internal teams and start driving them down a path of better deployment.

  • Speeding up Packer Hyper-V Provisioning

    I spent a considerable amount of time working through the provisioning scripts for my RKE2 nodes. Each node took 25-30 minutes to provision. I felt like I could do better.

    Check the tires

    A quick evaluation of the process made me realize that most of the time was spent in the full install of Ubuntu. Using Packer’s hyperv-iso builder, each machine was provisioned from scratch: the installer took about 18-20 minutes to provision the VM fully, and after that, the RKE2 install took about 1-2 minutes.

    Speaking with my colleague Justin, it occurred to me that I could probably get away with building out a base image using the hyperv-iso builder and then using the hyperv-vmcx builder to copy that base and create a new machine. In theory, that would cut the 18-20 minutes down to a copy job.
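    The shape of that change is sketched below: instead of an ISO, boot commands, and an installer run, the source just clones an exported base VM. The paths, names, and sizing here are placeholders:

    source "hyperv-vmcx" "ubuntu" {
      # Clone the exported base image instead of installing Ubuntu from scratch
      clone_from_vmcx_path = "C:/hyperv/exports/ubuntu-base"

      vm_name     = var.vm_name
      switch_name = "LabSwitch"
      memory      = 4096
      cpus        = 2

      communicator = "ssh"
      ssh_username = var.ssh_username
      ssh_password = var.ssh_password
      ssh_timeout  = "20m"

      shutdown_command = "echo '${var.ssh_password}' | sudo -S shutdown -P now"
    }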

    Test Flight Alpha: Initial Cluster Provisioning

    A quick copy of my existing full template and some judicious editing got me to the point where the hyperv-vmcx builder was running great and producing a VM. I had successfully cut my provisioning time down to under 5 minutes!

    I started editing my Rke2-Provisioning PowerShell module to use the quick provisioning rather than the full provisioning. Then I spun up a test cluster with four nodes (three servers and one agent) to make sure everything came up correctly. Within about 30 minutes, that four-node cluster was humming along, in roughly a quarter of the time it had taken me before.

    Test Flight Beta: Node Replacement

    The next bit of testing was to ensure that as I ran the replacement script, new machines were provisioned correctly and old machines were torn down. This is where I ran into a snag, but it was a bit difficult to detect at first.

    During the replacement, the first new node would come up fine, and the old node was properly removed and deleted. So, after the first cycle, I had one new node and one old node removed. However, I was getting somewhat random problems with the second, third, and fourth cycles. Most of the time, it was that the etcd server, during Rancher provisioning, was picking up an IP address from the DHCP range instead of using the fixed IP tied to the MAC address.

    Quick Explanation

    I use the Unifi Controller to run my home network (Unifi Security Gateway and several access points). Through the Unifi APIs, and a wrapper API I wrote, I am able to generate a valid Hyper-V MAC address and associate it with a fixed IP on the Unifi before the Hyper-V VM is ever created. When I create a new machine, I assign it the MAC address that was generated, and my DHCP server always assigns it the same address. This IP is outside of the allocated DHCP range for normal clients. I am working on publishing the Unifi IP Wrapper in a public repository for consumption.
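    The MAC half of that is simple enough to sketch: Hyper-V’s dynamic MACs live under the 00:15:5D prefix, so the wrapper just rolls three random bytes and registers the result. The endpoint and payload below are stand-ins, not the real API:

    # Generate a MAC in Hyper-V's 00:15:5D range
    $suffix = (1..3 | ForEach-Object { '{0:X2}' -f (Get-Random -Maximum 256) }) -join ':'
    $mac    = "00:15:5D:$suffix"

    # Register it with the wrapper so the Unifi hands out a fixed IP for that MAC
    $body = @{ mac = $mac; name = 'rke2-server-01' } | ConvertTo-Json
    Invoke-RestMethod -Method Post -Uri 'http://unifi-ip-manager.local/api/clients' -Body $body -ContentType 'application/json'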

    Back to it..

    As I was saying, even though I was assigning a MAC address that had an associated fixed IP, VMs provisioned after the first one seemed to be failing to pick up that IP. What was different?

    Well, deleting a node returns its IP to the pool, so the process looks something like this:

    • First new node provisioned (IP .45 assigned)
    • First old node deleted (return IP .25 to the pool)
    • Second new node provisioned (IP .25 assigned)

    My assumption is that the Unifi does not like such a quick reassignment of a static IP to a new MAC Address. To test this, I modified the provisioner to first create ALL the new nodes before deleting nodes.

    In that instance, the nodes provisioned correctly using their newly assigned IPs. However, from a resource perspective, I hate the thought of having to run 2n nodes during provisioning, when really all I need is n + 1.

    Test Flight Charlie: Changing IP assignments

    I modified my Unifi Wrapper API to cycle through the IP block I have assigned to my VMs instead of simply always using the lowest available IP. This allows me to go back to replacement one by one, without worrying about IP/MAC Address conflicts on the Unifi.

    Landing it…

    With this improvement, I have fewer qualms about configuring provisioning to run in the evenings. Most likely, I will build the base Ubuntu image weekly or bi-weekly to ensure I have the latest updates. From there, I can use the replacement scripts to replace old nodes with new nodes in the cluster.

    I have not decided if I’m going to use a simple task scheduler in Windows, or use an Azure DevOps build agent on my provisioner… Given my recent miscue when installing the Azure DevOps Build Agent, I may opt for the former.

  • A big mistake and a bit of bad luck…

    In the Home Lab, things were going good. Perhaps a little too good. A bonehead mistake on my part and hardware failure combined to make another ridiculous weekend. I am beginning to think this blog is becoming “Matt messed up again.”

    Permissions are a dangerous thing

    I wanted to install the Azure DevOps agent on my hypervisor to allow me to automate and schedule provisioning of new machines. That would allow the provisioning to occur overnight and be overall less impactful. And it is always a bonus when things just take care of themselves.

    I installed the agent, but it would not start. It was complaining that it needed permissions to basically the entire drive where it was installed. Before really researching or thinking too much about it, I set about giving the service group access to the root of the drive.

    Now, in retrospect, I could have opened the share on my laptop (\\machinename\c$), right-clicked in the blank area, and chosen Properties from there, which would have gotten me into the Security menu. I did not realize that at the time, and I used the Set-Acl PowerShell cmdlet instead.

    What I did not realize is that Set-Acl performs a full replacement; it is not additive. So, while I thought I was adding permissions for a group, what I was really doing was REMOVING EVERYONE ELSE’S PRIVILEGES from the drive and replacing them with group access. I realized my error when I simply had no access to the C: drive…
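    For the record, the additive version reads the existing ACL, adds a rule, and writes the same object back (the group name and rights here are just examples):

    # Read the existing ACL, add one rule, and write the whole thing back
    $acl  = Get-Acl -Path 'C:\'
    # Grant the service group Modify rights, inherited by child folders and files
    $rule = New-Object System.Security.AccessControl.FileSystemAccessRule('MYDOMAIN\AgentServiceGroup', 'Modify', 'ContainerInherit,ObjectInherit', 'None', 'Allow')
    $acl.AddAccessRule($rule)
    Set-Acl -Path 'C:\' -AclObject $acl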

    I thought I got it back…

    After panicking a bit, I realized that what I had added wasn’t a user, but a group. I was able to get into the Group Policy editor for the server and add the Domain Admins group to that service group, which got my user account access. From there, I started rebuilding permissions on the C drive. Things were looking up.

    I was wrong…

    Then, I decided it would be a good idea to install Windows updates on the server and reboot. That was a huge mistake. The server got into a boot loop, where it would boot, attempt to do the updates, fail, and reboot, starting the process over again. It got worse…

    I stopped the server completely during one of the boot cycles for a hard shutdown/restart. When the server POSTed again, the message said, well, basically, that the cache module in the server was no longer there, so it shut off access to my logical drives… All of them.

    What does that mean, exactly? Long story short, my HP Gen8 server has a Smart Array that had a 1GB write cache card in it. That card is, as best I can tell, dead. However, there was a 512MB write cache card in my backup server. I tried a swap, and it was not recognized either. So, potentially, the cache port itself is dead. Either way, my drives were gone.

    Now what?

    I was pretty much out of options. While my data was safe and secure on the Synology, all of my VMs were down for the count. My only real option was to see if I could get the server to re-mount the drives without the cache and start rebuilding.

    I set up the drives in the same configuration I had previously. I have two 146GB drives and two 1TB drives, so I paired them up into two RAID 1 arrays. I caught a break: the machine recognized the previous drives, and I did not lose any data. Now, the C drive was, well, toast: I believe my Set-Acl snafu put that Windows install out of commission. But all of my VMs were on the D drive.

    So I re-installed Hyper-V Server 2019 on the server and got to work attempting to import and start VMs. Once I got connected to the server, I was able to re-import all of my Ubuntu VMs, which are my RKE2 nodes. They started up, and everything was good to go.

    There was a catch…

    Not everything came back. Specifically, ALL of my Windows VMs would not boot. They imported fine, but when it came time to boot, I got a “File not found” exception. I honestly have no idea why. I even had a backup of my Domain Controller, taken using Active Backup for Business on the Synology. I was able to restore it; however, it would not start, throwing the same error.

    My shot in the dark is that it comes down to the way the machines were built: I had provisioned the Windows machines manually, while the Ubuntu machines were built with Packer. I’m wondering if the export/import process that is part of the Packer workflow may have moved some vital files into place, files that were lost on the manually provisioned machines because those actions do not occur with a manual provision.

    At this point, I’m rebuilding my Windows machines (domain controllers and SQL servers). Once that is done, I will spend some time experimenting on a few test machines to make sure my backups are working… I suppose that’s what disaster recovery tests are for.

  • Packer.io : Making excess too easy

    As I was chatting with a former colleague the other day, I realized that I have been doing some pretty diverse work as part of my home lab. A quick scan of my posts in this category reveals a host of topics ranging from home automation to Python monitoring to Kubernetes administration. One of my colleague’s questions was something to the effect of “How do you have time to do all of this?”

    As I thought about it for a minute, I realized that all of my Kubernetes research would not have been possible if I had not first taken the opportunity to automate the process of provisioning Hyper-V VMs. In my Kubernetes experimentation, I have easily provisioned 35-40 Ubuntu VMs, and then promptly broken two-thirds of them through experimentation. Thinking about taking the time to install Ubuntu and provision it before I could start work, well, that would have been a non-starter.

    It started with a build…

    In my corporate career, we have been moving more towards Azure DevOps and away from TeamCity. To date, I am impressed with Azure DevOps. Pipelines-as-code appeals to my inner geek, and not having to maintain a server and build agents has its perks. I had visions of migrating from TeamCity to Azure DevOps, hoping I could take advantage of Microsoft’s generosity with small teams. Alas, Azure DevOps is free for small teams ONLY if you self-host your build agents, which meant a small amount of machine maintenance. I wanted to be able to self-host agents with the same software that Microsoft uses for their GitHub Actions/Azure DevOps agents. After reading through the GitHub Virtual Environments repository, I determined it was time to learn Packer.

    The build agents for GitHub/Azure DevOps are provisioned using Packer. My initial hope was that I would just be able to clone that repository, run Packer, and voilà! It’s never that easy. The Packer projects in that repository are designed to provision VM images that run in Azure, not on Hyper-V. Provisioning Hyper-V machines is possible with Packer, but it requires different template files and some tweaking of the provisioning scripts. Without getting too much into the weeds, Packer uses different builders for Azure and Hyper-V. So I had to grab all the provisioning scripts I wanted from the template files in the Virtual Environments repository, but configure a builder for Hyper-V. Thankfully, Nick Charlton provided a great starting point for automating Ubuntu 20.04 installs with Packer. From there, I was off to the races.
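    For the curious, the Hyper-V side of that ends up looking roughly like the sketch below (a hyperv-iso source; the ISO, sizing, and names are examples, and the boot_command that kicks off the cloud-init autoinstall is omitted since Nick Charlton’s post covers it better than a snippet can):

    source "hyperv-iso" "ubuntu" {
      iso_url      = "https://releases.ubuntu.com/20.04/ubuntu-20.04.6-live-server-amd64.iso"
      iso_checksum = "file:https://releases.ubuntu.com/20.04/SHA256SUMS"

      generation         = 2
      enable_secure_boot = false
      switch_name        = "LabSwitch"
      memory             = 4096
      cpus               = 2
      disk_size          = 61440   # MB

      # Packer serves the cloud-init autoinstall files (user-data/meta-data) from here;
      # boot_command (omitted) points the installer at http://{{ .HTTPIP }}:{{ .HTTPPort }}/
      http_directory = "http"
      boot_wait      = "5s"

      communicator = "ssh"
      ssh_username = "ubuntu"
      ssh_password = var.ssh_password
      ssh_timeout  = "30m"

      shutdown_command = "echo '${var.ssh_password}' | sudo -S shutdown -P now"
    }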

    Enabling my excess

    Through probably 40 hours of trial and error, I got to the point where I was building my own build agents and hooking them up to my Azure DevOps account. It should be noted that fully provisioning a build agent takes six to eight hours, so most of that 40 hours was “fire and forget.” With that success, I started to think: “Could I provision simpler Ubuntu servers and use those to experiment with Kubernetes?”

    The answer, in short, is “Of course!” I went about creating some PowerShell scripts and Packer templates so that I could provision various levels of Ubuntu servers. I have shared those scripts, along with my build agent provisioning scripts, in my provisioning-projects repository on GitHub. With those scripts, I was off to the races, provisioning new machines at will. It is remarkable the risks you will take in a lab environment, knowing that you are only 20-30 minutes away from a clean machine should you mess something up.

    A note on IP management

    If you dig into the repository above, you may notice some scripts and code around provisioning a MAC address from a “Unifi IP Manager.” I created a small API wrapper that utilizes the Unifi Controller APIs to create clients with fixed IP addresses. The API generates a random, but valid, MAC Address for Hyper-V, then uses the Unifi Controller API to assign a fixed IP.

    That project isn’t quite ready for public consumption, but if you are interested, drop a comment on this post.