Author: Matt

  • Publishing Code Coverage in both Azure DevOps and SonarQube

    I spent more time than I care to admit trying to get the proper configuration for reporting code coverage to both the Azure DevOps pipeline and SonarQube. The solution was, well, fairly simple, but it is worth me writing down.

    Testing, Testing…

    After fumbling around with some of the linting and publishing to SonarQube’s Community Edition, I succeeded in creating build pipelines which, when building from the main branch, will run SonarQube analysis and publish to the project.

    I modified my build template as follows:

    - ${{ if eq(parameters.execute_sonar, true) }}:
        # Prepare Analysis Configuration task
        - task: SonarQubePrepare@5
          inputs:
            SonarQube: ${{ parameters.sonar_endpoint_name }}
            scannerMode: 'MSBuild'
            projectKey: ${{ parameters.sonar_project_key }}
    
      - task: DotNetCoreCLI@2
        displayName: 'DotNet restore packages (dotnet restore)'
        inputs:
          command: 'restore'
          feedsToUse: config
          nugetConfigPath: "$(Build.SourcesDirectory)/nuget.config"
          projects: "**/*.csproj"
          externalFeedCredentials: 'feedName'
    
      - task: DotNetCoreCLI@2
        displayName: Build (dotnet build)
        inputs:
          command: build
          projects: ${{ parameters.publishProject }}
          arguments: '--no-restore --configuration ${{ parameters.BUILD_CONFIGURATION }} /p:InformationalVersion=$(fullSemVer) /p:AssemblyVersion=$(AssemblySemVer) /p:AssemblyFileVersion=$(AssemblySemFileVer)'
          
    
    ## Test steps are here, details below
    
      - ${{ if eq(parameters.execute_sonar, true) }}:
        - powershell: |
            $params = "$env:SONARQUBE_SCANNER_PARAMS" -replace '"sonar.branch.name":"[\w,/,-]*"\,?'
            Write-Host "##vso[task.setvariable variable=SONARQUBE_SCANNER_PARAMS]$params"
    
        # Run Code Analysis task
        - task: SonarQubeAnalyze@5
    
        # Publish Quality Gate Result task
        - task: SonarQubePublish@5
          inputs:
            pollingTimeoutSec: '300'

    In the code above, the execute_sonar parameter allows me to execute the SonarQube steps only on the main branch, which keeps the Community Edition happy while retaining the rest of my pipeline definition on feature branches.
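    For reference, the flag itself is computed when the template is consumed. Below is a rough sketch of one way the calling pipeline can do that; the template path and the parameter values other than execute_sonar are placeholders, not my actual files:

    steps:
      - template: templates/build-steps.yml   # placeholder path to the build template
        parameters:
          # Only run the SonarQube steps when the pipeline is building main
          execute_sonar: ${{ eq(variables['Build.SourceBranch'], 'refs/heads/main') }}
          sonar_endpoint_name: 'MySonarQube'   # placeholder service connection name
          sonar_project_key: 'my-project-key'  # placeholder project key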

    This configuration worked, and my project’s analysis showed up in SonarQube.

    Testing, Testing, 1, 2, 3…

    I went about adding some trivial unit tests in order to verify that I could get code coverage publishing. I have had experience with Coverlet in the past, and it allows for generation of various coverage report formats, so I went about adding it to the project using the simple collector.

    Within my build pipeline, I added the following (if you are referencing the snippet above, this is in place of the comment):

    - ${{ if eq(parameters.execute_tests, true) }}:
        - task: DotNetCoreCLI@2
          displayName: Test (dotnet test)
          inputs:
            command: test
            projects: '**/*tests/*.csproj'
            arguments: '--no-restore --configuration ${{ parameters.BUILD_CONFIGURATION }} --collect:"XPlat Code Coverage"'
        

    Out of the box, this command worked well: tests were executed, and both Tests and Coverage information were published to Azure DevOps. Apparently, the DotNetCoreCLI@2 task defaults publishTestResults to true.

    While the tests ran, the coverage was not published to SonarQube. I had hoped that the SonarQube Extension for Azure DevOps would pick this up, but, alas, this was not the case.

    Coverlet, DevOps, and SonarQube… oh my.

    As it turns out, you have to tell SonarQube explicitly where to find the coverage report. And, while the SonarQube documentation is pretty good at describing how to report coverage from the CLI, the Azure DevOps integration documentation does not spell out how to accomplish this, at least not outright. Also, while Azure DevOps recognizes Cobertura coverage reports, SonarQube prefers OpenCover reports.

    I had a few tasks ahead of me:

    1. Generate my coverage reports in two formats.
    2. Get a specific location for the coverage reports in order to pass that information to SonarQube.
    3. Tell the SonarQube analyzer where to find the OpenCover reports.

    Generating Multiple Coverage Reports

    As mentioned, I am using Coverlet to collect code coverage. Coverlet allows for the usage of RunSettings files, which helps to standardize various settings within the test project. It also allowed me to generate two coverage reports in different formats. I created a coverlet.runsettings file in my test project’s directory and added this content:

    <?xml version="1.0" encoding="utf-8" ?>
    <RunSettings>
      <DataCollectionRunSettings>
        <DataCollectors>
          <DataCollector friendlyName="XPlat code coverage">
            <Configuration>
              <Format>opencover,cobertura</Format>          
            </Configuration>
          </DataCollector>
        </DataCollectors>
      </DataCollectionRunSettings>
    </RunSettings>

    To make it easy on my build and test pipeline, I added the following property to my test project’s csproj file:

    <PropertyGroup>
      <RunSettingsFilePath>$(MSBuildProjectDirectory)\coverlet.runsettings</RunSettingsFilePath>
    </PropertyGroup>

    There are a number of Coverlet settings that can be configured in the runsettings file; the full list can be found in the Coverlet VSTest Integration documentation.

    Getting Test Result Files

    As mentioned above, the DotNetCoreCLI@2 task defaults publishTestResults to true. This setting adds arguments to the test command to enable trx logging and set a results directory. However, it also means I am not able to specify a results directory on my own.

    Even specifying the directory myself does not fully solve the problem: running the tests with coverage and trx logging generates a trx file named after the username, computer name, and timestamp, AND two sets of coverage reports: one set is stored under the username/computer name/timestamp folder, and the other under a random GUID.

    To ensure I only pulled one set of tests, and that Azure DevOps didn’t complain, I updated my test execution to look like this:

      - ${{ if eq(parameters.execute_tests, true) }}:
        - task: DotNetCoreCLI@2
          displayName: Test (dotnet test)
          inputs:
            command: test
            projects: '**/*tests/*.csproj'
            publishTestResults: false
            arguments: '--no-restore --configuration ${{ parameters.BUILD_CONFIGURATION }} --collect:"XPlat Code Coverage" --logger trx --results-directory "$(Agent.TempDirectory)"'
        
        - pwsh: |
            Push-Location $(Agent.TempDirectory);
            # Create a folder to hold the coverage files we want to publish
            mkdir "ResultFiles";
            $resultFiles = Get-ChildItem -Directory -Filter "ResultFiles";

            # The trx file name matches the folder that holds the coverage reports we want
            $trxFile = Get-ChildItem *.trx;
            $trxFileName = [System.IO.Path]::GetFileNameWithoutExtension($trxFile);

            # Copy the coverage reports out of that folder into ResultFiles
            Push-Location $trxFileName;
            $coverageFiles = Get-ChildItem -Recurse -Filter coverage.*.xml;
            foreach ($coverageFile in $coverageFiles)
            {
              Copy-Item $coverageFile $resultFiles.FullName;
            }
            Pop-Location;
            Pop-Location;
          displayName: Copy Test Files
    
        - task: PublishTestResults@2
          inputs:
            testResultsFormat: 'VSTest' # the trx logger produces VSTest-format results
            testResultsFiles: '$(Agent.TempDirectory)/*.trx'
    
        - task: PublishCodeCoverageResults@1
          condition: true # always try to publish coverage results, even if unit tests fail
          inputs:
            codeCoverageTool: 'cobertura'
            summaryFileLocation: '$(Agent.TempDirectory)/ResultFiles/**/coverage.cobertura.xml'

    I ran the dotnet test command with custom arguments to log to trx and set my own results directory. The PowerShell script uses the name of the trx file to find the coverage files and copies them to a ResultFiles folder. Then I added tasks to publish the test results and code coverage results to the Azure DevOps pipeline.

    Pushing coverage results to SonarQube

    I admittedly spent a lot of time chasing down what, in reality, was a very simple change:

      - ${{ if eq(parameters.execute_sonar, true) }}:
        # Prepare Analysis Configuration task
        - task: SonarQubePrepare@5
          inputs:
            SonarQube: ${{ parameters.sonar_endpoint_name }}
            scannerMode: 'MSBuild'
            projectKey: ${{ parameters.sonar_project_key }}
            extraProperties: |
              sonar.cs.opencover.reportsPaths="$(Agent.TempDirectory)/ResultFiles/coverage.opencover.xml"
    
    ### Build and test here
    
      - ${{ if eq(parameters.execute_sonar, true) }}:
        - powershell: |
            $params = "$env:SONARQUBE_SCANNER_PARAMS" -replace '"sonar.branch.name":"[\w,/,-]*"\,?'
            Write-Host "##vso[task.setvariable variable=SONARQUBE_SCANNER_PARAMS]$params"
        # Run Code Analysis task
        - task: SonarQubeAnalyze@5
        # Publish Quality Gate Result task
        - task: SonarQubePublish@5
          inputs:
            pollingTimeoutSec: '300'
    

    That’s literally it. I experimented with a lot of different settings, but, in the end, all it took was setting sonar.cs.opencover.reportsPaths in the extraProperties input of the SonarQubePrepare task.

    SonarQube Success!

    Sample SonarQube Report

    In the small project that I tested, I was able to get analysis and code coverage published to my SonarQube instance. Unfortunately, this means that I now have technical debt to fix and unit tests to write in order to improve my code coverage, but, overall, this was a very successful venture.

  • Tech Tips – Adding Linting to C# Projects

    Among the JavaScript/TypeScript community, ESLint and Prettier are very popular ways to enforce standards and formatting within your code. In trying to find similar functionality for C#, I did not find anything as ubiquitous as ESLint/Prettier, but there are some front-runners.

    Roslyn Analyzers and Dotnet Format

    John Reilly has a great post on enabling Roslyn Analyzers in your .NET applications. He also posted some instructions on using the dotnet format tool as a “Prettier for C#” tool.

    I will not bore you by re-hashing his posts, but following them allowed me to apply some basic formatting and linting rules to my projects. Additionally, the Roslyn Analyzers can be made to generate build warnings and errors, so any build worth its salt (one that fails on warnings) will keep undesirable code out.

    SonarLint

    I was not really content to stop there, and a quick Google search led me to an interesting article around linting options for C#. One of those was SonarLint. While SonarLint bills itself as an IDE plugin, it has a Roslyn Analyzer package (SonarAnalyzer.CSharp) that can be added and configured in a similar fashion to the built-in Roslyn Analyzers.

    Following the instructions in the article, I installed SonarAnalyzer and configured it alongside the base Roslyn Analyzers. It produced a few more warnings, particularly around some best practices from Sonar that go beyond what the Microsoft analyzers enforce.

    SonarQube, my old friend

    Getting into SonarLint brought me back to SonarQube. What seems like forever ago, but really was only a few years ago, SonarQube was something of a go-to tool in my position. We had hoped to gather a portfolio-wide view of our bugs, vulnerabilities, and code smells. For one reason or another, we abandoned that particular tool set.

    After putting SonarLint in place, I was interested in jumping back in, at least in my home lab, to see what kind of information I could get out of Sonar. I found the Kubernetes instructions and got to work setting up a quick instance on my production cluster, alongside my ProGet instance.

    Once installed, I have to say, the application has done well to improve the user experience. Tying into my Azure DevOps instance was quick and easy, with very good in-application tutorials for that configuration. I set up a project based on the pipeline for my test application, made my pipeline changes, and waited for results…

    Failed! I kept getting errors about not being allowed to set the branch name in the Community Edition. That is fair, and for my projects, I only really need analysis on the main branch, so I set up analysis to only happen on builds of main. Failed again!

    There seems to be a known issue around this, but thanks to the SonarSource community, I found a workaround for my pipeline. With that in place, I had my code analysis in place, but, well, what do I do with it? Well, I can add quality gates to fail builds based on missing code coverage, tweak my rule sets, and have a “portfolio wide” view of my private projects.

    Setting the Standard

    For any open source C# projects, simply building the linting/formatting into the build/commit process might be enough. If project maintainers are so inclined, they can add their projects to SonarCloud and get the benefits of SonarQube (including adding quality gates).

    For enterprise customers, the move to a paid tier depends on how much visibility you want in your code base. Sonar can be an expensive endeavor, but provides a lot of quality and tech debt tracking that you may find useful. My suggestion? Start with a trial or the community version, and see if you like it before you start requesting budget.

    Either way, setting standards for formatting and analysis on your C# projects makes contributions across teams much easier and safer. I suggest you try it!

  • Deprecating Microsoft Teams Notifications

    My first “owned” open source project was a TeamCity plugin to send notifications to Microsoft Teams based on build events in TeamCity. It was based on a similar TeamCity plugin for Slack.

    Why? Well, out of necessity. Professionally, we were migrating to using MS Teams, and we wanted functionality to post messages when builds failed/succeeded. So I copied the Slack notifier, made the requisite changes, and it worked well enough to publish. I even went the extra mile of adding some GitHub Actions to build and deploy, so that I could fix Dependabot security issues quickly.

    The plugin is currently published in JetBrains’ plugin repository.

    The Sun Always Sets

    Fast-forward 5 years: both professionally and personally I have moved towards Azure DevOps / GitHub Actions for building. Why? Well, the core of the two is essentially the same, as Microsoft has melded them together. For open source projects on GitHub, it is a de facto standard, and for my lab instance of Azure DevOps, well, it makes transitioning lab work to professional recommendations much easier. But none of this uses TeamCity.

    Additionally, I have spent the majority of my professional career in C/C++/C#. Java is not incredibly different at its core, but add in Maven, Spring, and the other tag-alongs that come with TeamCity plugin development, and I was well out of my league. And while I have expanded into the various Javascript languages and frameworks, I have never had a reason to dive into Java to learn.

    So, with that, I am officially deprecating this plugin. Truthfully, I have not done much in the repository recently, so this should not be a surprise. However, I wanted to formally do this so that anyone who may want to take it over (or start over, if they so desire) can do so. I will gladly turn over ownership of the code to someone willing to spend their time to improve it.

    To those who use the plugin: I appreciate all of the support from the community, and I apologize for not doing this sooner: perhaps someone will take the reins and bring the plugin up to the state it deserves.

    Thanks!

  • Information Overload: When too much data becomes a problem

    I have had this post in a draft for almost a month now. I had planned to include statistics around the amount of data that humans are generating (it is a lot) and how we are causing some of our own problems by having too much data at our fingertips.

    What I realized is, a lengthy post about information overload is, well, somewhat oxymoronic. If you would like to learn about the theory, check it out. We are absolutely generating more data than could possibly be used. This came to the forefront as I investigated my metrics storage in my Grafana Mimir instance.

    I got a lot of… data

    Right now, I’m collecting over 300,000 series worth of data. That means, there are about 300,000 unique streams of data for which I have a data point roughly every 30 seconds. On average, it is taking up 35 GB worth of disk space per month.

    How many of those do I care about? Well, as of this moment, about 7. I have some alerts to monitor when applications are degraded, when I’m dropping logs, when some system temperatures get too high, and when my Ring Doorbell battery is low.

    Now, I continue to find alerts to write that are helpful, so I anticipate expanding beyond 7. However, there is almost no way that I am going to have alerts across 300,000 series: I simply do not care about some of this data. And yet, I am storing it, to the tune of about 35 GB worth of data every month.

    What to do?

    For my home lab, the answer is relatively easy: I do not care about data outside of 3 months, so I can set up retention rules and clean some of this up. But, in business, retention rules become a question of legal and contractual obligations.
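    As a concrete example, in Mimir that cleanup boils down to a retention limit for the compactor. A minimal sketch, assuming a 90-day window in the Mimir configuration (or the equivalent section of the Helm chart values):

    # Mimir configuration sketch: let the compactor delete blocks older than ~3 months
    limits:
      compactor_blocks_retention_period: 90d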

    In other words, in business, not only are we generating a ton of data, but we can be penalized for not having the data that we generated, or even, not generating the appropriate data, such as audit histories. It is very much a downward spiral: the more we generate, the more we must store, which leads to larger and larger data stores.

    Where do we go from here?

    We are overwhelming ourselves with data, and it is arguably causing problems across business, government, and general interpersonal culture. The problem is not getting any better, and there really is not a clear solution. All we can do is attempt to be smart data consumers. So before you take that random Facebook ad as fact, maybe do a little more digging to corroborate. In the age where anyone can be a journalist, everyone has to be a journalist.

  • Pulling metrics from Home Assistant into Prometheus

    I have setup an instance of Home Assistant as the easiest front end for interacting with my home automation setup. While I am using the Universal Devices ISY994 as the primary communication hub for my Insteon devices, Home Assistant provides a much nicer interface for my family, including a great mobile app for them to use the system.

    With my foray into monitoring, I started looking around to see if I was able to get some device metrics from Home Assistant into my Grafana Mimir instance. Turns out, there is a Prometheus integration built right in to Home Assistant.

    Read the Manual

    Most of my blog posts are “how to” style: I find a problem that maybe I could not find an exact solution for online, and walk you through the steps. In this case, though, it was as simple as reading the configuration instructions for the Prometheus integration.
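    For anyone following along, turning the integration on is a configuration.yaml entry. A minimal sketch; the filter is optional, and the domains listed are just examples of what I might narrow it down to:

    # configuration.yaml: expose Home Assistant metrics at /api/prometheus
    prometheus:
      filter:
        include_domains:
          - sensor
          - cover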

    ServiceMonitor?

    Well, almost that easy. I have been using ServiceMonitor resources within my clusters, rather than setting up explicit scrape configs. Generally, this is easier to manage, since I just install the Prometheus operator, and then create ServiceMonitor instances when I want Prometheus to scrape an endpoint.

    The Home Assistant Prometheus endpoint requires a token, however, and I did not have the desire to dig into configuring a ServiceMonitor with an appropriate secret. For now, it is a to-do on my ever-growing list.
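    If I ever do cross it off the list, I expect it to look roughly like the sketch below. The names, namespace, and port are hypothetical, but bearerTokenSecret is the prometheus-operator field I would reach for to supply the long-lived access token:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: home-assistant              # hypothetical name
      namespace: home-automation        # hypothetical namespace
    spec:
      selector:
        matchLabels:
          app: home-assistant           # must match the labels on the Home Assistant Service
      endpoints:
        - port: http                    # named port on the Service (hypothetical)
          path: /api/prometheus         # Home Assistant's Prometheus endpoint
          bearerTokenSecret:            # long-lived access token stored in a Kubernetes Secret
            name: home-assistant-token
            key: token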

    What can I do now?

    This integration has opened up a LOT of new alerts on my end. Home Assistant talks to many of the devices in my home, including lights and garage doors. This means I can write alerts for when lights go on or off, when the garage door goes up or down, and, probably the best, when devices are reporting low battery.

    The first alert I wrote was to alert me when my Ring Doorbell battery drops below 30%. Couple that with my Prometheus Alerts module for Magic Mirror, and I now get a display when the battery needs changing.
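    For the curious, the rule itself is ordinary Prometheus alerting YAML loaded into the Mimir ruler. The sketch below assumes the battery level shows up as a homeassistant_sensor_battery_percent series; check your own /api/prometheus output for the exact metric and labels your instance exposes:

    groups:
      - name: home-assistant
        rules:
          - alert: RingDoorbellBatteryLow
            # The metric name and entity label here are assumptions about the Home Assistant export
            expr: homeassistant_sensor_battery_percent{entity=~".*doorbell.*"} < 30
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Ring Doorbell battery is below 30%"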

    What’s Next?

    I am giving back to the community. The Prometheus integration for Home Assistant does not currently report cover statuses. Covers are things like shades or, in my case, garage doors. Since I would like to be able to alert when the garage door is open, I am working on a pull request to add cover support to the Prometheus integration.

    It also means I would LOVE to get my hands on some automated shades/blinds… but that sounds really expensive.

  • Bruce Lee to the Rescue! Health Checks for .NET Worker Services

    As we start to develop more containers that are being run in Kubernetes, we encounter non-http workloads. I came across a workload that represents a non-http processor for queued events. In .NET, I used the IHostedService offerings to run a simple service in a container to do this work.

    However, when it came time to deploy to Kubernetes, I quickly realized that my standard liveness/health checks would not work for this container. I searched around, and the HealthChecks libraries are limited to ASP.NET Core. Not wanting to bloat my image, I looked for some alternatives. My Google searches led me to Bruce Lee.

    No, not Bruce Lee the actor, but Bruce Lee Harrison. Bruce published a library called TinyHealthCheck, which provides the ability to add lightweight health check endpoints without dragging in the entire set of ASP.NET Core libraries.

    While it seems a pretty simple concept, it solved an immediate need of mine with minimal effort. Additionally, there was a sample and documentation!
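    The payoff on the Kubernetes side is that the worker's Deployment can use a plain httpGet probe against that lightweight endpoint. The path and port below are whatever you configure for the health check host, not anything mandated by the library:

    # Deployment container snippet: probe the lightweight health endpoint exposed by the worker
    livenessProbe:
      httpGet:
        path: /healthz     # assumed path for the health check endpoint
        port: 8080         # assumed port the health check host listens on
      initialDelaySeconds: 10
      periodSeconds: 30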

    Why call this out? Many developers use open source software to solve these types of problems, and I feel as though the authors deserve a little publicity for their efforts. So, thanks to the contributors to TinyHealthCheck; I will certainly watch this repository and contribute as I can.

  • MMM-PrometheusAlerts: Display Alerts in Magic Mirror

    I have had MagicMirror running for about a year now, and I love having it in my office. A quick glance gives my family and me a look at information that is relevant for the days ahead. As I continue my dive into Prometheus for monitoring, it occurred to me that I might be able to create a new module for displaying Prometheus Alerts.

    Current State

    Presently, my Magic Mirror configuration uses the following modules:

    Creating the Prometheus Alerts module

    In recent weeks, my experimentation with Mimir has led me to write some alerts to keep tabs on things in my Kubernetes cluster and, well, the overall health of my systems. Currently, I have a personal Slack team with an alerts channel, and that has been working nicely. However, as I stared at my office panel, it occurred to me that there should be a way to gather these alerts and show them in Magic Mirror.

    Since Grafana Mimir is Prometheus-compatible, I should be able to use the Prometheus APIs to get alert data. A quick Google search yielded the HTTP API for Prometheus.

    With that in hand, I copied the StatusPage IO module’s code and got to work. In many ways, the Prometheus alerts are simpler than the Status Page data, since they come back as a single collection of alerts with labels and annotations. So I stripped out some of the extra handling for Status Page components, renamed a few things, and after some debugging, I have a pretty good MVP.

    What’s next?

    It’s pretty good, but not perfect. I started adding some issues to the GitHub repository for things like message templating and authentication, and when I get around to adding authentication to Grafana Mimir and Loki, well, I’ll probably need to update the module.

    Watch the GitHub repository for changes!

  • Lessons in Managing my Kubernetes Cluster: Man Down!

    I had a bit of a panic this week as routine tasks took me down a rabbit hole in Kubernetes. The more I manage my home lab clusters, the more I realize I do not want to be responsible for bare metal clusters at work.

    It was a typical upgrade…

    With ArgoCD in place, the contents of my clusters are very neatly defined in my various infrastructure repositories. I even have a small PowerShell script that checks for the latest versions of the tools I have installed and updates my repositories with the latest and greatest.

    I ran that script today and noticed a few minor updates to some of my charts, so I figured I would apply those at lunch. Typically, this is a very smooth process, and the updates are applied in a few minutes.

    However, after having ArgoCD sync, I realized that the Helm charts I was upgrading were stuck in a “Progressing” state. I checked the pods, and they were in a perpetual “Pending” state.

    My typical debugging found nothing: there were no events around the pod being unable to be scheduled for any particular reason, and there was no log file, since the pods were not even being created.

    My first Google searches suggested a problem with persistent volumes/persistent volume claims. So I poked around in those, going so far as deleting them (after backing up the folders in my NFS target), but, well, no luck.

    And there it was…

    As the pods were “unschedulable,” I started trying to find the logs for the scheduler. I could not find them. So I logged in to my control plane node to see if the scheduler was even running… It was not.

    I checked the logs for that container, and nothing stood out. It just kind of “died.” I tried restarting the Docker container…. Nope. I even tried re-running the rke up command for that cluster, to see if Rancher Kubernetes Engine would restart it for me…. Nope.

    So, running out of options, I changed my cluster.yml configuration file to add a control plane role to another node in the cluster and re-ran rke up. And that, finally, worked. At that point, I removed the control plane role from the old control plane node and modified the DNS entries to point my hostnames to the new control plane node. With that, everything came back up.
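    For context, the fix amounted to giving another node the controlplane role in cluster.yml and re-running rke up. A trimmed sketch with made-up addresses:

    # cluster.yml sketch (RKE): promote an existing worker to also run the control plane
    nodes:
      - address: 10.0.0.11              # hypothetical: the misbehaving control plane node, role removed afterward
        user: rke
        role: [controlplane, etcd]
      - address: 10.0.0.12              # hypothetical: existing worker picking up the controlplane role
        user: rke
        role: [controlplane, worker]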

    Monitoring?

    I wanted to write an alert in Mimir to catch this so I would know about it before I dug around. It was at this point I realized that I am not collecting any metrics from the Kubernetes components themselves. And, frankly, I am not sure how. RKE installs metrics-server by default, but I have not found a way to scrape metrics from Kubernetes components.

    My Google searches have been fruitless, and it has been a long work week, so this problem will most likely have to wait for a bit. If I come up with something I will update this post.

  • Hitting for the cycle…

    Well, I may not be hitting for the cycle, but I am certainly cycling Kubernetes nodes like it is my job. The recent OpenSSL security patches got me thinking that I need to cycle my cluster nodes.

    A Quick Primer

    In Kubernetes, a “node” is, well, a machine performing some bit of work. It could be a VM or a physical machine, but, at its core, it is a machine with a container runtime of some kind that runs Kubernetes workloads.

    Using the Rancher Kubernetes Engine (RKE), each node can have one or more of the following roles:

    • worker – The node can host user workloads
    • etcd – The node is a member of the etcd storage cluster
    • controlplane – The node is a member of the control plane

    Cycle? What and why

    When I say “cycling” the nodes, I am actually referring to the process of provisioning a new node, adding the new node to the cluster, and removing an old node from the cluster. I use the term “cycling” because, when the process is complete, my cluster is “back where it started” in terms of resourcing, but with a fresh and updated node.

    But, why cycle? Why not just run the necessary security patches on my existing nodes? In my view, even at the home lab level, there are a few reasons for this method.

    A Clean and Consistent Node

    As nodes get older, they incur the costs of age. They may have some old container images from previous workloads, or some cached copies of various system packages. In general, they collect stuff, and a fresh node has none of that cost. By provisioning new nodes, we can ensure that the latest provisioning is run and all the necessary updates are installed.

    Using newly provisioned nodes each time also prevents me from applying special configuration to individual nodes. If I need a particular configuration on a node, well, it has to be done in the provisioning scripts. All my cluster nodes are provisioned the same way, which makes them much more like cattle than pets.

    No Downtime or Capacity Loss

    Running node maintenance can potentially require a system reboot or service restarts, which can trigger downtime. In order to prevent downtime, it is recommended that nodes be “cordoned” (prevent new workload from being scheduled on that node) and “drained” (remove workloads from the node).

    Kubernetes is very good at scheduling and moving workloads between nodes, and there is built-in functionality for cordon and draining of nodes. However, to prevent downtime, I have to remove a node for maintenance, which means I lose some cluster capacity during maintenance.

    When you think about it, to do a zero-downtime update, your choices are:

    1. Take a node out of the cluster, upgrade it, then add it back
    2. Add a new node to the cluster (already upgraded) then remove the old node.

    So, applying the “cattle” mentality to our infrastructure, it is preferred to have disposable assets rather than precisely manicured assets.

    RKE performs this process for you when you remove a node from the cluster. That means running rke up will remove old nodes if they have been removed from your cluster.yml.

    To Do: Can I automate this?

    As of right now, this process is pretty manual, and goes something like this:

    1. Provision a new node – This part is automated with Packer, but I really only like doing one at a time. I am already running 20+ VMs on my hypervisor; I do not like the thought of spiking to 25+ just to cycle a node.
    2. Remove nodes, one role at a time – I have found that RKE is most stable when you only remove one role at a time. What does that mean? If I have a node that is running all three roles, I need to remove the control plane role, then run rke up. Then remove the etcd role, and run rke up again. Then remove the node completely. I have not had good luck simply removing a node with all three roles.
    3. Ingress Changes – I need to change the IPs on my cluster in two places:
      • In my external load balancer, which is a Raspberry Pi running nginx.
      • In my nginx-ingress installation on the cluster. This is done via my GitOps repository.
    4. DNS Changes – I have aliases set up for the control plane nodes so that I can swap them in and out easily without changing other configurations. When I cycle a control plane node, I need to update the DNS.
    5. Delete Old Node – I have a small PowerShell script for this, but it is another step.

    It would be wonderful if I could automate this into an Azure DevOps pipeline, but there are some problems with that approach.

    1. Packer’s Hyper-V builder has to run on the host machine, which means I need to be able to execute the Packer build commands on my server. I’m not about to put the Azure DevOps agent directly on my server.
    2. I have not found a good way to automate DNS changes, outside of using the PowerShell module.
    3. I need to automate the IP changes in the external proxy and the cluster ingress. Both of which are possible but would require some research on my end.
    4. I would need to automate the RKE actions, specifically, adding new nodes and deleting roles/nodes from the cluster, and then running rke up as needed.

    None of the above is impossible, however, it would require some effort on my part to research some techniques and roll them up into a proper pipeline. For the time being, though, I have set my “age limit” for nodes at 60 days, and will continue the manual process. Perhaps, after a round or two, I will get frustrated enough to start the automation process.

  • What’s in a (Release) Name?

    With Rancher gone, one of my clusters was dedicated to running Argo and my standard cluster tools. Another cluster has now become home for a majority of the monitoring tools, including the Grafana/Loki/Mimir/Tempo stack. That second cluster was running a little hot in terms of memory and CPU. I had 6 machines running what 4-5 machines could be doing, so a consolidation effort was in order. The process went fairly smoothly, but a small hiccup threw me for a frustrating loop.

    Migrating Argo

    Having configured Argo to manage itself, I was hoping that a move to a new cluster would be relatively easy. I was not disappointed.

    For reference:

    • The ops cluster is the cluster that was running Rancher and is now just running Argo and some standard cluster tools.
    • The internal cluster is the cluster that is running my monitoring stack and is the target for Argo in this migration

    Migration Process

    • I provisioned two new nodes for my internal cluster, and added them as workers to the cluster using the rke command line tool.
    • To prevent problems, I updated Argo CD to the latest version and disabled auto-sync on all apps.
    • Within my ops-argo repository, I changed the target cluster of my internal applications to https://kubernetes.default.svc (see the sketch just after this list). Why? Argo treats the cluster that it is installed on a bit differently than external clusters. Since I was moving Argo onto my internal cluster, the references to my internal cluster needed to change.
    • I exported my ArgoCD resources using the Argo CD CLI. I was not planning on using the import, however, because I wanted to see how this process would work.
    • I modified my external Nginx proxy to point my Argo address from the ops cluster to the internal cluster. This essentially locked everything out of the old Argo instance, but let it run on the cluster should I need to fall back.
    • Locally, I ran helm install argocd . -n argo --create-namespace from my ops-argo repository. This installed Argo on my internal cluster in the argo namespace. I grabbed the newly generated admin password and saved it in my password store.
    • Locally, I ran helm install argocd-apps . -n argo from my ops-argo repository. This installs the Argo Application and Project resources which serve as the “app of apps” to bootstrap the rest of the applications.
    • I re-added my nonproduction and production clusters to be managed by Argo using the argocd cluster add command. As the URLs for the cluster were the same as they were in the old cluster, the apps matched up nicely.
    • I re-added the necessary labels to each cluster’s ArgoCD secret. This allows the cluster generator to create applications for my external cluster tools. I detailed some of this in a previous article, and the ops-argo repository has the ApplicationSet definitions to help.
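    The destination change mentioned above is a one-line affair in each Application spec. A trimmed sketch with hypothetical names and a placeholder repo URL:

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: kube-prometheus                        # hypothetical application name
      namespace: argo
    spec:
      project: cluster-tools                       # hypothetical project
      destination:
        server: https://kubernetes.default.svc     # "this" cluster, now that Argo lives on internal
        namespace: monitoring
      source:
        repoURL: https://example.com/ops-argo.git  # placeholder repo URL
        path: charts/kube-prometheus
        targetRevision: main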

    And that was it… Argo found the existing applications on the other clusters, and after some re-sync to fix some labels and annotations, I was ready to go. Well, almost…

    What happened to my metrics?

    Some of my metrics went missing. Specifically, many of those that were coming from the various exporters in my Prometheus install. When I looked at Mimir, the metrics just stopped right around the time of my Argo upgrade and move. I checked the local Prometheus install, and noticed that those targets were not part of the service discovery page.

    I did not expect Argo’s upgrade to change much, so I did not take notice of the changes. So, I was left digging around in my Prometheus and ServiceMonitor resources to figure out why they were not showing up.

    Sadly, this took way longer than I anticipated. Why? There were a few reasons:

    1. I have a home improvement project running in parallel, which means I have almost no time to devote to researching these problems.
    2. I did not have Rancher! One of the things I did use Rancher for was to easily view the different resources in the cluster and compare. I finally remembered that I could use Lens for a GUI into my clusters, but sadly, it was a few days into the ordeal.
    3. Everything else was working; I was just missing a few metrics. On my own personal severity level, this was low. On the other hand, it drove my OCD crazy.

    Names are important

    After a few days of hunting around, I realized that the ServiceMonitor’s matchLabels did not match the labels on the Service objects themselves. Which was odd, because I had not changed anything in those charts, and they are all part of the Bitnami kube-prometheus Helm chart that I am using.

    As I poked around, I realized that I was using the releaseName property on the Argo Application spec. After some searching, I found the disclaimer on Argo’s website about using releaseName. As it turns out, this disclaimer describes exactly the issue that I was experiencing.

    I spent about a minute seeing if I could fix it while keeping the releaseName properties on my cluster tools, but I quickly realized that the easiest path was simply to remove the releaseName property from the cluster tools that used it. That follows the guidance that Argo suggests, and keeps my configuration files much cleaner.
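    For reference, the override in question lives under the Helm section of the Application source. A sketch of what I removed (the release name here is hypothetical):

    spec:
      source:
        helm:
          releaseName: kube-prometheus   # this override made the rendered labels diverge; removing it
                                         # lets Argo fall back to using the Application name as the release name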

    After removing that override, the ServiceMonitor resources were able to find their associated services, and showed up in Prometheus’ service discovery. With that, now my OCD has to deal with a gap in metrics…

    Missing Node Exporter Metrics