What’s in a (Release) Name?

With Rancher gone, one of my clusters was dedicated to running Argo and my standard cluster tools. Another cluster has now become home for a majority of the monitoring tools, including the Grafana/Loki/Mimir/Tempo stack. That second cluster was running a little hot in terms of memory and CPU. I had 6 machines running what 4-5 machines could be doing, so a consolidation effort was in order. The process went fairly smoothly, but a small hiccup threw me for a frustrating loop.

Migrating Argo

Having configured Argo to manage itself, I was hoping that a move to a new cluster would be relatively easy. I was not disappointed.

For reference:

  • The ops cluster is the cluster that was running Rancher and is now just running Argo and some standard cluster tools.
  • The internal cluster is the cluster that is running my monitoring stack and is the target for Argo in this migration.

Migration Process

  • I provisioned two new nodes for my internal cluster, and added them as workers to the cluster using the rke command line tool.
  • To prevent problems, I updated Argo CD to the latest version and disabled auto-sync on all apps.
  • Within my ops-argo repository, I changed the target cluster of my internal applications to https://kubernetes.default.svc. Why? Argo treats the cluster that it is installed on a bit differently than external clusters. Since I was moving Argo onto my internal cluster, the references to my internal cluster needed to change.
  • I exported my Argo CD resources using the Argo CD CLI. I was not planning on importing them, however, because I wanted to see how this process would work without the import.
  • I modified my external Nginx proxy to point my Argo address from the ops cluster to the internal cluster. This essentially locked everything out of the old Argo instance, but let it run on the cluster should I need to fall back.
  • Locally, I ran helm install argocd . -n argo --create-namespace from my ops-argo repository. This installed Argo on my internal cluster in the argo namespace. I grabbed the newly generated admin password and saved it in my password store.
  • Locally, I ran helm install argocd-apps . -n argo from my ops-argo repository. This installs the Argo Application and Project resources which serve as the “app of apps” to bootstrap the rest of the applications.
  • I re-added my nonproduction and production clusters to be managed by Argo using the argocd cluster add command. Because the cluster URLs were the same as they were in the old Argo instance, the apps matched up nicely.
  • I re-added the necessary labels to each cluster’s Argo CD secret. This allows the cluster generator to create applications for my external cluster tools. I detailed some of this in a previous article, and the ops-argo repository has the ApplicationSet definitions to help.

And that was it… Argo found the existing applications on the other clusters, and after a few re-syncs to fix some labels and annotations, I was ready to go. Well, almost…

What happened to my metrics?

Some of my metrics went missing, specifically many of those coming from the various exporters in my Prometheus install. When I looked at Mimir, the metrics just stopped right around the time of my Argo upgrade and move. I checked the local Prometheus install and noticed that those targets were missing from the service discovery page.

I did not expect Argo’s upgrade to change much, so I had not paid attention to what changed. That left me digging through my Prometheus and ServiceMonitor instances to figure out why the targets were not showing up.

Sadly, this took way longer than I anticipated. Why? There were a few reasons:

  1. I have a home improvement project running in parallel, which means I have almost no time to devote to researching these problems.
  2. I did not have Rancher! One of the things I did use Rancher for was to easily view the different resources in the cluster and compare. I finally remembered that I could use Lens for a GUI into my clusters, but sadly, it was a few days into the ordeal.
  3. Everything else was working; I was just missing a few metrics. On my own personal severity scale, this was low. On the other hand, it drove my OCD crazy.

Names are important

After a few days of hunting around, I realized that the ServiceMonitor’s matchLabels did not match the labels on the Service objects themselves. That was odd, because I had not changed anything in those charts, and they are all part of the Bitnami kube-prometheus Helm chart that I am using.
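To make the mismatch concrete, here is a minimal Python sketch of how a matchLabels selector is evaluated: a Service is selected only if every key/value pair in the selector also appears in its labels. The label values are hypothetical and are not copied from my cluster or from the Bitnami chart.

```python
def selector_matches(match_labels, labels):
    """Return True when every key/value pair in match_labels appears in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


# Hypothetical values for illustration only.
service_monitor_selector = {
    "app.kubernetes.io/name": "node-exporter",
    "app.kubernetes.io/instance": "prometheus",
}
service_labels = {
    "app.kubernetes.io/name": "node-exporter",
    "app.kubernetes.io/instance": "cluster-tools-prometheus",
}

# One differing value is enough for the Service to be skipped.
print(selector_matches(service_monitor_selector, service_labels))  # False
```

A single differing label value is enough for the ServiceMonitor to select nothing, which would explain targets quietly dropping out of service discovery.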

As I poked around, I realized that I was using the releaseName property on the Argo Application spec. After some searching, I found the disclaimer on Argo’s website about using releaseName. As it turns out, this disclaimer describes exactly the issue that I was experiencing.
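As I understand that disclaimer, Argo CD tracks resources by setting the app.kubernetes.io/instance label to the Application name, while a chart that uses the same label fills it with the Helm release name. The following Python sketch simulates that interaction under those assumptions. The functions are stand-ins I made up for illustration, not Helm or Argo CD APIs, and the application and release names are hypothetical.

```python
import copy

INSTANCE_LABEL = "app.kubernetes.io/instance"


def render_chart(release_name):
    """Stand-in for Helm templating: the release name lands in labels and selectors."""
    service = {
        "kind": "Service",
        "metadata": {"labels": {INSTANCE_LABEL: release_name}},
    }
    monitor = {
        "kind": "ServiceMonitor",
        "metadata": {"labels": {INSTANCE_LABEL: release_name}},
        "spec": {"selector": {"matchLabels": {INSTANCE_LABEL: release_name}}},
    }
    return [service, monitor]


def apply_tracking_label(resources, app_name):
    """Stand-in for Argo CD tracking: overwrite metadata.labels, leave spec alone."""
    tracked = copy.deepcopy(resources)
    for resource in tracked:
        resource["metadata"]["labels"][INSTANCE_LABEL] = app_name
    return tracked


def monitor_finds_service(resources):
    """Check whether the ServiceMonitor selector still matches the Service labels."""
    service, monitor = resources
    wanted = monitor["spec"]["selector"]["matchLabels"]
    have = service["metadata"]["labels"]
    return all(have.get(key) == value for key, value in wanted.items())


app_name = "internal-prometheus"  # hypothetical Argo Application name

# With a releaseName override, the release and Application names differ.
overridden = apply_tracking_label(render_chart("prometheus"), app_name)
print(monitor_finds_service(overridden))  # False

# Without the override, the release name is the Application name.
default = apply_tracking_label(render_chart(app_name), app_name)
print(monitor_finds_service(default))  # True
```

In this sketch the selector inside the ServiceMonitor spec keeps the release name, while the label on the Service is overwritten with the Application name, so the two only agree when the names are identical.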

I spent about a minute seeing whether I could fix it while keeping the releaseName properties on my cluster tools, then realized that the easiest path was to remove the property from the tools that used it. That follows the guidance that Argo suggests, and keeps my configuration files much cleaner.

After removing that override, the ServiceMonitor resources were able to find their associated services, and showed up in Prometheus’ service discovery. With that, now my OCD has to deal with a gap in metrics…

Missing Node Exporter Metrics