Author: Matt

  • A Lesson in Occam’s Razor: Configuring Mimir Ruler with Grafana

    Occam’s Razor posits “Of two competing theories, the simpler explanation is to be preferred.” I believe my high school biology teacher taught the “KISS” method (Keep It Simple, Stupid) to convey a similar principle. As I was trying to get alerts set up in Mimir using the Grafana UI, I came across an issue that was only finally solved by going back to the simplest answer.

    The Original Problem

    One of the reasons I leaned towards Mimir as a single source of metrics is that it has its own ruler component. This would allow me to store alerts in, well, two places: the Mimir ruler for metrics rules, and the Loki ruler for log rules. Sure, I could use Grafana alerts, but I like being able to run the alerts in the native system.

    So, after getting Mimir running, I went to add a new alert in Grafana. Neither my Mimir nor Loki data source showed up.

Now, as it turns out, the Loki one was easy: I had not enabled the Loki ruler component in the Helm chart. So, a quick update of the Loki configuration in my GitOps repo, and the Loki ruler was running. However, my Mimir data source was still missing.
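For anyone hitting the same wall, the fix is essentially a one-line change. A minimal sketch, assuming the loki-distributed chart (other Loki charts use different keys):

```yaml
# Minimal sketch: the ruler component is off by default in the loki-distributed chart,
# which is why Loki never showed up as an alerting data source.
ruler:
  enabled: true
  replicas: 1
```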

    Asking for help

I looked for some time at possible solutions, and posted to the GitHub discussion looking for assistance. Dimitar Dimitrov was kind enough to step in and give me some pointers, including looking at the Chrome network tab to figure out if there were any errors.

At that point, I wanted to bury my head in the sand: of all the things I had looked at, the network tab was not one of them. But, after getting over my embarrassment, I went about debugging. There were errors about the Access-Control-Allow-Origin header not being present, along with a 405 response when the browser attempted a preflight OPTIONS request.

Dimitar suggested that I add that Access-Control-Allow-Origin header via the Mimir Helm chart’s nginx.nginxConfig section. So I did, but I still had the same problems.
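For reference, that change amounts to adding CORS headers to the gateway’s server block. A rough (and, as it turned out, unnecessary) sketch: nginx.nginxConfig replaces the whole gateway config in the mimir-distributed chart, so in practice you copy the chart’s default template and add the headers; the origin shown here is illustrative.

```yaml
nginx:
  nginxConfig: |
    server {
      listen 8080;
      # ...the rest of the chart's default server block...
      add_header Access-Control-Allow-Origin "https://grafana.mydomain.local" always;
      add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
      add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;
    }
```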

    I had previously tried this on Microsoft Edge and it had worked, and had been assuming it was just Chrome being overly strict. However, just for fun, I tried in a different Chrome profile, and it worked…

    Applying Occam’s Razor

    It was at this point that I thought “There is no way that something like this would not have come up with Grafana/Mimir on Chrome.” I would have expected a quick note around the Access-Control-Allow-Origin or something similar when hosting Grafana in a different subdomain than Mimir. So I took a longer look at the network log in my instance of Chrome that was still throwing errors.

There was a redirect in there that was listed as “cached.” I thought, well, that’s odd, why would it cache that redirect? So, as a test, I disabled the cache in the Chrome debugging tools, refreshed the page, and voilà! Mimir showed up as a data source for alerts.

Looking at the resulting successful calls, I noted that ALL of them were proxied through grafana.mydomain.local, which made me think, “Do I really even need the Access-Control-Allow-Origin headers?” So, I removed those headers from my Mimir configuration, re-deployed, and tested with caching disabled. It worked like a champ.

    What happened?

The best answer I can come up with is that, at some point in my clicking around Grafana with Mimir as a data source, Chrome got a 301 redirect response from https://grafana.mydomain.local/api/datasources/proxy/14/api/v1/status/buildinfo, cached it, and used it in perpetuity. Disabling the cached response fixed everything, without the need to further configure Mimir’s Nginx proxy to return special CORS headers.

    So, with many thanks to Dimitar for being my rubber duck, I am now able to start adding new rules for monitoring my cluster metrics.

  • Walking away from Rancher

I like to keep my home lab running both for hosting sites (like this one) and for experimenting with different tools and techniques. It is pretty nice to be able to spin up some containers in a cluster without fear of disturbing others’ work. And, since I do not hold myself to any SLA commitments, it is typically stress-free.

    When I started my Kubernetes learning, having the Rancher UI at my side made it a little easier to see what was going on in the cluster. Sure, I could use kubectl for that, but having a user interface just made things a little nicer while I learned the tricks of kubectl.

The past few upgrades, though, have not gone smoothly. I believe Rancher changed the way it handles its CRDs, and that has been interfering with some of my Argo-based management of Rancher. But, as I went about researching a fix, I started to think…

    Do I really need Rancher?

I came to realize that, well, I do not use Rancher’s UI nearly as often as I once did. I rely quite heavily on the Rancher Kubernetes Engine (RKE) command line tool to manage my clusters, but I do not really use any of the features that Rancher itself provides. Which features?

• Rancher’s RBAC is pretty useless for me, considering that I am the only one connecting to the clusters, and I use direct connections (not through Rancher’s APIs).
    • Sure, I have single sign-on configured using Azure AD. But I do not think I have logged in to the UI in the last 6 months.
    • With my Argo CD setup, I manage resources by storing the definitions in a Git repository. That means I do not use Rancher for the add-on applications for observability and logging.

    This last upgrade cycle took the cake. I did my normal steps: upgrade the Rancher Helm version, commit, and let Argo do its thing. That’s when things blew up.

First, the new Rancher images are so big that they caused image pull errors due to timeouts. So, after manually pulling the images to the nodes, the containers started up. The logs showed them working on the CRDs; they would get through to the “m’s” of the CRD list before timing out and cycling.

This wreaked havoc on ALL my clusters. I do not know if it was the constant thrashing of the disk or the server CPU, but things just started timing out. Once I realized Rancher was thrashing, and, well, I did not feel like fixing it, I started the process of deleting it.

    Removing Rancher

    After uninstalling the Rancher Helm chart, I ran the Rancher cleanup job that they provide. Even though I am using the Bitnami kube-prometheus chart to install Prometheus on each cluster, the cleanup job deleted all of my ServiceMonitor instances in each cluster. I had to go into Argo and re-sync the applications that were not set to auto-sync in order to recreate the ServiceMonitor instances.

    Benchmarking Before and After

    With Mimir in place, I had collected metrics from those clusters in a single place from both before and after the change. That made creating a few Grafana dashboard panels a lot easier.

    Cluster Memory and CPU – 10/18/2022 09:30 EDT to 10/21/2022 09:30 EDT

    The absolute craziness that started around 10:00 EDT on 10/19 and ended around 10:00 EDT on 10/20 can be attributed to me trying to upgrade Rancher, failing, and then starting to delete Rancher from the clusters. We will ignore that time period, as it is caused by human error.

    Looking at the CPU usage first, there were some unexpected behaviors. I expected a drop in the ops cluster, which was running the Rancher UI, but the drop in CPU in the production cluster was larger than that of the internal cluster. Meanwhile, nonproduction went up, which is really strange. I may dig into those numbers a bit more by individual pod later.

What struck me more was that, across all clusters, memory usage went up. I have no immediate answers for that one, but you can be sure I will be digging in a bit more to figure it out.

    So, while Rancher was certainly using some CPU, it would seem that I am seeing increased memory usage without it. When I have answers to that, you’ll have answers to that.

  • Kubernetes Observability, Part 5 – Using Mimir for long-term metric storage

    This post is part of a series on observability in Kubernetes clusters:

    For anyone who actually reads through my ramblings, I am sure they are asking themselves this question: “Didn’t he say Thanos for long-term metric storage?” Yes, yes I did say Thanos. Based on some blog posts from VMWare that I had come across, my initial thought was to utilize Thanos as my long term metrics storage solution. Having worked with Loki first, though, I became familiar with the distributed deployment and storage configuration. So, when I found that Mimir has similar configuration and compatibility, I thought I’d give it a try.

Additionally, remember, this is for my home lab: I do not have a whole lot of time for configuration and management. With that in mind, using the “Grafana Stack” seemed like an expedient solution.

    Getting Started with Mimir

As with Loki, I started with the mimir-distributed Helm chart that Grafana provides. The charts are well documented, including a Getting Started guide on their website. The Helm chart includes a Minio dependency, but, as I had already set up Minio, I disabled the included chart and configured a new bucket for Mimir.
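A hedged sketch of what that looks like in the mimir-distributed values; the endpoint, bucket name, and credentials shown are illustrative (the real ones live in a secret):

```yaml
minio:
  enabled: false                              # use my existing Minio, not the bundled sub-chart
mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: minio.mydomain.local:9000 # hypothetical Minio endpoint
          access_key_id: mimir                # illustrative only; keep credentials in a secret
          secret_access_key: changeme
    blocks_storage:
      s3:
        bucket_name: mimir-blocks             # the new bucket (name is illustrative)
```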

    As Mimir has all the APIs to stand in for Prometheus, the changes were pretty easy:

1. Get an instance of Mimir installed in my internal cluster.
2. Configure my current Prometheus instances to remote-write to Mimir (a sketch of this follows the list).
3. Add Mimir as a data source in Grafana.
4. Change my Grafana dashboards to use Mimir, and modify those dashboards to filter based on the cluster label that is added.
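For step 2, the change is small if you are on the Bitnami kube-prometheus chart like I am. A sketch, with an illustrative URL and tenant header:

```yaml
prometheus:
  remoteWrite:
    - url: https://mimir.mydomain.local/api/v1/push
      headers:
        X-Scope-OrgID: homelab   # only needed when Mimir multi-tenancy is in play; tenant is hypothetical
```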

    As my GitOps repositories are public, have a look at my Mimir-based chart for details on my configuration and deployment.

    Labels? What labels?

    I am using the Bitnami kube-prometheus Helm chart. That particular chart allows you to define external labels using the prometheus.externalLabels value. In my case, I created a cluster label with unique values for each of my clusters. This allows me to create dashboards with a single data source which can be filtered based on each cluster using dashboard variables.

    Well…. that was easy

    All in all, it took me probably two hours to get Mimir running and collecting data from Prometheus. It took far less time than I anticipated, and opened up some new doors to reduce my observability footprint, such as:

    1. I immediately reduced the data retention on my local Prometheus installations to three days. This reduced my total disk usage in persistent volumes by about 40 GB.
    2. I started researching using the Grafana Agent as a replacement for a full Prometheus instance in each cluster. Generally, the agent should use less CPU and storage on each cluster.
3. Grafana Agent is also a replacement for Promtail, meaning I could remove both my kube-prometheus and promtail tools and replace them with a single instance of the Grafana Agent. This greatly simplifies the configuration of observability within the cluster (a rough config sketch follows this list).
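Purely as a sketch of what that consolidation could look like, here is a static-mode Grafana Agent config with placeholder endpoints and the actual scrape/discovery sections omitted:

```yaml
server:
  log_level: info
metrics:
  wal_directory: /tmp/agent/wal
  global:
    scrape_interval: 60s
    remote_write:
      - url: https://mimir.mydomain.local/api/v1/push   # illustrative Mimir endpoint
  configs:
    - name: default
      scrape_configs: []      # kubernetes_sd_configs would go here
logs:
  configs:
    - name: default
      positions:
        filename: /tmp/positions.yaml
      clients:
        - url: https://loki.mydomain.local/loki/api/v1/push   # illustrative Loki endpoint
      scrape_configs: []      # pod log discovery (the Promtail replacement) goes here
```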

    So, what’s the catch? Well, I’m tying myself to an opinionated stack. Sure, it’s based on open standards and has the support of the Prometheus community, but, it remains to be seen what features will become “pay to play” within the stack. Near the bottom of this blog post is some indication of what I am worried about: while the main features are licensed under AGPLv3, there are additional features that come with proprietary licensing. In my case, for my home lab, these features are of no consequence, but, when it comes to making decisions for long term Kubernetes Observability at work, I wonder what proprietary features we will require and how much it will cost us in the long run.

  • Getting Synology SNMP data into Prometheus

    With my new cameras installed, I have been spending a lot more time in the Diskstation Manager (DSM). I always forget how much actually goes on within the Synology, and I am reminded of that every time I open the Resource Monitor.

    At some point, I started to wonder whether or not I could get this data into my metrics stack. A quick Google search brought me to the SNMP Exporter and the work Jinwoong Kim has done to generate a configuration file based on the Synology MIBs.
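If you have not seen it before, the snmp_exporter pattern is a bit unusual: Prometheus scrapes the exporter and passes the device’s address along as a parameter. A sketch with illustrative names and addresses (my chart-based install wires this up through its values, but the idea is the same):

```yaml
scrape_configs:
  - job_name: synology
    static_configs:
      - targets:
          - 192.168.1.10                      # hypothetical DSM address
    metrics_path: /snmp
    params:
      module: [synology]                      # module name from the generated snmp.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target          # the device becomes the ?target= parameter
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116       # hypothetical exporter service address
```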

    It took a bit of trial and error to get the configuration into the chart, mostly because I forgot how to make a multiline string in a Go template. I used the prometheus-snmp-exporter chart as a base, and built out a chart in my GitOps repo to have this deployed to my internal monitoring cluster.
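For anyone else who forgets: the trick is a YAML block scalar in values.yaml plus nindent in the template. A sketch with hypothetical names, not my actual chart:

```yaml
# values.yaml holds the generated snmp.yml as a block scalar (snmpConfig: |),
# and a template like this re-indents it into a ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: snmp-exporter-config                  # hypothetical name
data:
  snmp.yml: |-
    {{- .Values.snmpConfig | nindent 4 }}
```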

    After grabbing one of the community Grafana dashboards, this is what I got:

    Security…. Not just yet

Apparently, SNMP has a few different versions. v1 and v2 are insecure, utilizing a community string for simple authentication. v3 adds more robust authentication and encryption, but that makes configuration more involved.

Now, my original intent was to enable SNMPv3 on the Synology; however, I quickly realized that meant putting the SNMPv3 auth values in my public configuration repo, as I have not figured out how to include that configuration as a secret (instead of a ConfigMap). So, just to get things running, I settled on enabling v1/v2 with the community string. The SNMP ports are blocked from external traffic, so I am not too concerned, but my to-do list now includes a migration to SNMPv3 for the Synology.

    When you have a hammer…

    … everything looks like a nail. The SNMP Exporter I am currently running only has configuration for Synology, but the default if_mib module that is packaged with the SNMP Exporter can grab data from a number of products which support it. So I find myself looking for “nails” as it were, or objects from which I can gather SNMP data. For the time being, I am content with the Synology data, but don’t be surprised if my network switch and server find themselves on the end of SNMP collection. I would say that my Unifi devices warrant it, but the Unifi Poller I have running takes care of gathering that data.

  • Camera 1, Camera 2…

    First: I hate using acronyms without definitions:

    • NAS – Network Attached Storage. Think “network hard drive”
    • SAN – Storage Area Network – Think “a network of hard drives”. Backblaze explains the differences nicely.
• iSCSI Target – a volume on a NAS or SAN exposed over iSCSI, a protocol that lets a server mount that volume as though it were a local disk.

    When I purchased my Synology DS1517+ in February 2018, I was immediately impressed with the capabilities of the NAS. While the original purpose was to have redundant storage for photos and an iSCSI target for my home lab server, the Synology has quickly expanded its role in my home network.

    One day, as I was browsing the different DiskStation apps, Surveillance Station caught my eye. Prior to this, I had never really thought about getting a video surveillance system. Why?

    • I never liked the idea of one of the closed network systems, and the expense of that dedicated system made it tough to swallow.
    • I really did not like the idea of my video being “in the cloud.” Perhaps that is more paranoia talking, but when I am recording everything that happens around my house, I want a smaller threat surface area.

    So what changed my mind? After shelling out the coin for my Synology and the associated drives, I felt the need to get my money’s worth out of the system. Plus, the addition of a pool in the back yard makes me want to have some surveillance on it for liability purposes.

    Years in the making

    Let’s just say I had no urgency with this project. After purchasing the NAS, a bevy of personal and professional events came up, including, but not limited to, a divorce, a global pandemic, wedding planning (within a global pandemic), and home remodeling.

    When those events started to subside in late 2021, I started shopping different Synology-compatible cameras. As you can see, the list is extensive and not the easiest shopping list. So I defaulted to my personal technical guru (thanks Justin Lauck), who basically runs his own small business at home in addition to his day job, for his recommendations. He turned me on to these Reolink Outdoor Cameras. I pulled the trigger on those bullet cameras (see what I did there) in December of 2021.

I had no desire to climb a ladder outside in the winter in Pittsburgh, so I spent some time in the winter preparing the wire runs inside. I already had an ethernet run to the second floor for one of my wireless access points, but decided it would be easiest to replace that run with two new lines of fresh Cat6: one went to the AP, the other to a small PoE switch I put in the attic.

The spring and summer brought some more major life events (just a high school graduation and a kid starting college… nothing major). The one day I did try to start, I realized that my ladder was not tall enough to get where I wanted to go. That, coupled with me absolutely not wanting to be 30′ up in the air, led me to delay a bit.

    Face your fears!

    Faced with potentially one of the last good weather weeks of the year, yesterday was as good a time as any to get a ladder and get to work. I rented a 32′ extension ladder from my local Sunbelt rentals and got to work. For whatever reason, I started with the highest point. The process was simple, albeit with a lot of “up and down” the ladder.

    1. Drill a small hole in the soffit, and feed my fish stick into the attic.
    2. Get up into the attic and find the fish stick.
    3. Tape the ethernet cable to the end of the stick.
    4. Get back up on the ladder and fish the cable outside.
5. Install the camera.
    6. Check the Reolink app on my phone (30′ up) to ensure the camera angle is as I want it to be.

Multiply that process by four cameras (one on each corner of the house), and I’m done! Well, done with the hardware install.

    Setting up Surveillance Station

After the cameras were installed and running, I started adding them to Surveillance Station. The NAS comes with a license for two cameras, meaning I could only add two of my four cameras initially. I thought I could just purchase a digital code for a four-camera pack, but realized that all the purchase options ship a physical card, meaning I have to wait for someone to mail me one. Since I had some loyalty points at work, I ordered some Amazon gift cards which I’ll put towards my four-camera license pack. So, for now, I only have two cameras in Surveillance Station.

    There are so many options for setup and configuration that for me to try to cover them all here would not do it justice. What I will say is that initial configuration was a breeze, and if you take the time to get one camera configured to your liking, you can easily copy settings from that camera to other cameras.

    As for settings, here’s just a sampling of what you can do:

• Video Stream format: If your camera is compatible with Surveillance Station, you can assign streams from your camera to profiles in Surveillance Station. My Reolinks support a low-quality stream (640×480) and a high-quality stream (2560×1440), and I can map those to different Surveillance Station profiles to change the quality of recording. For example, you can set up Surveillance Station to record at low quality when no motion is detected, and high quality when motion is detected.
    • Recording Schedule: You can customize the recording schedule based on time of day.
• On Screen Display Settings – You can control overlays (like date/time and camera name) using either the camera’s own settings or Surveillance Station. The latter is nice in that it allows you to easily copy settings across cameras.
    • Event Detection – Like On Screen Display, you can use the camera’s built-in event detection algorithm, or use Surveillance Station’s algorithm. This one I may test more: I have to assume that Surveillance Station would use more CPU on the Synology when this is set to use the Surveillance Station algorithm, as opposed to letting the camera hardware detect motion. For now, I am using the camera’s built-in algorithms.

As I said, there are lots of additional features, including notifications, time-lapse recording, and privacy screening. All in all, Surveillance Station turns your Synology into a fully functioning network video recorder.

    Going mobile

It’s worth noting that the DS Cam app allows you to access your Surveillance Station video outside of your home, as well as register for push notifications to be sent to your phone for various events. The app itself is pretty easy to use, and since it uses the built-in Diskstation authentication, there is no need to duplicate user accounts across the systems.

    Future Expansion

    Once I get the necessary license for the other two cameras, I plan on letting the current setup ride to see what I get. However, I do have a plan for two more cameras:

    1. Inside my garage facing the garage door. This would allow me to catch anyone coming in via the garage, which, right now, I only get as a side view.
    2. On my shed facing the back of my home. This would give me a full view of the back of the house. This one, though, comes with a power issue that may end up getting solved with a small off-grid solar solution.

    Additionally, I am going to research how this can tie in with my home automation platform. It might be nice to have lights kick on when motion is detected outside.

    For now, though, I will be happy knowing that I am more closely monitoring what goes on outside my home.

  • Tech Tips – Upgrading your Argo cluster tools

Moving my home lab to GitOps and ArgoCD has made deployments, well, nearly invisible. With the build pipelines I have in place, I’m able to work on my home projects without much thought to deploying changes to my clusters.

My OCD, however, prevents me from running old versions. I really want to stay up-to-date when it comes to the tools I am using. That meant periodically updating all of my Helm repositories and searching for new chart versions.

After getting tired of this “search and replace,” I wrote a small PowerShell script which automatically updates my local Helm repositories, and then searches the current directory (recursively) for Chart.yaml files. If a Chart.yaml has dependencies, the script upgrades them to their latest versions and saves the file.

It does not, at the moment, automatically commit these changes, so I do have a manual step to confirm the upgrades. However, it is a much faster way to get changes staged for review and commit.

  • I’ll take ‘Compatibility Matrices’ for $1000, Alex…

    I have posted a lot on Kubernetes over the past weeks. I have covered a lot around tools for observability, monitoring, networking, and general usefulness. But what I ran into over the past few weeks is both infuriating and completely avoidable, with the right knowledge and time to keep up to speed.

    The problem

    I do my best to try and keep things up to date in the lab. Not necessarily bleeding edge, but I make an effort to check for new versions of my tools and update them. This includes the Kubernetes versions of my clusters.

Using RKE, I am limited to the Kubernetes versions that RKE supports. So I had been running 1.23.6 (or, more specifically, v1.23.6-rancher1-1) for at least a month or so.

About two weeks ago, I updated my RKE command line and noticed that v1.23.8-rancher1-1 was available. So I changed the cluster.yml in my non-production environment and ran rke up. No errors, no problems, right? So I made the change to the internal cluster and started the rke up for that cluster. As that was processing, however, I noticed that my non-production cluster was down. Like, the Kube API was running, so I could get the pod list. But every pod was erroring out. I did not notice anything in the pod events that was even remotely helpful, other than that the pods could not be scheduled. As I did not have time to properly diagnose, I attempted to roll the cluster back to 1.23.6. That worked, so I downgraded and left it alone.
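For context, the version bump itself is a one-line change in cluster.yml (everything else, like the node and network definitions, stays as-is), followed by another rke up. A sketch:

```yaml
# The only line that changes in cluster.yml for the upgrade; nodes, network,
# and service sections are omitted here.
kubernetes_version: v1.23.8-rancher1-1
```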

    I will not let the machines win, so I stepped back into this problem today. I tried upgrading again (both to 1.23.8 and 1.24.2), with the same problems. In digging into the docker container logs on the hosts, I found the smoking gun:

    Unit kubepods-burstable.slice not found.

    Eh? I can say I have never seen that phrase before. But a quick Google search pointed me towards cgroup and, more generally, the docker runtime.

    Compatibility, you fickle fiend

As it turns out, there is quite a large compatibility matrix between Rancher, RKE, Kubernetes, and Docker itself. My move to 1.23.8 clearly pushed my Kubernetes version past what my Docker version supports (which, if you care, was Docker 20.10.12 on Ubuntu 22.04).

    I downgraded the non-production cluster once again, then upgraded Docker on those nodes (a simple sudo apt upgrade docker). Then I tried the upgrade, first to v1.23.8 (baby steps, folks).

    Success! Kubernetes version was upgraded, and all the pods restarted as expected. Throwing caution to the wind, I upgraded the cluster to 1.24.2. And, this time, no problems.

    Lessons Learned

    Kubernetes is a great orchestration tool. Coupled with the host of other CNCF tooling available, one can design and build robust application frameworks which let developers focus more on business code than deployment and scaling.

But that robustness comes with a side of complexity. The host of container runtimes, Kubernetes versions, and third party hosts means cluster administration is, and will continue to be, a full time job. Just look at the Rancher Manager matrix at SUSE. Thinking about keeping production systems running while juggling the compatibility of all these different pieces makes me glad that I do not have to manage these beasts daily.

Thankfully, cloud providers like Azure, GCP, and AWS provide some respite by simplifying some of this. Their images all but force compatibility, so one does not run into the issues that I hit on my home cluster. I am much more confident in their ability to run a production cluster than I am in my own skills as a Kubernetes admin.

  • Kubernetes Observability, Part 4 – Using Linkerd for Service Observability

    This post is part of a series on observability in Kubernetes clusters:

As we started to look at traffic within our Kubernetes clusters, the notion of adding a service mesh crept into our discussions. I will not pretend to be an expert in service meshes, but the folks at buoyant.io (the creators of Linkerd) have done a pretty good job of explaining service meshes for engineers.

My exercise to install Linkerd as a cluster tool was more an exercise in “can I do it” than a response to an actual need for a service mesh. However, Linkerd’s architecture is such that I can have Linkerd installed in the cluster but only active on the services that need it. This is accomplished via pod annotations, which makes the system very configurable.

    Installing Linkerd

    With my ArgoCD setup, adding Linkerd as a cluster tool was pretty simple: I added the chart definition to the repository, then added a corresponding ApplicationSet definition. The ApplicationSet defined a cluster generator with a label match, meaning Linkerd would only be installed to clusters where I added spydersoft.io/linkerd=true as a label on the ArgoCD cluster secret.
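A trimmed sketch of that pattern; the repo URL, path, and project are placeholders rather than my real repository:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: linkerd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            spydersoft.io/linkerd: "true"   # only clusters labeled this way get Linkerd
  template:
    metadata:
      name: 'linkerd-{{name}}'
    spec:
      project: cluster-tools                # hypothetical Argo CD project
      source:
        repoURL: https://example.com/gitops.git   # placeholder repo
        path: cluster-tools/linkerd               # placeholder path
        targetRevision: HEAD
      destination:
        server: '{{server}}'
        namespace: linkerd
```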

The most troublesome part of the installation process was figuring out how to manage Linkerd via GitOps. The folks at Linkerd, however, have a LOT of guides to help. You can review my chart definition for my installation methods; however, it was built from the following Linkerd articles:

    Many kudos to the Linkerd team, as their documentation was thorough and easy to follow.

    Adding Linkerd-viz

Linkerd-viz is an add-on to Linkerd that has its own Helm chart. As such, I manage it as a separate cluster tool. The visualization add-on has a dashboard that can be exposed via ingress to provide an overview of Linkerd and the metrics it is collecting. In my case, I tried to expose Linkerd-viz on a subpath (using my cluster’s internal domain name as the host). I ran into some issues (more on that below), but overall it works well.

    I broke it…

As I started adding podAnnotations to inject Linkerd into my pods, things seemed to be “just working.” I even decorated my Nginx ingress controllers following the Linkerd guide, which meant traffic within my cluster was all going through Linkerd. This seemed to work well, until I tried to access my installation of Home Assistant. I spent a good while trying to debug, but as soon as I removed the pod annotations from Nginx, Home Assistant started working. While I am sure there is a way to fix that, I have not had much time to devote to the home lab recently, so it is on my to-do list.
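For reference, the injection itself is just an annotation; in most charts it boils down to something like this (where the key lives in the values varies by chart):

```yaml
podAnnotations:
  linkerd.io/inject: enabled   # tells the Linkerd proxy injector to add the sidecar
```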

I also noticed that the Linkerd-viz dashboard does not, at all, like being hosted at a non-root URL. This has been documented as a bug in Linkerd, but it is currently marked with the “help wanted” tag, so I am not expecting a fix anytime soon. However, that bug identifies an ingress configuration snippet that can be added to the ingress definition to provide some basic rewrite functionality. It is a dirty workaround, and it does not fix everything, but it is serviceable.

    Benefits?

For the pods that I have marked up, I can glance at the network traffic and latency between services. I have started to create Grafana dashboards in my external instance to pull those metrics into easy-to-read graphs for network performance.

    I have a lot more learning to do when it comes to Linkerd. While it is installed and running, I am certainly not using it for any heavy tasking. I hope to make some more time to investigate, but for now, I have some additional network metrics that help me understand what is going on in my clusters.

  • Kubernetes Observability, Part 3 – Dashboards with Grafana

    This post is part of a series on observability in Kubernetes clusters:

What good is Loki’s log collection or Prometheus’ metrics scraping without a way to see it all? Loki, like Grafana itself, comes from Grafana Labs, which is building out an observability stack similar to Elastic’s. I have used both stacks, and Grafana’s is much easier to get started with than Elastic. For my home lab, it is perfect: simple start and config, fairly easy monitoring, and no real administrative work to do.

    Installing Grafana

    The Helm chart provided for Grafana was very easy to use, and the “out-of-box” configuration was sufficient to get started. I configured a few more features to get the most out of my instance:

    • Using an existing secret for admin credentials: this secret is created by an ExternalSecret resource that pulls secrets from my local Hashicorp Vault instance.
• Configuration for Azure AD users. Grafana’s documentation details additional actions that need to be done in your Azure AD instance. Note that envFromSecret references the Kubernetes Secret that gets expanded into the environment, storing my Azure AD Client ID and Client Secret (a rough values sketch follows this list).
    • Added the grafana-piechart-panel plugin, as some of the dashboards I downloaded referenced that.
    • Enabled a Prometheus ServiceMonitor resource to scrape Grafana metrics.
    • Annotated the services and pods for Linkerd.
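Pulling those together, a hedged sketch of the values involved; the secret names and Azure AD details are illustrative, not my actual configuration:

```yaml
admin:
  existingSecret: grafana-admin-credentials   # created by an ExternalSecret (name illustrative)
envFromSecret: grafana-azuread-oauth          # holds the Azure AD client ID/secret (name illustrative)
plugins:
  - grafana-piechart-panel
serviceMonitor:
  enabled: true
grafana.ini:
  auth.azuread:
    enabled: true
    client_id: $__env{AZUREAD_CLIENT_ID}      # expanded from the env vars above
    client_secret: $__env{AZUREAD_CLIENT_SECRET}
    scopes: openid email profile
```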

    Adding Data Sources

With Grafana up and running, I added five data sources. Each of my clusters has its own Prometheus instance, and I added my Loki instance for logs. Eventually, Thanos will aggregate my Prometheus metrics into a single data source, but that is a topic for another day.
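Those data sources can also be provisioned through the Grafana chart; here is a sketch with illustrative URLs, one Prometheus source per cluster plus Loki:

```yaml
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus-Internal
        type: prometheus
        url: https://prometheus.internal.mydomain.local   # hypothetical per-cluster endpoint
      - name: Loki
        type: loki
        url: https://loki.internal.mydomain.local          # hypothetical Loki endpoint
```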

    Exploring

    The Grafana interface is fairly intuitive. The Explore section lets you poke around your data sources to review the data you are collecting. There are “builders” available to help you construct your PromQL (Prometheus Query Language) or LogQL (Log Query Language) queries. Querying Metrics automatically displays a line chart with your values, making it pretty easy to review your query results and prepare the query for inclusion in a dashboard.

    When debugging my applications, I use the Explore section almost exclusively to review incoming logs. A live log view and search with context makes it very easy to find warnings and errors within the log entries and determine issues.

    Building Dashboards

    In my limited use of the ELK stack, one thing that always got me was the barrier of entry into Kibana. I have always found that having examples to tinker with is a much easier way for me to learn, and I could never find good examples of Kibana dashboards that I could make my own.

    Grafana, however, has a pretty extensive list of community-built dashboards from which I could begin my learning. I started with some of the basics, like Cluster Monitoring for Kubernetes. And, well, even in that one, I ran into some issues. The current version in the community uses kubernetes_io_hostname as a label for the node name. It would seem that label has changed to node in kube-state-metrics, so I had to import that dashboard and make changes to the queries in order for the data to show.

    I found a few other dashboards that illustrated what I could do with Grafana:

    • ArgoCD – If you add a ServiceMonitor to ArgoCD, you can collect a number of metrics around application status and synchronizations, and this dashboard gives a great view of how ArgoCD is running.
• Unifi Poller Dashboards – Unifi Poller is an application that polls the Unifi controller to pull metrics and expose them to Prometheus in the OpenTelemetry standard. It includes dashboards for a number of different metrics.
    • NGINX Ingress – If you use NGINX for your Ingress controller, you can add a ServiceMonitor to scrape metrics on incoming network traffic.

    With these examples in hand, I have started to build out my own dashboards for some of my internal applications.

  • Kubernetes Observability, Part 2 – Collecting Metrics with Prometheus

    This post is part of a series on observability in Kubernetes clusters:

    Part 1 – Collecting Logs with Loki
    Part 2 – Collecting Metrics with Prometheus (this post)
    Part 3 – Building Grafana Dashboards
    Part 4 – Using Linkerd for Service Observability
    Part 5 – Using Mimir for long-term metric storage

    “Prometheus” appears in many Kubernetes blogs the same way that people whisper a famous person’s name as they enter a crowded room. Throughout a lot of my initial research, particularly with the k8s-at-home project, I kept seeing references to Prometheus in various Helm charts. Not wanting to distract myself, I usually just made sure it was disabled and skipped over it.

    As I found the time to invest in Kubernetes observability, gathering and viewing metrics became a requirement. So I did a bit of a deep dive into Prometheus and how I could use it to view metrics.

    The Short, Short Version

By no means am I going to regurgitate the documentation around Prometheus. There is a lot to ingest around settings and configuration, and I have not even begun to scratch the surface of what I can do with Prometheus. My simple goal was to get Prometheus installed and running on each of my clusters, gathering cluster metrics. And I did this by installing Prometheus on each cluster using Bitnami's kube-prometheus Helm chart. Check out the cluster-tools folder in my Argo repository for an example.

    The Bitnami chart includes kube-state-metrics and node-exporter, in addition to settings for Thanos as a sidecar. So, with the kube-prometheus install up and running, I was able to create a new data source in Grafana and view the collected metrics. These metrics were fairly basic, consisting mainly of the metrics being reported by kube-state-metrics.

    Enabling ServiceMonitors

Many of the charts I use to install cluster tools come with values sections that enable metrics collection via a ServiceMonitor CRD. A ServiceMonitor instance instructs Prometheus how to discover and scrape services within the cluster.
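If you have not seen one, a minimal ServiceMonitor looks something like this (names are placeholders): Prometheus discovers any Service matching the selector and scrapes the named port on each endpoint.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  labels:
    release: kube-prometheus        # must match the Prometheus serviceMonitorSelector, if one is set
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  endpoints:
    - port: metrics                 # named port on the Service
      interval: 30s
      path: /metrics
```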

    For example, I use the following Helm charts with ServiceMonitor definitions:

    So, in order to get metrics on these applications, I simply edited my values.yaml file and enabled the creation of a ServiceMonitor resource for each.
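The shape this usually takes in chart values is something like the following, though the exact keys differ from chart to chart:

```yaml
metrics:
  enabled: true
  serviceMonitor:
    enabled: true    # have the chart create the ServiceMonitor for Prometheus to pick up
```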

    Prometheus as a piece of the puzzle

I know I skipped a LOT. In a single-cluster environment, installing Prometheus should be as simple as choosing your preferred installation method and starting to scrape. As always, my multi-cluster home lab presents some questions that mirror questions I get at work. In this case: how do we scale out to manage multiple clusters?

Prometheus allows for federation, meaning one Prometheus server can scrape other Prometheus servers to gather metrics. Through a lot of web searching, I came across an article from Vikram Vaswani and Juan Ariza centered on creating a multi-cluster monitoring dashboard using Prometheus, Grafana, and Thanos. The described solution is close to what I would like to see in my home lab. While I will touch on Grafana and Thanos in later articles, the key piece of this article was installing Prometheus in each cluster and, as described, creating a Thanos sidecar to aid in the implementation of Thanos.
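In the Bitnami kube-prometheus chart, enabling that sidecar is, as I understand it, a single toggle (the object storage wiring for Thanos itself is a separate exercise):

```yaml
prometheus:
  thanos:
    create: true    # run the Thanos sidecar alongside each Prometheus instance
```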

    Why Bitnami?

    I have something of a love/hate relationship with Bitnami. The VMWare-owned property is a collection of various images and Helm charts meant to be the “app store” for your Kubernetes cluster. Well, more specifically, it is meant to be the app store for your VMWare Tanzu environments, but the Helm charts are published so that others can use them.

    Generally, I prefer using “official” charts for applications. This usually ensures that the versions are the most current, and the chart is typically free from the “bloat” that can sometimes happen in Bitnami charts, where they package additional sub-charts to make things easy.

    That is not to say that I do not use Bitnami charts at all. My WordPress installation uses the Bitnami chart, and it serves its purpose. However, being community-driven, the versions can lag a bit. I know the 6.x release of WordPress took a few weeks to make it from WordPress to Bitnami.

    For Prometheus, there are a few community-driven charts, but you may notice that Helm is not in the list of installation methods. This, coupled with the desire to implement Thanos later, led me to the kube-prometheus chart by Bitnami. You may have better mileage with one of the Prometheus Community charts, but for now I am sticking to the Bitnami chart.