Kubernetes Observability, Part 2 – Collecting Metrics with Prometheus

This post is part of a series on observability in Kubernetes clusters:

Part 1 – Collecting Logs with Loki
Part 2 – Collecting Metrics with Prometheus (this post)
Part 3 – Building Grafana Dashboards
Part 4 – Using Linkerd for Service Observability
Part 5 – Using Mimir for Long-Term Metric Storage

“Prometheus” appears in many Kubernetes blogs the same way that people whisper a famous person’s name as they enter a crowded room. Throughout a lot of my initial research, particularly with the k8s-at-home project, I kept seeing references to Prometheus in various Helm charts. Not wanting to distract myself, I usually just made sure it was disabled and skipped over it.

As I found the time to invest in Kubernetes observability, gathering and viewing metrics became a requirement. So I did a bit of a deep dive into Prometheus and how I could use it to view metrics.

The Short, Short Version

I am not going to regurgitate the Prometheus documentation. There is a lot to digest around settings and configuration, and I have barely scratched the surface of what Prometheus can do. My goal was simple: get Prometheus installed and running on each of my clusters, gathering metrics for that cluster. I did this using Bitnami’s kube-prometheus Helm chart. Check out the cluster-tools folder in my Argo repository for an example.
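
For readers who also drive installs through Argo CD, the chart install can be sketched as an Application resource. This is a minimal illustration under stated assumptions, not a copy of the cluster-tools folder: the application name, the `argocd` and `monitoring` namespaces, the chart repository URL, and the wildcard chart version are all placeholders to swap for your own.

```yaml
# Sketch of an Argo CD Application that installs the kube-prometheus chart.
# Names, namespaces, and the repository URL are assumptions for illustration.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus            # hypothetical application name
  namespace: argocd                # namespace where Argo CD itself runs (assumed)
spec:
  project: default
  source:
    repoURL: https://charts.bitnami.com/bitnami   # assumed chart repository
    chart: kube-prometheus
    targetRevision: "*"            # tracks the latest chart; pin a version in practice
  destination:
    server: https://kubernetes.default.svc        # the cluster Argo CD runs in
    namespace: monitoring          # hypothetical target namespace
  syncPolicy:
    automated: {}                  # sync automatically when the source changes
    syncOptions:
      - CreateNamespace=true       # create the target namespace if it is missing
```

Pinning `targetRevision` to a specific chart version is usually the safer choice, since a wildcard lets chart upgrades roll out unreviewed.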

The Bitnami chart includes kube-state-metrics and node-exporter, in addition to settings for Thanos as a sidecar. So, with the kube-prometheus install up and running, I was able to create a new data source in Grafana and view the collected metrics. These metrics were fairly basic, consisting mainly of the metrics being reported by kube-state-metrics.
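
Creating the data source can be done through the Grafana UI, or declaratively with a provisioning file. The sketch below shows the declarative form; the data source name and the in-cluster service URL are assumptions, so check the service name and port that your own install exposes.

```yaml
# Grafana data source provisioning file, for example mounted under
# /etc/grafana/provisioning/datasources/.
apiVersion: 1
datasources:
  - name: prometheus-cluster-a     # hypothetical display name
    type: prometheus
    access: proxy                  # Grafana's backend proxies the queries
    # Hypothetical in-cluster address; replace with your Prometheus service.
    url: http://kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
    isDefault: true
```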

Enabling ServiceMonitors

Many of the charts I use to install cluster tools come with values sections that enable metrics collection via a ServiceMonitor custom resource. A ServiceMonitor tells Prometheus how to discover and scrape services within the cluster.
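
To make that concrete, here is a minimal sketch of a ServiceMonitor. The resource name, namespaces, label, and port name are all hypothetical; the shape follows the `monitoring.coreos.com/v1` API that the Prometheus Operator watches.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                # hypothetical name
  namespace: monitoring            # hypothetical namespace for the monitor itself
spec:
  namespaceSelector:
    matchNames:
      - example                    # hypothetical namespace holding the Service
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app   # must match labels on the Service
  endpoints:
    - port: metrics                # name of the Service port that serves metrics
      path: /metrics
      interval: 30s                # how often to scrape this endpoint
```

Note that `port` refers to the named port on the Service, not a number, and that depending on how Prometheus is configured, the ServiceMonitor may also need specific labels before it gets picked up.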

For example, several of the Helm charts I use ship with ServiceMonitor definitions.

So, in order to get metrics on these applications, I simply edited my values.yaml file and enabled the creation of a ServiceMonitor resource for each.
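
The exact keys differ from chart to chart, so treat the following as an illustrative sketch of a common pattern rather than the values for any specific chart. Check each chart's own values.yaml for the real key names.

```yaml
# values.yaml for a hypothetical chart; key names vary between charts.
metrics:
  enabled: true          # expose the application's metrics endpoint
  serviceMonitor:
    enabled: true        # have the chart create a ServiceMonitor resource
```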

Prometheus as a piece of the puzzle

I know I skipped a LOT. In a single-cluster environment, installing Prometheus should be as simple as choosing your preferred installation method and starting to scrape. As always, my multi-cluster home lab presents some questions that mirror questions I get at work. In this case, how do we scale out to manage multiple clusters?

Prometheus allows for federation, meaning one Prometheus server can scrape other Prometheus servers to gather metrics. Through a lot of web searching, I came across an article from Vikram Vaswani and Juan Ariza centered on creating a multi-cluster monitoring dashboard using Prometheus, Grafana, and Thanos. The described solution is close to what I would like to see in my home lab. While I will touch on Grafana and Thanos in later articles, the key piece of this article was installing Prometheus in each cluster and, as described, creating a Thanos sidecar alongside it to aid in the implementation of Thanos.
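
For reference, plain federation (without Thanos) is configured as an ordinary scrape job on the central server that pulls from the `/federate` endpoint of the others. The sketch below follows the pattern in the Prometheus documentation; the target hostnames and the `match[]` selectors are hypothetical and would need to reflect your own clusters and jobs.

```yaml
# Scrape job on the central Prometheus server.
scrape_configs:
  - job_name: federate
    scrape_interval: 15s
    honor_labels: true             # keep the labels set by the source servers
    metrics_path: /federate
    params:
      "match[]":                   # only series matching these selectors are pulled
        - '{job="kubernetes-nodes"}'        # hypothetical job name
        - '{__name__=~"job:.*"}'            # aggregated recording rules
    static_configs:
      - targets:                   # hypothetical per-cluster Prometheus endpoints
          - prometheus.cluster-a.example.com:9090
          - prometheus.cluster-b.example.com:9090
```

At least one `match[]` selector is required, and keeping the selectors narrow avoids copying every series from every cluster into the central server.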

Why Bitnami?

I have something of a love/hate relationship with Bitnami. The VMware-owned property is a collection of various images and Helm charts meant to be the “app store” for your Kubernetes cluster. Well, more specifically, it is meant to be the app store for your VMware Tanzu environments, but the Helm charts are published so that others can use them.

Generally, I prefer using “official” charts for applications. This usually ensures that the versions are the most current, and the chart is typically free from the “bloat” that can sometimes happen in Bitnami charts, where they package additional sub-charts to make things easy.

That is not to say that I do not use Bitnami charts at all. My WordPress installation uses the Bitnami chart, and it serves its purpose. However, being community-driven, the versions can lag a bit. I know the 6.x release of WordPress took a few weeks to make it from WordPress to Bitnami.

For Prometheus, there are a few community-driven charts, but you may notice that Helm is not in the list of installation methods. This, coupled with the desire to implement Thanos later, led me to the kube-prometheus chart by Bitnami. You may have better luck with one of the Prometheus Community charts, but for now I am sticking to the Bitnami chart.