Tag: Grafana

  • Kubernetes Observability, Part 2 – Collecting Metrics with Prometheus

    This post is part of a series on observability in Kubernetes clusters:

    Part 1 – Collecting Logs with Loki
    Part 2 – Collecting Metrics with Prometheus (this post)
    Part 3 – Building Grafana Dashboards
    Part 4 – Using Linkerd for Service Observability
    Part 5 – Using Mimir for long-term metric storage

    “Prometheus” appears in many Kubernetes blogs the same way that people whisper a famous person’s name as they enter a crowded room. Throughout a lot of my initial research, particularly with the k8s-at-home project, I kept seeing references to Prometheus in various Helm charts. Not wanting to distract myself, I usually just made sure it was disabled and skipped over it.

    As I found the time to invest in Kubernetes observability, gathering and viewing metrics became a requirement. So I did a bit of a deep dive into Prometheus and how I could use it to view metrics.

    The Short, Short Version

    By no means am I going to regurgitate the Prometheus documentation. There is a lot to ingest around settings and configuration, and I have barely begun to scratch the surface of what Prometheus can do. My simple goal was to get Prometheus installed and running on each of my clusters, gathering metrics. I did this using Bitnami’s kube-prometheus Helm chart. Check out the cluster-tools folder in my Argo repository for an example.
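
    Roughly, an Argo CD Application for this chart looks something like the sketch below; the chart version, namespace, and sync options are placeholders rather than a copy of my actual cluster-tools folder.

      # Sketch of an Argo CD Application pulling in Bitnami's kube-prometheus chart.
      # Chart version, namespace, and sync options are placeholders.
      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: kube-prometheus
        namespace: argocd
      spec:
        project: default
        source:
          repoURL: https://charts.bitnami.com/bitnami
          chart: kube-prometheus
          targetRevision: 8.x.x          # pin to the chart version you actually run
        destination:
          server: https://kubernetes.default.svc
          namespace: monitoring
        syncPolicy:
          automated:
            selfHeal: true
          syncOptions:
            - CreateNamespace=true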

    The Bitnami chart includes kube-state-metrics and node-exporter, in addition to settings for Thanos as a sidecar. So, with the kube-prometheus install up and running, I was able to create a new data source in Grafana and view the collected metrics. These metrics were fairly basic, consisting mainly of the metrics being reported by kube-state-metrics.
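
    If you provision Grafana data sources from a file, the entry looks something like this; the service URL is a placeholder for whatever the chart actually created in your monitoring namespace.

      # Grafana data source provisioning sketch; adjust the URL to the Prometheus
      # service the chart created (kubectl get svc -n monitoring).
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          access: proxy
          url: http://kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
          isDefault: true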

    Enabling ServiceMonitors

    Many of the charts I use to install cluster tools come with values sections that enable metrics collection via a ServiceMonitor resource. A ServiceMonitor instance instructs Prometheus how to discover and scrape services within the cluster.

    For example, several of the Helm charts I use for cluster tools ship with ServiceMonitor definitions.

    So, in order to get metrics on these applications, I simply edited my values.yaml file and enabled the creation of a ServiceMonitor resource for each.
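
    The exact keys differ from chart to chart, but the pattern is usually a metrics or serviceMonitor toggle in the values, which then renders a ServiceMonitor resource along the lines of the sketch below (the app name and labels are made up for illustration).

      # Typical values.yaml toggle -- the key names vary by chart:
      metrics:
        enabled: true
        serviceMonitor:
          enabled: true

      # ...which renders a ServiceMonitor along these lines (hypothetical app):
      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        name: my-app
        labels:
          release: kube-prometheus   # some Prometheus installs only select ServiceMonitors with specific labels
      spec:
        selector:
          matchLabels:
            app.kubernetes.io/name: my-app
        endpoints:
          - port: metrics
            interval: 30s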

    Prometheus as a piece of the puzzle

    I know I skipped a LOT. In a single-cluster environment, installing Prometheus should be as simple as choosing your preferred installation method and starting to scrape. As always, my multi-cluster home lab presents some questions that mirror the ones I get at work. In this case: how do we scale out to manage multiple clusters?

    Prometheus allows for federation, meaning one Prometheus server can scrape other Prometheus servers to gather metrics. Through a lot of web searching, I came across an article from Vikram Vaswani and Juan Ariza centered on creating a multi-cluster monitoring dashboard using Prometheus, Grafana, and Thanos. The described solution is close to what I would like to see in my home lab. While I will touch on Grafana and Thanos in later articles, the key piece of this article was installing Prometheus in each cluster and, as described, deploying a Thanos sidecar alongside each one to pave the way for Thanos later.
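
    For the sidecar route, the Bitnami chart exposes Thanos settings in its values. The snippet below is the sort of thing that tutorial points at; the key names can shift between chart versions, and the external label is just an example for telling clusters apart.

      # Values sketch for enabling the Thanos sidecar in Bitnami's kube-prometheus chart.
      # Check the chart's values.yaml for your version -- these keys are illustrative.
      prometheus:
        externalLabels:
          cluster: homelab-east      # example label so each cluster's metrics are identifiable
        thanos:
          create: true               # run Thanos as a sidecar next to Prometheus
          service:
            type: LoadBalancer       # expose the sidecar's gRPC endpoint to a central Thanos Query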

    Why Bitnami?

    I have something of a love/hate relationship with Bitnami. The VMware-owned project is a collection of container images and Helm charts meant to be the “app store” for your Kubernetes cluster. Well, more specifically, it is meant to be the app store for VMware Tanzu environments, but the Helm charts are published so that others can use them.

    Generally, I prefer using “official” charts for applications. This usually ensures that the versions are the most current, and the chart is typically free from the “bloat” that can sometimes happen in Bitnami charts, where they package additional sub-charts to make things easy.

    That is not to say that I do not use Bitnami charts at all. My WordPress installation uses the Bitnami chart, and it serves its purpose. However, the versions can lag a bit behind upstream; I know the 6.x release of WordPress took a few weeks to make it from WordPress to the Bitnami chart.

    For Prometheus, there are a few community-driven charts, but you may notice that Helm is not listed among Prometheus’s official installation methods. This, coupled with the desire to implement Thanos later, led me to the kube-prometheus chart by Bitnami. You may have better luck with one of the Prometheus Community charts, but for now I am sticking with the Bitnami chart.

  • Kubernetes Observability, Part 1 – Collecting Logs with Loki

    This post is part of a series on observability in Kubernetes clusters:

    Part 1 – Collecting Logs with Loki (this post)
    Part 2 – Collecting Metrics with Prometheus
    Part 3 – Building Grafana Dashboards
    Part 4 – Using Linkerd for Service Observability
    Part 5 – Using Mimir for long-term metric storage

    I have been spending an inordinate amount of time wrapping my head around Kubernetes observability for my home lab. Rather than consolidate all of this into a single post, I am going to break up my learnings into bite-sized chunks. We’ll start with collecting cluster logs.

    The Problem

    Good containers generate a lot of logs. Outside of getting into the containers via kubectl, logging is the primary mechanism for identifying what is happening within a particular container. We need a way to collect the logs from various containers and consolidate them in a single place.

    My goal was to find a log aggregation solution that gives me insights into all the logs in the cluster, without needing special instrumentation.

    The Candidates – ELK versus Loki

    For a while now, I have been running an ELK (Elasticsearch/Logstash/Kibana) stack locally. My hobby projects utilized an Elasticsearch sink for Serilog to ship logs directly from those containers to Elasticsearch. I found that I could install Filebeat into the cluster and ship all container logs to Elasticsearch, which allowed me to gather logs across containers.

    ELK

    Elasticsearch is a beast. Its capabilities as a document and index solution are quite impressive. But those capabilities make it really heavy for what I wanted, which was “a way to view logs across containers.”

    For a while, I have been running an ELK instance on my internal tools cluster. It has served its purpose: I am able to report logs from Nginx via Filebeat, and my home containers are built with a Serilog sink that reports logs directly to Elasticsearch. I recently discovered how to install Filebeat onto my K8s clusters, which allows it to pull logs from the containers. This, however, exploded my Elasticsearch instance.

    Full disclosure: I’m no Elasticsearch administrator. Perhaps, with proper experience, I could make that work. But Elastic felt heavy, and I didn’t feel like I was getting value out of the data I was collecting.

    Loki

    A few of my colleagues pointed to Grafana Loki as a potential solution for log aggregation, so I attempted an installation to compare the two.

    Loki is a log aggregation system which provides log storage and querying. It is not limited to Kubernetes: there are a number of official clients for sending logs, as well as some unofficial third-party ones. Loki stores your incoming logs (see Storage below), creates indices on some of the log metadata, and provides a custom query language (LogQL) to allow you to explore your logs. Loki integrates with Grafana for visual log exploration, LogCLI for command line searches, and Prometheus AlertManager for routing alerts based on logs.
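
    Wiring Loki into Grafana is just another provisioned data source, followed by LogQL queries in the Explore view. The URL below is a placeholder for wherever your Loki lives.

      # Grafana data source provisioning sketch for Loki; the URL is a placeholder.
      apiVersion: 1
      datasources:
        - name: Loki
          type: loki
          access: proxy
          url: http://loki.example.internal:3100

      # Once connected, LogQL queries look something like:
      #   {namespace="media", app="plex"} |= "error"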

    One of the clients, Promtail, can be installed on a cluster to scrape logs and report them back to Loki. My colleague suggested a Loki instance on each cluster. I found a few discussions in Grafana’s GitHub issues section around this, but it led to a pivotal question.

    Loki per Cluster or One Loki for All?

    I laughed a little as I typed this, because the notion of “multiple Lokis” is explored in a decidedly different context in the Marvel series. My options were less exciting: do I have one instance of Loki that collects data from different clients across the clusters, or do I allow each cluster to have its own instance of Loki, and use Grafana to attach to multiple data sources?

    Why consider Loki on every cluster?

    Advantages

    Decreased network chatter – If every cluster has a local Loki instance, then logs for that cluster do not have far to go, which means minimal external network traffic.

    Localized Logs – Each cluster would be responsible for storing its own log information, so finding logs for a particular cluster is as simple as going to the cluster itself.

    Disadvantages

    Non-centralized – There is no way to query logs across clusters without some additional aggregator (like another Loki instance), which would also mean duplicative data storage.

    Maintenance Overhead – Each cluster’s Loki instance must be managed individually. This can be automated using ArgoCD and the cluster generator, but it still means that every cluster has to run Loki.

    Final Decision?

    For my home lab, Loki fits the bill. The installation was easy-ish (if you are familiar with Helm and not afraid of community forums), and once configured, it gave me the flexibility I needed with easy, declarative maintenance. But which deployment model?

    Honestly, I was leaning a bit towards the individual Loki instances. So much so that I configured Loki as a cluster tool and deployed it to all of my clusters. And that worked, although swapping around Grafana data sources for various logs started to get tedious. And, when I thought about where I should report logs for other systems (like my Raspberry Pi-based Nginx proxy), doubts crept in.

    If I were using an ELK stack, I certainly would not put an instance of Elasticsearch on every cluster. While Loki is a little lighter than Elasticsearch, it is still heavy enough to warrant a single, properly configured instance. So I removed the cluster-specific Loki instances and configured just one.

    With Promtail deployed via an ArgoCD ApplicationSet with a cluster generator, I was off to the races.
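
    A rough sketch of that ApplicationSet is below: the cluster generator stamps out one Promtail install per cluster registered in Argo CD. The chart version, namespace, and Loki URL are placeholders rather than my exact manifest.

      # ApplicationSet sketch: one Promtail Application per registered cluster.
      apiVersion: argoproj.io/v1alpha1
      kind: ApplicationSet
      metadata:
        name: promtail
        namespace: argocd
      spec:
        generators:
          - clusters: {}                  # one Application per cluster known to Argo CD
        template:
          metadata:
            name: '{{name}}-promtail'
          spec:
            project: default
            source:
              repoURL: https://grafana.github.io/helm-charts
              chart: promtail
              targetRevision: 6.x.x       # pin to the chart version you actually run
              helm:
                values: |
                  config:
                    clients:
                      - url: http://loki.example.internal:3100/loki/api/v1/push
            destination:
              server: '{{server}}'
              namespace: monitoring
            syncPolicy:
              automated:
                selfHeal: true
              syncOptions:
                - CreateNamespace=true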

    A Note on Storage

    Loki has a few storage options, with the majority being cloud-based. At work, with storage options in both Azure and GCP, this is a non-issue. But for my home lab, well, I didn’t want to burn cash storing logs when I have a perfectly good SAN sitting at home.

    My solution there was to stand up an instance of MinIO to act as S3-compatible storage for Loki. Now, could I have run MinIO on Kubernetes? Sure. But, in all honesty, it got pretty confusing pretty quickly, and I was more interested in getting Loki running. So I spun up a Hyper-V machine with Ubuntu 22.04 and started running MinIO. Maybe one day I’ll work on getting MinIO running on one of my K8s clusters, but for now, the single machine works just fine.
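
    For the curious, pointing Loki at MinIO boils down to an S3 storage block in the chart values, something along these lines; the bucket names, endpoint, and credentials are placeholders, and the exact layout depends on the Loki chart version you deploy.

      # Values sketch for S3-compatible (MinIO) storage with the grafana/loki chart.
      # Bucket names, endpoint, and credentials are placeholders.
      loki:
        storage:
          type: s3
          bucketNames:
            chunks: loki-chunks
            ruler: loki-ruler
            admin: loki-admin
          s3:
            endpoint: http://minio.example.internal:9000
            accessKeyId: loki
            secretAccessKey: supersecret
            s3ForcePathStyle: true    # required for MinIO-style path addressing
            insecure: true            # plain HTTP inside the lab network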