Kubernetes Observability, Part 1 – Collecting Logs with Loki

This post is part of a series on observability in Kubernetes clusters:

Part 1 – Collecting Logs with Loki (this post)
Part 2 – Collecting Metrics with Prometheus
Part 3 – Dashboards with Grafana
Part 4 – Using Linkerd for Service Observability
Part 5 – Using Mimir for long-term metric storage

I have been spending an inordinate amount of time wrapping my head around Kubernetes observability for my home lab. Rather than consolidate all of this into a single post, I am going to break up my learnings into bite-sized chunks. We’ll start with collecting cluster logs.

The Problem

Good containers generate a lot of logs. Outside of getting into the containers via kubectl, logging is the primary mechanism for identifying what is happening within a particular container. We need a way to collect the logs from various containers and consolidate them in a single place.

My goal was to find a log aggregation solution that gives me insights into all the logs in the cluster, without needing special instrumentation.

The Candidates – ELK versus Loki

For a while now, I have been running an ELK (Elasticsearch/Logstash/Kibana) stack locally. My hobby projects utilized an ElasticSearch sink for Serilog to ship logs directly from those images to ElasticSearch. I found that I could install Filebeats into the cluster and ship all container logs to Elasticsearch, which allowed me to gather the logs across containers.

ELK

Elasticsearch is a beast. It’s capabilities are quite impressive as a document and index solution. But those capabilities make it really heavy for what I wanted, which was “a way to view logs across containers.”

For a while, I have been running an ELK instance on my internal tools cluster. It has served its purpose: I am able to report logs from Nginx via Filebeats, and my home containers are built with a Serilog sink that reports logs directly to elastic. I recently discovered how to install Filebeats onto my K8 clusters, which allows it to pull logs from the containers. This, however, exploded my Elastic instance.

Full disclosure: I’m no Elasticsearch administrator. Perhaps, with proper experience, I could make that work. But Elastic felt heavy, and I didn’t feel like I was getting value out of the data I was collecting.

Loki

A few of my colleagues found Grafana Loki as a potential solution for log aggregation. I attempted an installation to compare the solutions.

Loki is a log aggregation system which provides log storage and querying. It is not limited to Kubernetes: there are number of official clients for sending logs, as well as some unofficial third-party ones. Loki stores your incoming logs (see Storage below), creates indices on some of the log metadata, and provides a custom query language (LogQL) to allow you to explore your logs. Loki integrates with Grafana for visual log exploration, LogCLI for command line searches, and Prometheus AlertManager for routing alerts based on logs.

One of the clients, Promtail, can be installed on a cluster to scrape logs and report them back to Loki. My colleague suggested a Loki instance on each cluster. I found a few discussions in Grafana’s Github issues section around this, but it lead to a pivotal question.

Loki per Cluster or One Loki for All?

I laughed a little as I typed this, because the notion of “multiple Lokis” is explored in decidedly different context in the Marvel series. My options were less exciting: do I have one instance of Loki that collects data from different clients across the clusters, or do I allow each cluster to have it’s own instance of Loki, and use Grafana to attach to multiple data sources?

Why consider Loki on every cluster?

Advantages

Decreased network chatter – If every cluster has a local Loki instance, then logs for that cluster do not have far to go, which means minimal external network traffic.

Localized Logs – Each cluster would be responsible for storing its own log information, so finding logs for a particular cluster is as simple as going to the cluster itself

Disadvantages

Non-centralized – The is no way to query logs across clusters without some additional aggregator (like another Loki instance). This would cause duplicative data storage

Maintenance Overhead – Each cluster’s Loki instance must be managed individually. This can be automated using ArgoCD and the cluster generator, but it still means that every cluster has to run Loki.

Final Decision?

For my home lab, Loki fits the bill. The installation was easy-ish (if you are familiar with Helm and not afraid of community forums), and once configured, it gave me the flexibility I needed with easy, declarative maintenance. But, which deployment method?

Honestly, I was leaning a bit towards the individual Loki instances. So much so that I configured Loki as a cluster tool and deployed it to all of my instances. And that worked, although swapping around Grafana data sources for various logs started to get tedious. And, when I thought about where I should report logs for other systems (like my Raspberry PI-based Nginx proxy), doubts crept in.

Thinking about using an ELK stack, I certainly would not put an instance of Elasticsearch on every cluster. While Loki is a little lighter than elastic, it’s still heavy enough that it’s worthy of a single, properly configured instance. So I removed the cluster-specific Loki instances and configured one instance.

With promtail deployed via an ArgoCD ApplicationSet with a cluster generator, I was off to the races.

A Note on Storage

Loki has a few storage options, with the majority being cloud-based. At work, with storage options in both Azure and GCP, this is a non-issue. But for my home lab, well, I didn’t want to burn cash storing logs when I have a perfectly good SAN sitting at home.

My solution there was to stand up an instance of MinIO to act as S3 storage for Loki. Now, could I have run MinIO on Kubernetes? Sure. But, in all honesty, it got pretty confusing pretty quickly, and I was more interested in getting Loki running. So I spun up a Hyper-V machine with Ubuntu 22.04 and started running MinIO. Maybe one day I’ll work on getting MinIO running on one of my K8 clusters, but for now, the single machine works just fine.