Tag: Grafana

  • Simplifying Internal Routing

    Centralizing Telemetry with Linkerd Multi-Cluster

    Running multiple Kubernetes clusters is great until you realize your telemetry traffic is taking an unnecessarily complicated path. Each cluster had its own Grafana Alloy instance dutifully collecting metrics, logs, and traces—and each one was routing through an internal Nginx reverse proxy to reach the centralized observability platform (Loki, Mimir, and Tempo) running in my internal cluster.

    This worked, but it had that distinct smell of “technically functional” rather than “actually good.” Traffic was staying on the internal network (thanks to a shortcut DNS entry that bypassed Cloudflare), but why route through an Nginx proxy when the clusters could talk directly to each other? Why maintain those external service URLs when all my clusters are part of the same infrastructure?

    Linkerd multi-cluster seemed like the obvious answer for establishing direct cluster-to-cluster connections, but the documentation leaves a lot unsaid when you’re dealing with on-premises clusters without fancy load balancers. Here’s how I made it work.

    The Problem: Telemetry Taking the Scenic Route

    My setup looked like this:

    Internal cluster: Running Loki, Mimir, and Tempo behind an Nginx gateway

    Production cluster: Grafana Alloy sending telemetry to loki.mattgerega.net, mimir.mattgerega.net, etc.

    Nonproduction cluster: Same deal, different tenant ID

    Every metric, log line, and trace span was leaving the cluster, hitting the Nginx reverse proxy, and finally making it to the monitoring services—which were running in a cluster on the same physical network. The inefficiency was bothering me more than it probably should have.

    This meant:

    – An unnecessary hop through the Nginx proxy layer

    – Extra TLS handshakes that didn’t add security value between internal services

    – DNS resolution for external service names when direct cluster DNS would suffice

    – One more component in the path that could cause issues

    The Solution: Hub-and-Spoke with Linkerd Multi-Cluster

    Linkerd’s multi-cluster feature does exactly what I needed: it mirrors services from one cluster into another, making them accessible as if they were local. The service mesh handles all the mTLS authentication, routing, and connection management behind the scenes. From the application’s perspective, you’re just calling a local Kubernetes service.

    For my setup, a hub-and-spoke topology made the most sense. The internal cluster acts as the hub—it runs the Linkerd gateway and hosts the actual observability services (Loki, Mimir, and Tempo). The production and nonproduction clusters are spokes—they link to the internal cluster and get mirror services that proxy requests back through the gateway.

    The beauty of this approach is that only the hub needs to run a gateway. The spoke clusters just run the service mirror controller, which watches for exported services in the hub and automatically creates corresponding proxy services locally. No complex mesh federation, no VPN tunnels, just straightforward service-to-service communication over mTLS.

    Gateway Mode vs. Flat Network

    (Spoiler: Gateway Mode Won)

    Linkerd offers two approaches for multi-cluster communication:

    Flat Network Mode: Assumes pod networks are directly routable between clusters. Great if you have that. I don’t. My three clusters each have their own pod CIDR ranges with no interconnect.

    Gateway Mode: Routes cross-cluster traffic through a gateway pod that handles the network translation. This is what I needed, but it comes with some quirks when you’re running on-premises without a cloud load balancer.

    The documentation assumes you’ll use a LoadBalancer service type, which automatically provisions an external IP. On-premises? Not so much. I went with NodePort instead, exposing the gateway on port 30143.

    The Configuration: Getting the Helm Values Right

    Here’s what the internal cluster’s Linkerd multi-cluster configuration looks like:

    linkerd-multicluster:
      gateway:
        enabled: true
        port: 4143
        serviceType: NodePort
        nodePort: 30143
        probe:
          port: 4191
          nodePort: 30191
    
      # Grant access to service accounts from other clusters
      remoteMirrorServiceAccountName: linkerd-service-mirror-remote-access-production,linkerd-service-mirror-remote-access-nonproduction

    And for the production/nonproduction clusters:

    linkerd-multicluster:
      gateway:
        enabled: false  # No gateway needed here
    
      remoteMirrorServiceAccountName: linkerd-service-mirror-remote-access-in-cluster-local

    The Link: Connecting Clusters Without Auto-Discovery

    Creating the cluster link was where things got interesting. The standard command assumes you want auto-discovery:

    linkerd multicluster link --cluster-name internal --gateway-addresses internal.example.com:30143

    But that command tries to do DNS lookups on the combined hostname+port string, which fails spectacularly. The fix was simple once I found it:

    linkerd multicluster link \
      --cluster-name internal \
      --gateway-addresses tfx-internal.gerega.net \
      --gateway-port 30143 \
      --gateway-probe-port 30191 \
      --api-server-address https://cp-internal.gerega.net:6443 \
      --context=internal | kubectl apply -f - --context=production

    Separating --gateway-addresses and --gateway-port made all the difference.

    I used DNS (tfx-internal.gerega.net) instead of hard-coded IPs for the gateway address. This is an internal DNS entry that round-robins across all agent node IPs in the internal cluster. The key advantage: when I cycle nodes (stand up new ones and destroy old ones), the DNS entry is maintained automatically. No manual updates to cluster links, no stale IP addresses, no coordination headaches—the round-robin DNS just picks up the new node IPs and drops the old ones.

    Service Export: Making Services Visible Across Clusters

    Linkerd doesn’t automatically mirror every service. You have to explicitly mark which services should be exported using the mirror.linkerd.io/exported: "true" label.

    For the Loki gateway (and similarly for Mimir and Tempo):

    gateway:
      service:
        labels:
          mirror.linkerd.io/exported: "true"

    Once the services were exported, they appeared in the production and nonproduction clusters with an `-internal` suffix:

    loki-gateway-internal.monitoring.svc.cluster.local

    mimir-gateway-internal.monitoring.svc.cluster.local

    tempo-gateway-internal.monitoring.svc.cluster.local
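
    Under the hood, each of those mirrored entries is just a regular Kubernetes Service created by the service mirror controller, pointing back at the hub's gateway. Roughly, and with the details treated as approximate rather than copied from my cluster:

    apiVersion: v1
    kind: Service
    metadata:
      name: loki-gateway-internal          # original service name plus the link name
      namespace: monitoring
      labels:
        mirror.linkerd.io/cluster-name: internal   # which Link created this mirror
    spec:
      # no selector: the mirror controller manages Endpoints that target the hub's gateway
      ports:
        - name: http
          port: 80                         # same port as the exported service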

    Grafana Alloy: Switching to Mirrored Services

    The final piece was updating Grafana Alloy’s configuration to use the mirrored services instead of the external URLs. Here’s the before and after for Loki:

    Before:

    loki.write "default" {
      endpoint {
        url = "https://loki.mattgerega.net/loki/api/v1/push"
        tenant_id = "production"
      }
    }

    After:

    loki.write "default" {
      endpoint {
        url = "http://loki-gateway-internal.monitoring.svc.cluster.local/loki/api/v1/push"
        tenant_id = "production"
      }
    }

    No more TLS, no more public DNS, no more reverse proxy hops. Just a direct connection through the Linkerd gateway.

    But wait—there’s one more step.

    The Linkerd Injection Gotcha

    Grafana Alloy pods need to be part of the Linkerd mesh to communicate with the mirrored services. Without the Linkerd proxy sidecar, the pods can’t authenticate with the gateway’s mTLS requirements.

    This turned into a minor debugging adventure because I initially placed the `podAnnotations` at the wrong level in the Helm values. My Grafana Alloy chart wraps the official chart as a dependency, which means the structure is:

    alloy:
      controller:  # Not alloy.alloy!
        podAnnotations:
          linkerd.io/inject: enabled
      alloy:
        # ... other config

    Once that was fixed and the pods restarted, they came up with 3 containers instead of 2:

    – `linkerd-proxy` (the magic sauce)

    – `alloy` (the telemetry collector)

    – `config-reloader` (for hot config reloads)

    Checking the gateway logs confirmed traffic was flowing:

    INFO inbound:server:gateway{dst=loki-gateway.monitoring.svc.cluster.local:80}: Adding endpoint addr=10.42.5.4:8080
    INFO inbound:server:gateway{dst=mimir-gateway.monitoring.svc.cluster.local:80}: Adding endpoint addr=10.42.9.18:8080
    INFO inbound:server:gateway{dst=tempo-gateway.monitoring.svc.cluster.local:4317}: Adding endpoint addr=10.42.10.13:4317

    Known Issues: Probe Health Checks

    There’s one quirk worth mentioning: the multi-cluster probe health checks don’t work in NodePort mode. The service mirror controller tries to check the gateway’s health endpoint and reports it as unreachable, even though service mirroring works perfectly.

    From what I can tell, this is because the health check endpoint expects to be accessed through the gateway service, but NodePort doesn’t provide the same service mesh integration as a LoadBalancer. The practical impact? None. Services mirror correctly, traffic routes successfully, mTLS works. The probe check just complains in the logs.

    What I Learned

    1. Gateway mode is essential for non-routable pod networks. If your clusters don’t have a CNI that supports cross-cluster routing, gateway mode is the way to go.

    2. NodePort works fine for on-premises gateways. You don’t need a LoadBalancer if you’re willing to manage DNS.

    3. DNS beats hard-coded IPs. Using `tfx-internal.gerega.net` means I can recreate nodes without updating cluster links.

    4. Service injection is non-negotiable. Pods must be part of the Linkerd mesh to access mirrored services. No injection, no mTLS, no connection.

    5. Helm values hierarchies are tricky. Always check the chart templates when podAnnotations aren’t applying. Wrapper charts add extra nesting.

    The Result

    Telemetry now flows directly from production and nonproduction clusters to the internal observability stack through Linkerd’s multi-cluster gateway—all authenticated via mTLS, bypassing the Nginx reverse proxy entirely.

    I didn’t reduce the number of monitoring stacks (each cluster still runs Grafana Alloy for collection), but I simplified the routing by using direct cluster-to-cluster connections instead of going through the Nginx proxy layer. No more proxy hops. No more external service DNS. Just three Kubernetes clusters talking to each other the way they should have been all along.

    The full configuration is in the ops-argo and ops-internal-cluster repositories, managed via ArgoCD ApplicationSets. Because if there’s one thing I’ve learned, it’s that GitOps beats manual kubectl every single time.

  • Automating Grafana Backups

    After a few data loss events, I took the time to automate my Grafana backups.

    A bit of instability

    It has been almost a year since I moved to a MySQL backend for Grafana. In that year, I’ve gotten a corrupted MySQL database twice now, forcing me to restore from a backup. I’m not sure if it is due to my setup or bad luck, but twice in less than a year is too much.

    In my previous post, I mentioned the Grafana backup utility as a way to preserve this data. My short-sightedness prevented me from automating those backups, however, so I suffered some data loss. After the most recent event, I revisited the backup tool.

    Keep your friends close…

    My first thought was to simply write a quick Azure DevOps pipeline to pull the tool down, run a backup, and copy it to my SAN. I would also have had to include some scripting to clean up old backups.

    As I read through the grafana-backup-tool documentation, though, I came across examples of running the tool as a Job in Kubernetes via a CronJob. This presented a unique opportunity: configure the backup job as part of the Helm chart.

    What would that look like? Well, I do not install any external charts directly. They are configured as dependencies for charts of my own. Now, usually, that just means a simple values file that sets the properties on the dependency. In the case of Grafana, though, I’ve already used this functionality to add two dependent charts (Grafana and MySQL) to create one larger application.

    This setup also allows me to add additional templates to the Helm chart to create my own resources. I added two new resources to this chart:

    1. grafana-backup-cron – A definition for the cronjob, using the ysde/grafana-backup-tool image.
    2. grafana-backup-secret-es – An ExternalSecret definition to pull secrets from Hashicorp Vault and create a Secret for the job.

    Since this is all built as part of the Grafana application, the secrets for Grafana were already available. I went so far as to add a section in the values file for the backup. This allowed me to enable/disable the backup and update the image tag easily.
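
    For illustration, the CronJob template ends up being fairly small. This is a trimmed sketch rather than my exact manifest; the schedule, values keys, and secret name are placeholders:

    {{- if .Values.backup.enabled }}
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: grafana-backup-cron
    spec:
      schedule: "0 5 * * *"                # daily; the real schedule comes from values
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: grafana-backup
                  image: "ysde/grafana-backup-tool:{{ .Values.backup.imageTag }}"
                  envFrom:
                    - secretRef:
                        name: grafana-backup-secret   # created by the ExternalSecret
    {{- end }}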

    Where to store it?

    The other enhancement I noticed in the backup tool was the ability to store files in S3 compatible storage. In fact, their example showed how to connect to a MinIO instance. As fate would have it, I have a MinIO instance running on my SAN already.

    So I configured a new bucket in my MinIO instance, added a new access key, and configured those secrets in Vault. After committing those changes and synchronizing in ArgoCD, the new resources were there and ready.
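
    The ExternalSecret is equally small: it pulls the Grafana and MinIO credentials out of Vault and drops them into the Secret the CronJob consumes. A sketch, with the store name, Vault paths, and key names as placeholders:

    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: grafana-backup-secret-es
    spec:
      refreshInterval: 1h
      secretStoreRef:
        name: vault-backend                # placeholder SecretStore name
        kind: ClusterSecretStore
      target:
        name: grafana-backup-secret        # consumed by the CronJob via envFrom
      data:
        - secretKey: GRAFANA_TOKEN         # env var names are illustrative; the backup
          remoteRef:                       # tool's docs list the exact ones it expects
            key: monitoring/grafana-backup
            property: grafana-token
        - secretKey: AWS_SECRET_ACCESS_KEY
          remoteRef:
            key: monitoring/grafana-backup
            property: minio-secret-key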

    Can I test it?

    Yes I can. Google, once again, pointed me to a way to create a Job from a CronJob:

    kubectl create job --from=cronjob/<cronjob-name> <job-name> -n <namespace-name>

    I ran the above command to create a test job. And, voilà, I have backup files in MinIO!

    Cleaning up

    Unfortunately, there doesn’t seem to be a retention setting in the backup tool. It looks like I’m going to have to write some code to clean up my Grafana backups bucket, especially since I have daily backups scheduled. Either that, or look at this issue and see if I can add it to the tool. Maybe I’ll brush off my Python skills…

  • Re-configuring Grafana Secrets

    I recently fixed some synchronization issues that had been silently plaguing some of the monitoring applications I had installed, including my Loki/Grafana/Tempo/Mimir stack. Now that the applications are being updated, I ran into an issue with the latest Helm chart’s handling of secrets.

    Sync Error?

    After I made the change to fix synchronization of the Helm charts, I went to sync my Grafana chart, but received a sync error:

    Error: execution error at (grafana/charts/grafana/templates/deployment.yaml:36:28): Sensitive key 'database.password' should not be defined explicitly in values. Use variable expansion instead.

    I certainly didn’t change anything in those files, and I am already using variable expansion in the values.yaml file anyway. What does that mean? Basically, in the values.yaml file, I used ${ENV_NAME} in areas where I had a secret value, and told Grafana to expand environment variables into the configuration.

    The latest version of the Grafana Helm chart doesn’t seem to like this. It treats ANY value in those sensitive fields as bad. A search of the Grafana Helm chart repo’s issues list yielded someone with a similar issue and a comment linking to the recommended solution.

    Same Secret, New Name

    After reading through the comment’s suggestion and Grafana’s documentation on overriding configuration with environment variables, I realized the fix was pretty easy.

    I already had a Kubernetes secret being populated from Hashicorp Vault with my secret values. I also already had envFromSecret set in the values.yaml to instruct the chart to use my secret. And, through some dumb luck, two of the three values were already named using the standards in Grafana’s documentation.

    So the “fix” was to simply remove the secret expansions from the values.yaml file and rename one of the secretKey values so that it matched Grafana’s environment variable template. You can see the diff of the change in my GitHub repository.
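
    For anyone hitting the same error, the shape of the change is roughly this; GF_DATABASE_PASSWORD is Grafana's standard environment override for database.password, while the Vault path below is a placeholder:

    # ExternalSecret data: secretKey names follow Grafana's GF_<SECTION>_<KEY> convention,
    # so the value lands in the environment and Grafana picks it up with no values.yaml expansion
    data:
      - secretKey: GF_DATABASE_PASSWORD    # renamed from a custom key name
        remoteRef:
          key: monitoring/grafana          # placeholder Vault path
          property: mysql-password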

    With that change, the Helm chart generated correctly, and once Argo had the changes in place, everything was up and running.

  • Synced, But Not: ArgoCD Differencing Configuration

    Some of the charts in my Loki/Grafana/Tempo/Mimir stack have an odd habit of not updating correctly in ArgoCD. I finally got tired of it and fixed it… I’m just not 100% sure how.

    Ignoring Differences

    At some point in the past, I had customized a few of my Application objects with ignoreDifferences settings. It was meant to tell ArgoCD to ignore fields that are managed by other things, and could change from the chart definition.

    Like what, you might ask? Well, the external-secrets chart generates its own caBundle and sets properties on a ValidatingWebhookConfiguration object. Obviously, that’s managed by the controller, and I don’t want to mess with it. However, I also don’t want ArgoCD to report that chart as Out of Sync all the time.

    So, as an example, my external-secrets application looks like this:

    project: cluster-tools
    source:
      repoURL: 'https://github.com/spydersoft-consulting/ops-argo'
      path: cluster-tools/tools/external-secrets
      targetRevision: main
    destination:
      server: 'https://kubernetes.default.svc'
      namespace: external-secrets
    syncPolicy:
      syncOptions:
        - CreateNamespace=true
        - RespectIgnoreDifferences=true
    ignoreDifferences:
      - group: '*'
        kind: '*'
        managedFieldsManagers:
          - external-secrets

    And that worked just fine. But, with my monitoring stack, well, I think I made a boo-boo.

    Ignoring too much

    When I looked at the application differences for some of my Grafana resources, I noticed that the live vs desired image was wrong. My live image was older than the desired one, and yet, the application wasn’t showing as out of sync.

    At this point, I suspected ignoreDifferences was the issue, so I looked at the Application manifest. For some reason, my monitoring applications had an Application manifest that looked like this:

    project: external-apps
    source:
      repoURL: 'https://github.com/spydersoft-consulting/ops-internal-cluster'
      path: external/monitor/grafana
      targetRevision: main
      helm:
        valueFiles:
          - values.yaml
        version: v3
    destination:
      server: 'https://kubernetes.default.svc'
      namespace: monitoring
    syncPolicy:
      syncOptions:
        - RespectIgnoreDifferences=true
    ignoreDifferences:
      - group: "*"
        kind: "*"
        managedFieldsManagers:
        - argocd-controller
      - group: '*'
        kind: StatefulSet
        jsonPointers:
          - /spec/persistentVolumeClaimRetentionPolicy
          - /spec/template/metadata/annotations/'kubectl.kubernetes.io/restartedAt'

    Notice the part where I am ignoring managed fields from argocd-controller. I have no idea why I added that, but it looks a little “all inclusive” for my tastes, and it was ONLY present in the ApplicationSet for my LGTM stack. So I commented it out.

    Now We’re Cooking!

    Lo and behold, ArgoCD looked at my monitoring stack and said “well, you have some updates, don’t you!” I spent the next few minutes syncing those applications individually. Why? There are a lot of hard-working pods in those applications, and I don’t like to cycle them all at once.

    I searched through my posts and some of my notes, and I honestly have no idea why I decided I should ignore all fields managed by argocd-controller. Needless to say, I will not be doing that again.

  • Maturing my Grafana setup

    I may have lost some dashboards and configuration recently, and it got me thinking about how to mature my Grafana setup for better persistence.

    Initial Setup

    When I first got Grafana running, it was based on the packaged Grafana Helm chart. As such, my Grafana instance was using a SQLite database file stored in the persistent volume. This limits me to a single Grafana pod, since the volume is not set up to be shared across pods. Additionally, that SQLite database file is tied to the lifecycle of the claim associated with the volume.

    And, well, at home, this is not a huge deal because of how the lab is set up for persistent volume claims. Since I use the nfs-subdir-external-provisioner, PVCs in my clusters automatically generate a subfolder in my NFS share. When the PVC is deleted, the subdir gets renamed with an archive- prefix, so I can usually dig through the folder to find the old database file.

    However, using the default Azure persistence, Azure Disks are provisioned. When a PVC gets deleted, so too does the disk, or, well, I think it does. I have not had the opportunity to dig into the Azure Disk PVC provisioning to understand how that data is handled when PVCs go away. It is sufficient to say that I lost our Grafana settings because of this.

    The New Setup

    The new plan is to utilize MySQL to store my Grafana dashboards and data sources. The configuration seems simple enough: add the appropriate entries in the grafana.ini file. I already know how to expand secrets, so getting the database secrets into the configuration was easy using the grafana.ini section of the Helm chart.
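
    For reference, the relevant piece of the values file looks something like this; the host, credentials, and environment variable name are placeholders, and the nesting assumes Grafana is a chart dependency named grafana:

    grafana:
      envFromSecret: grafana-secrets       # Secret created from Vault via ExternalSecret
      grafana.ini:
        database:
          type: mysql
          host: grafana-mysql:3306
          name: grafana
          user: grafana
          password: ${GRAFANA_DB_PASSWORD} # expanded from the environment at startup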

    For my home setup, I felt it was ok to run MySQL as another dependent chart for Grafana. Now, from the outside, you should be saying “But Matt, that only moves your persistence issues from the Grafana chart to the MySQL chart!” That is absolutely true. But, well, I have a pretty solid backup plan for those NFS shares, so for a home lab that should be fine. Plus I figured out how to backup and restore Grafana (see below).

    The real reason is that, for the instance I am running in Azure at work, I want to provision an Azure MySQL instance. This will allow me to have much better backup retention than inside the cluster, while the configuration at work will match the configuration at home. Home lab in action!

    Want to check out my home lab configuration? Check out my ops-internal infrastructure repository.

    Backup and Restore for Grafana

    As part of this move, I did not want to lose the settings I had in Grafana. This meant finding a backup/restore procedure that worked. An internet search led me to the Grafana Backup Tool. The tool provides backup and restore capabilities through Grafana’s APIs.

    That said, it is written in Python, so my recent foray into Python coding served me well in getting the tool up and running. Once I generated an API key, I was off and running.

    There really isn’t much to it: after configuring the URL and API token, I ran a backup to get a .tar.gz file with my Grafana contents. Did I test the backup? No. It’s the home lab; the worst that could happen is I have to re-import some dashboards and re-create some others.

    After that, I updated my Grafana instance to include the MySQL instance and updated Grafana’s configuration to use the new MySQL service. As expected, all my dashboards and data sources disappeared.

    I ran the restore function using my backup, refreshed Grafana in my browser, and I was back up and running! Testing, schmesting….

    What’s Next?

    I am going to take my newfound learnings and apply them at work:

    1. Get a new MySQL instance provisioned.
    2. Backup Grafana.
    3. Re-configure Grafana to use the new MySQL instance.
    4. Restore Grafana.

    Given the ease with which the home lab went, I cannot imagine I will run into much trouble.

  • Speed. I.. am.. Speed.

    I have heard the opening to the Cars movie more times than I can count, and Owen Wilson’s little monologue always sticks in my head. Tangentially, well, I recently logged in to Xfinity to check my data usage, which sent me down a path towards tracking data usage inside of my network. I learned a lot about what to do, and what not to do.

    We are using how much data?

    Our internet went out on Sunday. This is not a normal occurrence, so I turned off the wifi on my phone and logged in to Xfinity’s website to report the problem. Out of sheer curiosity, after reporting the downtime, I clicked on the link to see our usage.

    22 TB… That’s right, 22 terabytes of data in February, and approaching 30TB for March. And there were still 12 days left in March! Clearly, something was going on.

    Asking the Unifi Controller

    I logged in to the Unifi controller software in the hopes of identifying the source of this traffic. I jumped into the security insights, traffic identification, and looked at the data for the last month. Not one client showed more than 25 GB of traffic in the last month. That does not match what Xfinity is showing at all.

    A quick Google search led me to a few posts suggesting that the Unifi’s automated speed test can inflate your data usage without it showing up in the Unifi client stats. Mind you, these posts were 4+ years old, but I figured it was worth a shot. So I disabled the speed test in the Unifi controller, but would have to wait a day to see if the Xfinity numbers changed.

    Fast forward a day – no change. According to Xfinity I was using something like 500GB of data per day, which is nonsense. My previous months never topped 2TB, so burning through that much in just a few days meant something was wrong.

    Am I being hacked?

    Thanks to “security first” being beat into me by some of my previous security-focused peers, the first thought in my head was “Am I being hacked?” I looked through logs on the various machines and clusters, trying to find where this data was coming from and why. But nothing looked odd or out of the ordinary: no extra pods running calculations, no servers consuming huge amounts of memory or CPU. So where in the world was 50TB of data coming from?

    Unifi Poller to the Rescue

    At this point, I remembered that I have Unifi Poller running. The poller grabs data from my Unifi controller and puts it into Mimir. I started poking around the unpoller_ metrics until I found unpoller_device_bytes_total. Looking at that value for my Unifi Security Gateway, well, I saw this graph for the last 30 days:

    unpoller_device_bytes_total – Last 30 Days

    The scale on the right is bytes, so, 50,000,000,000,000 bytes, or roughly 50TB. Since I am not yet collecting the client DPI information with Unifi Poller, I just traced this data back to the start of this ramp up. It turned out to be February 14th at around 12:20 pm.

    GitOps for the win

    Since my cluster states are stored in Git repos, any changes to the state of things are logged as commits to the repository, making it very easy to track back. Combing through my commits for 2/14 around noon, I found the offending commit in the speedtest-exporter (now you see the reference to Lightning McQueen).

    In an effort to move off of the k8s-at-home charts, which are no longer being maintained, I have switched over to creating charts using Bernd Schorgers’ library chart to manage some of the simple installs. The new chart configured the ServiceMonitor to scrape every minute, which meant, well, that I was running a speed test every minute. Of every day. For a month.

    To test my theory, I shut down the speedtest-exporter pod. Before my change, I was seeing 5 and 6 GB of traffic every 30 seconds. With the speed test executing hourly, I am seeing 90-150 MB of traffic every 30 seconds. Additionally, the graph is much more sporadic, which makes more sense: I would expect traffic to increase when my kids are home and watching TV, and decrease at night. What I was seeing was a constant increase over time, which is what pointed me to the speed test. So I fixed the ServiceMonitor to only scrape once an hour, and I will check my data usage in Xfinity tomorrow to see how I did.
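
    The change itself is a single value on the ServiceMonitor. A sketch of the corrected resource, with the names and port label as placeholders:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: speedtest-exporter
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: speedtest-exporter
      endpoints:
        - port: http
          interval: 1h                     # was 1m, which kicked off a speed test on every scrape
          scrapeTimeout: 2m                # speed tests are slow; keep this below the interval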

    My apologies to my neighbors and Xfinity

    Breaking this down, I have been using something around 1TB of bandwidth per day over the last month. So, I apologize to my neighbors for potentially slowing everything down, and to Xfinity for running speed tests once a minute. That is not to say that I will stop running speed tests, but I will go back to testing once an hour rather than once a minute.

    Additionally, I’m using my newfound knowledge of the Unifi Poller metrics to write some alerts so that I can determine if too much data is coming in and out of the network.

  • A Lesson in Occam’s Razor: Configuring Mimir Ruler with Grafana

    Occam’s Razor posits “Of two competing theories, the simpler explanation is to be preferred.” I believe my high school biology teacher taught the “KISS” method (Keep It Simple, Stupid) to convey a similar principle. As I was trying to get alerts set up in Mimir using the Grafana UI, I came across an issue that was only finally solved by going back to the simplest answer.

    The Original Problem

    One of the reasons I leaned towards Mimir as a single source of metrics is that it has its own ruler component. This would allow me to store alerts in, well, two places: the Mimir ruler for metrics rules, and the Loki ruler for log rules. Sure, I could use Grafana alerts, but I like being able to run the alerts in the native system.

    So, after getting Mimir running, I went to add a new alert in Grafana. Neither my Mimir nor Loki data source showed up.

    Now, as it turns out, the Loki one was easy: I had not enabled the Loki ruler component in the Helm chart. So, a quick update of the Loki configuration in my GitOps repo, and Loki was running. However, my Mimir data source was still missing.

    Asking for help

    I looked for some time at possible solutions, and posted to the GitHub discussion looking for assistance. Dimitar Dimitrov was kind enough to step in and give me some pointers, including looking at the Chrome network tab to figure out if there were any errors.

    At that point, I first wanted to bury my head in the sand, as, of all the things I looked at, that was not one of the things I had done. But, after getting over my embarrassment, I went about debugging. There were some errors about the Access-Control-Allow-Origin header not being present, along with a 405 when attempting to make a preflight OPTIONS request.

    Dimitar suggested that I add that Access-Control-Allow-Origin header via the Mimir Helm chart’s nginx.nginxConfig section. So I did, but still had the same problems.

    I had previously tried this on Microsoft Edge and it had worked, and had been assuming it was just Chrome being overly strict. However, just for fun, I tried in a different Chrome profile, and it worked…

    Applying Occam’s Razor

    It was at this point that I thought “There is no way that something like this would not have come up with Grafana/Mimir on Chrome.” I would have expected a quick note around the Access-Control-Allow-Origin or something similar when hosting Grafana in a different subdomain than Mimir. So I took a longer look at the network log in my instance of Chrome that was still throwing errors.

    There was a redirect in there that was listed as “cached.” I thought, well, that’s odd, why would it cache that redirect? So, as a test, I disabled the cache in the Chrome debugging tools, refreshed the page, and voilà! Mimir showed as a data source for alerts.

    Looking at the resulting successful calls, I noted that ALL of the calls were proxied through grafana.mydomain.local, which made me think, “Do I really even need the Access-Control-Allow-Origin headers?” So, I removed those headers from my Mimir configuration, re-deployed, and tested with caching disabled. It worked like a champ.

    What happened?

    The best answer I can come up with is that, at some point in my clicking around Grafana with Mimir as a data source, Chrome got a 301 redirect response from https://grafana.mydomain.local/api/datasources/proxy/14/api/v1/status/buildinfo, cached it, and used it in perpetuity. Disabling the response cache fixed everything, without the need to further configure Mimir’s Nginx proxy to return special CORS headers.

    So, with many thanks to Dimitar for being my rubber duck, I am now able to start adding new rules for monitoring my cluster metrics.

  • Kubernetes Observability, Part 5 – Using Mimir for long-term metric storage

    This post is part of a series on observability in Kubernetes clusters.

    For anyone who actually reads through my ramblings, I am sure they are asking themselves this question: “Didn’t he say Thanos for long-term metric storage?” Yes, yes I did say Thanos. Based on some blog posts from VMware that I had come across, my initial thought was to utilize Thanos as my long-term metrics storage solution. Having worked with Loki first, though, I became familiar with the distributed deployment and storage configuration. So, when I found that Mimir has similar configuration and compatibility, I thought I’d give it a try.

    Additionally, remember, this is for my home lab: I do not have a whole lot of time for configuration and management. With that in mind, using the “Grafana Stack” seemed like an expedient solution.

    Getting Started with Mimir

    As with Loki, I started with the mimir-distributed Helm chart that Grafana provides. The charts are well documented, including a Getting Started guide on their website. The Helm chart includes a MinIO dependency, but, as I had already set up MinIO, I disabled the included chart and configured a new bucket for Mimir.

    As Mimir has all the APIs to stand in for Prometheus, the changes were pretty easy:

    1. Get an instance of Mimir installed in my internal cluster
    2. Configure my current Prometheus instances to remote-write to Mimir
    3. Add Mimir as a data source in Grafana.
    4. Change my Grafana dashboards to use Mimir, and modify those dashboards to filter based on the cluster label that is added.

    As my GitOps repositories are public, have a look at my Mimir-based chart for details on my configuration and deployment.

    Labels? What labels?

    I am using the Bitnami kube-prometheus Helm chart. That particular chart allows you to define external labels using the prometheus.externalLabels value. In my case, I created a cluster label with unique values for each of my clusters. This allows me to create dashboards with a single data source which can be filtered based on each cluster using dashboard variables.
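
    Both pieces live in each cluster's kube-prometheus values. Roughly like this; the Mimir URL and label value are placeholders, and the remoteWrite key name is worth double-checking against the chart's values:

    prometheus:
      externalLabels:
        cluster: production              # unique per cluster, used for dashboard filtering
      remoteWrite:
        - url: http://mimir-nginx.monitoring.svc.cluster.local/api/v1/push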

    Well…. that was easy

    All in all, it took me probably two hours to get Mimir running and collecting data from Prometheus. It took far less time than I anticipated, and opened up some new doors to reduce my observability footprint, such as:

    1. I immediately reduced the data retention on my local Prometheus installations to three days. This reduced my total disk usage in persistent volumes by about 40 GB.
    2. I started researching using the Grafana Agent as a replacement for a full Prometheus instance in each cluster. Generally, the agent should use less CPU and storage on each cluster.
    3. Grafana Agent is also a replacement for Promtail, meaning I could remove both my kube-prometheus and promtail tools and replace them with an instance of the Grafana Agent. This greatly simplifies the configuration of observability within the cluster.

    So, what’s the catch? Well, I’m tying myself to an opinionated stack. Sure, it’s based on open standards and has the support of the Prometheus community, but, it remains to be seen what features will become “pay to play” within the stack. Near the bottom of this blog post is some indication of what I am worried about: while the main features are licensed under AGPLv3, there are additional features that come with proprietary licensing. In my case, for my home lab, these features are of no consequence, but, when it comes to making decisions for long term Kubernetes Observability at work, I wonder what proprietary features we will require and how much it will cost us in the long run.

  • Getting Synology SNMP data into Prometheus

    With my new cameras installed, I have been spending a lot more time in the Diskstation Manager (DSM). I always forget how much actually goes on within the Synology, and I am reminded of that every time I open the Resource Monitor.

    At some point, I started to wonder whether or not I could get this data into my metrics stack. A quick Google search brought me to the SNMP Exporter and the work Jinwoong Kim has done to generate a configuration file based on the Synology MIBs.

    It took a bit of trial and error to get the configuration into the chart, mostly because I forgot how to make a multiline string in a Go template. I used the prometheus-snmp-exporter chart as a base, and built out a chart in my GitOps repo to have this deployed to my internal monitoring cluster.
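
    If you have not wired up snmp_exporter before, the scrape side follows the usual proxy pattern: Prometheus asks the exporter to walk a target, rather than scraping the Synology directly. A sketch, with the addresses and module name as placeholders:

    scrape_configs:
      - job_name: synology
        static_configs:
          - targets:
              - 192.168.1.50             # the Synology's address (placeholder)
        metrics_path: /snmp
        params:
          module: [synology]             # module name from the generated snmp.yml
        relabel_configs:
          - source_labels: [__address__] # pass the Synology address as the target param
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__    # actually scrape the exporter, not the Synology
            replacement: snmp-exporter:9116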

    After grabbing one of the community Grafana dashboards, I had the Synology data up on a dashboard of its own.

    Security…. Not just yet

    Apparently, SNMP has a few different versions. v1 and v2 are insecure, utilizing a community string for simple authentication. v3 adds stronger authentication and encryption, but that makes configuration more involved.

    Now, my original intent was to enable SNMPv3 on the Synology; however, I quickly realized that meant putting the auth values for SNMPv3 in my public configuration repo, as I haven’t figured out how to include that configuration as a secret (instead of a configmap). So, just to get things running, I settled on enabling v1/v2 with the community string. These SNMP ports are blocked from external traffic, so I’m not too concerned, but my to-do list now includes a migration to SNMPv3 for the Synology.

    When you have a hammer…

    … everything looks like a nail. The SNMP Exporter I am currently running only has configuration for Synology, but the default if_mib module that is packaged with the SNMP Exporter can grab data from a number of products which support it. So I find myself looking for “nails” as it were, or objects from which I can gather SNMP data. For the time being, I am content with the Synology data, but don’t be surprised if my network switch and server find themselves on the end of SNMP collection. I would say that my Unifi devices warrant it, but the Unifi Poller I have running takes care of gathering that data.

  • Kubernetes Observability, Part 3 – Dashboards with Grafana

    This post is part of a series on observability in Kubernetes clusters.

    What good is Loki’s log collection or Prometheus’ metrics scraping without a way to see it all? Loki and Grafana both come from Grafana Labs, which is building out an observability stack similar to Elastic’s. I have used both stacks, and Grafana’s is much easier to get started with than Elastic. For my home lab, it is perfect: simple start and config, fairly easy monitoring, and no real administrative work to do.

    Installing Grafana

    The Helm chart provided for Grafana was very easy to use, and the “out-of-box” configuration was sufficient to get started. I configured a few more features to get the most out of my instance:

    • Using an existing secret for admin credentials: this secret is created by an ExternalSecret resource that pulls secrets from my local Hashicorp Vault instance.
    • Configuration for Azure AD users. Grafana’s documentation details additional actions that need to be done in your Azure AD instance. Note that envFromSecret points at the Kubernetes Secret that gets expanded into the environment, storing my Azure AD Client ID and Client Secret.
    • Added the grafana-piechart-panel plugin, as some of the dashboards I downloaded referenced that.
    • Enabled a Prometheus ServiceMonitor resource to scrape Grafana metrics.
    • Annotated the services and pods for Linkerd.

    Adding Data Sources

    With Grafana up and running, I added 5 data sources. Each of my clusters has its own Prometheus instance, and I added my Loki instance for logs. Eventually, Thanos will aggregate my Prometheus metrics into a single data source, but that is a topic for another day.
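
    Data sources can also be provisioned declaratively through the Grafana chart instead of clicking them in by hand. A sketch of that values section, with the URLs as placeholders:

    datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
          - name: Loki
            type: loki
            access: proxy
            url: http://loki-gateway.monitoring.svc.cluster.local
          - name: Prometheus-Internal
            type: prometheus
            access: proxy
            isDefault: true
            url: http://kube-prometheus-prometheus.monitoring.svc.cluster.local:9090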

    Exploring

    The Grafana interface is fairly intuitive. The Explore section lets you poke around your data sources to review the data you are collecting. There are “builders” available to help you construct your PromQL (Prometheus Query Language) or LogQL (Log Query Language) queries. Querying Metrics automatically displays a line chart with your values, making it pretty easy to review your query results and prepare the query for inclusion in a dashboard.

    When debugging my applications, I use the Explore section almost exclusively to review incoming logs. A live log view and search with context makes it very easy to find warnings and errors within the log entries and determine issues.

    Building Dashboards

    In my limited use of the ELK stack, one thing that always got me was the barrier of entry into Kibana. I have always found that having examples to tinker with is a much easier way for me to learn, and I could never find good examples of Kibana dashboards that I could make my own.

    Grafana, however, has a pretty extensive list of community-built dashboards from which I could begin my learning. I started with some of the basics, like Cluster Monitoring for Kubernetes. And, well, even in that one, I ran into some issues. The current community version uses kubernetes_io_hostname as a label for the node name. It would seem that label has changed to node in kube-state-metrics, so I had to import that dashboard and make changes to the queries in order for the data to show.

    I found a few other dashboards that illustrated what I could do with Grafana:

    • ArgoCD – If you add a ServiceMonitor to ArgoCD, you can collect a number of metrics around application status and synchronizations, and this dashboard gives a great view of how ArgoCD is running.
    • Unifi Poller Dashboards – Unifi Poller is an application that polls the Unifi controller to pull metrics and expose them for Prometheus to scrape. It includes dashboards for a number of different metrics.
    • NGINX Ingress – If you use NGINX for your Ingress controller, you can add a ServiceMonitor to scrape metrics on incoming network traffic.

    With these examples in hand, I have started to build out my own dashboards for some of my internal applications.