Category: Home Lab

My Introduction to Kubernetes NetworkPolicy

The Bitnami Redis Helm chart has thrown me a curve ball over the last week or so, and made me look at Kubernetes NetworkPolicy resources.

Redis Chart Woes

Bitnami seems to be updating their charts to include default NetworkPolicy resources. While I don’t mind this, a jaunt through their open issues suggests that it has not been a smooth transition.

The redis chart’s initial release of NetworkPolicy objects broke the metrics container, since the default NetworkPolicy didn’t add the metrics port to allowed ingress ports.

So I sat on the old chart until the new Redis chart was available.

And now, Connection Timeouts

Once the update was released, I rolled out the new version of Redis. The containers came up, and I didn’t really think twice about it. Until, that is, I decided to do some updates to both my applications and my Kubernetes nodes.

I upgraded some of my internal applications to .Net 8. This caused all of them to restart, and, in the process, get their linkerd-proxy sidecars running. I also started cycling the nodes on my internal cluster. When it came time to call my Unifi IP Manager API to delete an old assigned IP, I got an internal server error.

A quick check of the logs showed that the pod’s Redis connection was failing. Odd, I thought, since most other connections have been working fine, at least through last week.

After a few different Google searches, I came across this section in the Linkerd.io documentation. As it turns out, when you use NetworkPolicy resources and opaque ports (like Redis), you have to make sure that Linkerd’s inbound port (which defaults to 4143) is also setup in the NetworkPolicy.

Adding the Linkerd port to the extraIngress section in the Redis Helm chart worked wonders. With that section in place, connectivity was restored and I could go about my maintenance tasks.

NetworkPolicy for all?

Maybe. This is my first exposure to them, so I would like to understand how they operate and what best practices are for such things. In the meantime, I’ll be a little more wary when I see NetworkPolicy resources pop up in external charts.

February 6, 2024
Environment Woes
No, this is not a post on global warming. As it turns out, I have been provisioning my Azure DevOps build agents somewhat incorrectly, at least for certain toolsets.

Sonar kicks it off

It started with this error in my build pipeline:
```
ERROR: 

The version of Java (11.0.21) used to run this analysis is deprecated, and SonarCloud no longer supports it. Please upgrade to Java 17 or later.
As a temporary measure, you can set the property 'sonar.scanner.force-deprecated-java-version' to 'true' to continue using Java 11.0.21
This workaround will only be effective until January 28, 2024. After this date, all scans using the deprecated Java 11 will fail.
```
I provision my build agents using the Azure DevOps/GitHub Actions runner images repository, so I know it has multiple versions of Java. I logged in to the agent, and the necessary environment variables (including JAVA_HOME_17_X64) are present. However, adding the jdkVersion input made no difference.
```
- task: SonarCloudAnalyze@1
  inputs:
    jdkversion: 'JAVA_HOME_17_X64'
```
I also tried using the JavaToolInstaller step to install Java 17 prior, and I got this error:
```
##[error]Java 17 is not preinstalled on this agent
```
Now, as I said, I KNOW it’s installed. So, what’s going on?

All about environment

The build agent has the proper environment variables set. As it turns out, however, the build agent needs some special setup. Some research on my end led me to Microsoft’s page on the Azure DevOps Linux Agents, specifically, the section on environmental variables.

I checked my .env file in my agent directory, and it had a scrawny 5-6 entries. As a test, I added JAVA_HOME_17_X64 with a proper path as an entry in that file and restarted the agent. Presto! No more errors, and Sonar Analysis ran fine.

Scripting for future agents

With this in mind, I updated the script that installs my ADO build agent to include steps to copy environment variables from the machine to the .env file for the agent, so that the agent knows what is on the machine. After a couple tests (forgot a necessary sudo), I have a working provisioning script.
February 5, 2024
A Tale of Two Proxies
I am working on building a set of small reference applications to demonstrate some of the patterns and practices to help modernize cloud applications. In configuring all of this in my home lab, I spent at least 3 hours fighting a problem that turned out to be a configuration issue.

Backend-for-Frontend Pattern

I will get into more details when I post the full application, but I am trying to build out a SPA with a dedicated backend API that would host the SPA and take care of authentication. As is typically the case, I was able to get all of this working on my local machine, including the necessary proxying of calls via the SPA’s development server (again, more on this later).

At some point, I had two containers ready to go: a BFF container hosting the SPA and the dedicated backend, and an API container hosting a data service. I felt ready to deploy to the Kubernetes cluster in my lab.

Let the pain begin!

I have enough samples within Helm/Helmfile that getting the items deployed was fairly simple. After fiddling with the settings of the containers, things were running well in the non-authenticated mode.

However, when I clicked login, the following happened:
1. I was redirected to my oAuth 2.0/OIDC provider.
2. I entered my username/password
3. I was redirected back to my application
4. I got a 502 Bad Gateway screen
502! But, why? I consulted Google and found any number of articles indicating that, in the authentication flow, Nginx’s default header size limit is too small to limit what might be coming back from the redirect. So, consulting the Nginx configuration documents, I changed the Nginx configuration in my reverse proxy to allow for larger headers.

No luck. Weird. In the spirit of true experimentation (change one thing at a time), I backed those changes out and tried changing the configuration of my Nginx Ingress controller. No luck. So what’s going on?

Too Many Cooks

My current implementation looks like this:
```
flowchart TB
    A[UI] --UI Request--> B(Nginx Reverse Proxy)
    B --> C("Kubernetes Ingress (Nginx)")
    C --> D[UI Pod]
```
There are two Nginx instances between all of my traffic: an instance outside of the cluster that serves as my reverse proxy, and an Nginx ingress controller that serves as the reverse proxy within the cluster.

I tried changing both separately. Then I tried changing both at the same time. And I was still seeing this error. As it turns out, well, I was being passed some bad data as well.

Be careful what you read on the Internet

As it turns out, the issue was the difference in configuration between the two Nginx instances and some bad configuration values that I got from old internet articles.

Reverse Proxy Configuration

For the Nginx instance running on Ubuntu, I added the following to my nginx.conf file under the http section:
```
        proxy_buffers 4 512k;
        proxy_buffer_size 256k;
        proxy_busy_buffers_size 512k;
        client_header_buffer_size 32k;
        large_client_header_buffers 4 32k;
```
Nginx Ingress Configuration

I am running RKE2 clusters, so configuring Nginx involves a HelmChartConfig resource being created in the kube-system namespace. My cluster configuration looks like this:
```
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-ingress-nginx
  namespace: kube-system
spec:
  valuesContent: |-
    controller:
      kind: DaemonSet
      daemonset:
        useHostPort: true
      config:
        use-forwarded-headers: "true"
        proxy-buffer-size: "256k"
        proxy-buffers-number: "4"
        client-header-buffer-size: "256k"
        large-client-header-buffers: "4 16k"
        proxy-body-size: "10m"
```
The combination of both of these settings got my redirects to work without the 502 errors.

Better living through logging

One of the things I fought with on this was finding the appropriate logs to see where the errors were occurring. I’m exporting my reverse proxy logs into Loki using a Promtail instance that listens on a syslog port. So I am “getting” the logs into Loki, but I couldn’t FIND them.

I forgot about the facility in syslog: I have the access logs sending as local5, but did configured the error logs without pointing them to local5. I learned that, by default, they go to local7.

Once I found the logs I was able to diagnose the issue, but I spent a lot of time browsing in Loki looking for those logs.
January 23, 2024
Re-configuring Grafana Secrets
I recently fixed some synchronization issues that had been silently plaguing some of the monitoring applications I had installed, including my Loki/Grafana/Tempo/Mimir stack. Now that the applications are being updated, I ran into an issue with the latest Helm chart’s handling of secrets.

Sync Error?

After I made the change to fix synchronization of the Helm charts, I went to sync my Grafana chart, but received a sync error:
```
Error: execution error at (grafana/charts/grafana/templates/deployment.yaml:36:28): Sensitive key 'database.password' should not be defined explicitly in values. Use variable expansion instead.
```
I certainly didn’t change anything in those files, and I am already using variable expansion in the values.yaml file anyway. What does that mean? Basically, in the values.yaml file, I used ${ENV_NAME} in areas where I had a secret value, and told Grafana to expand environment variables into the configuration.

The latest version of Helm doesn’t seem to like this. It views ANY value in secret fields to be bad. A search of the Grafana Helm Chart repo’s issues list yielded someone with a similar issue and a comment with a link to another comment that is the recommended solution.

Same Secret, New Name

After reading through the comment’s suggestion and Grafana’s documentation on overriding configuration with environment variables, I realized the fix was pretty easy.

I already had a Kubernetes secret being populated from Hashicorp Vault with my secret values. I also already had envFromSecret set in the values.yaml to instruct the chart to use my secret. And, through some dumb luck, two of the three values were already named using the standards in Grafana’s documentation.

So the “fix” was to simply remove the secret expansions from the values.yaml file, and rename one of the secretKey values so that it matched Grafana’s environment variable template. You can see the diff of the change in my Github repository.

With that change, the Helm chart generated correctly, and once Argo had the changes in place, everything was up and running.
January 16, 2024
Synced, But Not: ArgoCD Differencing Configuration
Some of the charts in my Loki/Grafana/Tempo/Mimir stack have an odd habit of not updating correctly in ArgoCD. I finally got tired of it and fixed it… I’m just not 100% sure how.

Ignoring Differences

At some point in the past, I had customized a few of my Application objects with ignoreDifferences settings. It was meant to tell ArgoCD to ignore fields that are managed by other things, and could change from the chart definition.

Like what, you might ask? Well, the external-secrets chart generates it’s own caBundle and sets properties on a ValidatingWebhookConfiguration object. Obviously, that’s managed by the controller, and I don’t want to mess with it. However, I also don’t want ArgoCD to report that chart as Out of Sync all the time.

So, as an example, my external-secrets application looks like this:
```
project: cluster-tools
source:
  repoURL: 'https://github.com/spydersoft-consulting/ops-argo'
  path: cluster-tools/tools/external-secrets
  targetRevision: main
destination:
  server: 'https://kubernetes.default.svc'
  namespace: external-secrets
syncPolicy:
  syncOptions:
    - CreateNamespace=true
    - RespectIgnoreDifferences=true
ignoreDifferences:
  - group: '*'
    kind: '*'
    managedFieldsManagers:
      - external-secrets
```
And that worked just fine. But, with my monitor stack, well, I think I made a boo-boo.

Ignoring too much

When I looked at the application differences for some of my Grafana resources, I noticed that the live vs desired image was wrong. My live image was older than the desired one, and yet, the application wasn’t showing as out of sync.

At this point, I suspected ignoreDifferences was the issue, so I looked at the Application manifest. For some reason, my monitoring applications had an Application manifest that looked like this:
```
project: external-apps
source:
  repoURL: 'https://github.com/spydersoft-consulting/ops-internal-cluster'
  path: external/monitor/grafana
  targetRevision: main
  helm:
    valueFiles:
      - values.yaml
    version: v3
destination:
  server: 'https://kubernetes.default.svc'
  namespace: monitoring
syncPolicy:
  syncOptions:
    - RespectIgnoreDifferences=true
ignoreDifferences:
  - group: "*"
    kind: "*"
    managedFieldsManagers:
    - argocd-controller
  - group: '*'
    kind: StatefulSet
    jsonPointers:
      - /spec/persistentVolumeClaimRetentionPolicy
      - /spec/template/metadata/annotations/'kubectl.kubernetes.io/restartedAt'
```
Notices the part where I am ignoring managed fields from argocd-controller. I have no idea why I added that, but, it looks a little “all inclusive” for my tastes, and it was ONLY present in the ApplicationSet for my LGTM stack. So I commented it out.

Now We’re Cooking!

Lo and behold, ArgoCD looked at my monitoring stack and said “well, you have some updates, don’t you!” I spent the next few minutes syncing those applications individually. Why? There are a lot of hard working pods in those applications, I don’t like to cycle them all at once.

I searched through my posts and some of my notes, and I honestly have no idea why I decided I should ignore all fields managed by argocd-controller. Needless to say, I will not be doing that again.
January 11, 2024
Proxy Down!
A simple package updates seems to have adversely affected the network on my Banana Pi proxy server. I wish this was a post on how I solved it, but, alas, I haven’t solved it.

It was a simple upgrade…

That statement has been uttered prior to massive outages a time or two. I logged in to my proxy server, ran apt update and apt upgrade, and restarted the device.

Now, this is a headless device, so I tried re-connecting via SSH a few times. Nothing. I grabbed the device out of the server closet and connected it to a small “workstation” I have in my office. I booted it up, and, after a pretty long boot cycle, it came up, but with the network adapter down.

Why’s the network down?

Running ip addr show displayed eth0 as being down. But, well, I have no idea why. And, since this is something of a production device, I manually enabled as follows:
```
# enable the link
sudo ip link set eth0 up
# restart networking
sudo systemctl restart systemd-networkd
# restart nginx
sudo systemctl restart nginx
```
This all worked and the network was back up and running. But what happened?

Digging through the syslog file, I came across this:
```
Jan  8 14:09:28 m5proxy systemd-udevd[2204]: eth0: Failed to query device driver: Device or resource busy
```
A little research yielded nothing really fruitful. I made sure that everything was setup correctly, including cloud-init and netplan, but nothing worked.

What Now?

I am in a weird spot. The proxy is working, and it’s such a vital piece of my current infrastructure that I do not currently have the ability to “play with it.”

What I should probably do is create a new proxy machine, either with one of my spare Raspberry PIs or just a VM on the server. Then I can transfer traffic to the new proxy and diagnose this one without fear of downtime.

Before I do that, though, I’m going to do some more research into the error above and see if I can tweak some of the configuration to get things working on a reboot. Regardless, I will post a solution when one comes about.
January 9, 2024
More GitOps Fun!
I have been curating some scripts that help me manage version updates in my GitOps repositories… It’s about time they get shared with the world.

What’s Going On?

I manage the applications in my Kubernetes clusters using Argo CD and a number of Git repositories. Most of the ops- repositories act as “desired state” repositories.

As part of this management, I have a number of external tools running in my clusters that are installed using their Helm charts. Since I want to keep my installs up to date, I needed a way to update the Helm chart versions as new releases came out.

However.. some external tools do not have their own Helm charts. For that, I have been using a Helm library chart from bjw-s. In that case, I have had to manually find new releases and update my values.yaml file.

While I have had the Helm chart version updates automated for some time, I just recently got around to updating the values.yaml file from external sources. Now is a good time to share!

The Scripts

I put the scripts in the ops-automation repository in the Spydersoft organization. I’ll outline the basics of each script, but if you are interested in the details, check out the scripts themselves.

It is worth nothing that these scripts require the git and helm command line tools to be installed, in addition to the Powershell Yaml module.

Also, since I manage more than one repository, all of these scripts are designed to be given a basePath and then a list of directory names for the folders that are the Git repositories I want to update.

Update-HelmRepositoryList

This script iterates through the given folders to find the chart.yaml files in it. For every dependency in the found chart files, it adds the repository to the local helm if the URL does not already exist.

Since I have been running this on my local machine, I only have to do this once. But, on a build agent, this script should be run every time to make sure the repository list contains all the necessary repositories for an update.

Update-HelmCharts

This script iterates through the given folders to find the chart.yaml files in it. For every dependency, the script determines if there is an updated version of the dependency available.

If there is an update available, the Chart.yaml file is updated, and helm dependency update is run to update the Chart.lock file. Additionally, commit comments are created to note the version changes.

For each chart.yaml file, a call to Update-FromAutoUpdate will be made to make additional updates if necessary.

Update-FromAutoUpdate

This script looks for a file called auto-update.json in the path given. The file has the following format:
```
{
    "repository": "redis-stack/redis-stack",
    "stripVFromVersion": false,
    "tagPath": "redis.image.tag"
}
```
The script looks for the latest release from the repository in Github, using tag_name from Github as the version. If the latest release is newer than the current tagPath in values.yaml, the script then updates the tagPath in the values.yaml file to the new version. The script returns an object indicating whether or not an update was made, as well as a commit comment indicating the version jump.

Right now, the auto-update only works for images that come from Github releases. I have one item (Proget) that needs to search a docker API directly, but that will be a future enhancement.

Future Tasks

Now that these are automated tasks, I will most likely create an Azure Pipeline that runs weekly to get these changes made and committed to Git.

I have Argo configured to not auto-sync these applications, so even though the changes are made in Git, I still have to manually apply the updates. And I am ok with that. I like to stagger application updates, and, in some cases, make sure I have the appropriate backups before running an update. But this gets me to a place where I can log in to Argo and sync apps as I desire.
December 28, 2023
Do the Right Thing
My home lab clusters have been running fairly stable, but there are still some hiccups every now and again. As usual, a little investigation lead to a pretty substantial change.

Cluster on Fire

My production and non-production clusters, which mostly host my own projects, have always been pretty stable. Both clusters are set up with 3 nodes as the control plane, since I wanted more than 1 and you need an odd number for quorum. And since I didn’t want to run MORE machines as agents, I just let those nodes host user workloads in addition to the control plane. With 4 vCPUs and 8 GB of RAM per node, well, those clusters had no issues.

My “internal tools” cluster is another matter. Between Mimir, Loki, and Tempo running ingestion, there is a lot going on in that cluster. I added a 4th node that serves as just an agent for that cluster, but I still had some pod stability issues.

I started digging into the node-exporter metrics for the three “control plane + worker” nodes in the internal cluster, and they were, well, on fire. The system load was consistently over 100% (the 15 minute average was something like 4.05 out of 4 on all three). I was clearly crushing those nodes. And, since those nodes hosted the control plane as well as the workloads, instability in the control plane caused instability in the cluster.

Isolating the Control Plane

At that point, I decided that I could not wait any longer. I had to isolate the control plane and etcd from the rest of my workloads. While I know that it is, in fact, best practice, I was hoping to avoid it in the lab, as it causes a slight proliferation in VMs. How so? Let’s do the math:

Right now, all of my clusters have at least 3 nodes, and internal has 4. So that’s 10 VMs with 4 vCPU and 8 GB of RAM assigned, or 40 vCPUs and 80 GB of RAM. If I want all of my clusters to have isolated control planes, that means more VMs. But…

Control plane nodes don’t need nearly the size if I’m not opening them up to other workloads. And for my non-production cluster, I don’t need the redundancy of multiple control plane nodes. So 4 vCPUs/8GB RAM becomes 2 vCPU/4GB RAM for control plane node, and I can use 1 node for the non-production control plane. But what about the work? To start, I’ll use 2 4 vCPUs/8GB RAM nodes for production and non-production, and 3 of those same node sizes for the internal cluster.

In case you aren’t keeping a running total, the new plan is as follows:
- 7 small nodes (2 vCPU/4GB RAM) for control plane nodes across the three clusters (3 for internal and production, 1 for non-production)
- 7 medium nodes (4 vCPU/8GB RAM) for worker nodes across the three clusters (2 for non-production and production, 3 for internal).
So, it’s 14 VMs, up from 10, but it is only an extra 2 vCPUs and 2 GB of RAM. I suppose I can live with that.

Make it so!

With the scripting of most of my server creation, I made a few changes to support this updated structure. I added a taint to the RKE2 configuration for the server so that only critical items are scheduled.
```
node-taint:
- CriticalAddonsOnly=true:NoExecute
```
I also removed any server nodes from the tfx-<cluster name> DNS record, since the Nginx pods will only run on agent nodes now.

Once that was done, I just had to provision new agent nodes for each of the clusters, and then replace the current server nodes with newly provisioned nodes that have a smaller footprint and the appropriate taints.

It’s worth noting, in order to prevent too much churn, I manually added the above taint to each existing server node AFTER I had all the agents provisioned but before I started replacing server nodes. That way, Kubernetes would not attempt to schedule a user workload coming off the old server onto another server node, but instead force it on to an agent. For your reference and mine, that command looks like this:
```
kubectl taint nodes <node name> CriticalAddonsOnly=true:NoSchedule
```
Results

I would classify this as a success with an asterisk next to it. I need more time to determine if the cluster stability, particularly for the internal cluster, improves with these changes, so I am not willing to declare outright victory.

It has, however, given me a much better view into how much processing I actually need in a cluster. For my non-production cluster, the two agents are averaging under 10% load, which means I could probably lose one agent and still be well under 50% load on that node. The production agents are averaging about 15% load. Sure, I could consolidate, but part of the desire is to have some redundancy, so I’ll stick with two agents in production.

The internal cluster, however, is running pretty hot. I’m running a number of pods for Grafana Mimir/Loki/Tempo ingestion, as well as Prometheus on that cluster itself. So those three nodes are running at about 50-55% average load, with spikes above 100% on the one agent that is running both the Prometheus collector and a copy of the Mimir ingester. I’m going to keep an eye on that and see if the load creeps up. In the meantime, I’ll also be looking to see what, if anything, can be optimized or offloaded. If I find something to fix, you can be sure it’ll make a post.
December 2, 2023
Git Out! Migrating to GitHub
Git is Git. Wherever it’s hosted, the basics are the same. But the features and community around tools has driven me to make a change.

Starting Out

My first interactions with Git happened around 2010, when we decided to move away from Visual SourceSafe and Subversion and onto Git. At the time, some of the cloud services were either in their infancy or priced outside of what our small business could absorb. So we stood up a small Git server to act as our centralized repository.

The beauty of Git is that, well, everyone has a copy of the repository locally, so it’s a little easier to manage the backup and disaster recovery aspects of a centralized Git server. So the central server is pretty much a glorified file share.

To the Cloud!

Our acquisition opened up access to some new tools, including Bitbucket Cloud. We quickly moved our repositories to Bitbucket Cloud so that we could decommission our self-hosted server.

Personally, I started storing my projects in Bitbucket Cloud. Sure, I had a GitHub account. But I wasn’t ready for everything to be public, and Bitbucket Cloud offered unlimited private repos. At the time, I believe GitHub was charging for private repositories.

I also try to keep my home setup as close to work as possible in most cases. Why? Well, if I am working on a proof of concept that involves specific tools and their interaction with one another, it’s nice to have a sandbox that I can control. My home lab ecosystem has evolved based on the ecosystem at my job:
- Self-hosted Git / TeamCity
- Bitbucket Cloud / TeamCity
- Bitbucket Cloud / Azure DevOps
- Bitbucket Cloud / Azure DevOps / ArgoCD
To the Hub!

Even before I changed jobs, a move to GitHub was in the cards, both personally and professionally.

Personally, as a community, I cannot think of a more popular platform than GitHub for sharing and finding open/public code. My GitHub profile is, in a lot of ways, a portfolio of my work and contributions. As I have started to invest more time into open source projects, my portfolio has grown. Even some of my “throw away” projects are worth a little, if only as a reference for what to do and what not to do.

Professionally, GitHub has made a great many strides in its Enterprise offering. Microsoft’s acquisition only pushed to give GitHub access to some of the CI/CD Pipeline solutions that Azure DevOps has, coupled with the ease of use of GitHub. One of the projects on the horizon at my old company was to identify if GitHub and GitHub actions could be the standard for build and deploy moving forward.

With my move, we have a mix of ecosystem: GitHub + Azure DevOps Pipelines. I would like to think, long term, I could get to GitHub + GitHub Actions (at least at home), the interoperability of Azure DevOps Pipelines with Azure itself makes it hard to migrate completely. So, with a new professional ecosystem in front of me, I decided it was time to drop BitBucket Cloud and move to GitHub for everything.

Organize and Move

Moving the repos is, well, simple. Using GitHub’s Import functionality, I pointed at my old repositories, entered my BitBucket Cloud username and personal access token, and GitHub imported it.

This simplicity meant I had time to think about organization. At this point, I am using GitHub for two pretty specific types of projects:
- Storage for repositories, either public or private, that I use for my own portfolio or personal projects.
- Storage for repositories, all public, that I have published as true Open Source projects.
I wanted to separate the projects into different organizations, since the hope is the true Open Source projects could see contributions from others in the future. So before I started moving everything, I created a new GitHub organization. As I moved repositories from BitBucket Cloud, I put them in either my personal GitHub space or this new organization space, based on their classification above. I also created a new SonarCloud organization to link to the new GitHub organization.

All Moved In!

It really only took about an hour to move all of my repositories and re-configure any automation that I had to point to GitHub. I setup new scans in the new SonarCloud organization and re-pointed the actions correctly, and everything seems to be working just fine.

With all that done, I deleted my BitBucket Cloud workspaces. Sure, I’m still using Jira Cloud and Confluence Cloud, but I am at least down a cloud service. Additionally, since all of the projects that I am scanning with Sonar are public, I moved them to SonarCloud and deleted my personal instance of SonarQube. One less application running in the home lab.
November 29, 2023
Stacks on Stacks!
I have Redis installed at home as a simple caching tool. Redis Stack adds on to Redis OSS with some new features that I am eager to start learning. But, well, I have to install it first.

Charting a Course

I have been using the Bitnami Redis chart to install Redis on my home K8 cluster. The chart itself provides the necessary configuration flexibility for replicas and security. However, Bitnami does not maintain a similar chart for redis-stack or redis-stack-server.

There are some published Helm charts from Redis, however, they lack the built-in flexibility and configurability that the Bitnami charts provide. The Bitnami chart is so flexible, I wondered if it was possible to use it with the redis-stack-server image. A quick search showed I was not the only person with this idea.

New Image

Gerk Elznik posted last year about deploying Redis Stack using Bitnami’s Redis chart. Based on this post, I made attempted to customize the Bitnami chart to use the redis-stack-server image. Gerk’s post indicated that a new script was needed to successfully start the image. That seemed like an awful lot of work, and, well, I really didn’t want to do that.

In the comments of Gerk’s post, Kamal Raj posted a link to his version of the Bitnami Redis Helm chart, modified for Redis Stack. This seemed closer to what I wanted: a few tweaks and off to the races.

In reviewing Kamal’s changes, I noticed that everything he changed could be overridden in the values.yaml file. So I made a few changes to my values file:
1. Added repository and tag in the redis.image section, pointing the chart to the redis-stack-server image.
2. Updated the command for both redis.master and redis.replica to reflect Kamal’s changes.
I ran a quick template, and everything looked to generate correctly, so I committed the changes and let ArgoCD take over.

Nope….

ArgoCD synchronized the stateful set as expected, but the pod didn’t start. The error in the K8 events was about “command not found.” So I started digging into the “official” Helm Chart for the redis-stack-server image.

That chart is very simple, which means it was pretty easy to see that there was no special command for startup. So, I started to wonder if I really needed to override the command, or simply use the redis-stack-server in place of the default image.

So I commented out the custom overrides to the command settings for both master and replica, and committed those changes. Lo and behold, ArgoCD synced and the pod started up great!

What Matters Is, Does It Work?

Excuse me for stealing from Celebrity Jeopardy, but “Gussy it up however you want, Trebek, what matters is, does it work?” For that, I needed a Redis client.

Up to this point, most of my interactions with Redis have simply been through the redis-cli that’s installed on the image. I use kubectl to get into the pod and run redis-cli in the pod to see what keys are in the instance.

Sure, that works fine, but as I start to dig into to Redis a bit more, I need a client that lets me visualize the database a little better. As I was researching Redis Stack, I came across RedisInsight, and thought it was worth a shot.

After installing RedisInsight, I set up port forwarding on my local machine into the Kubernetes service. This allows me to connect directly to the Redis instance without creating a long term service on Node Port or some other forwarding mechanism. Since I only need access to the Redis server within the cluster, this helps me secure it.

I got connected, and the instance shows. But, no modules….

More Hacking Required

As it turns out, the Bitnami Redis chart changes the startup command to a script within the chart. This allows some of the flexibility, but comes at the cost of not using the entrypoint scripts that are in the image. Specifically, the entrypoint script for redis-stack-server, which uses the command line to load the modules.

Now what? Well, there’s more than one way to skin a cat (to use an arcane and cruel sounding metaphor). Reading through the Redis documentation, you can also load modules through the configuration. Since the Bitnami Redis chart allows you to add to the configuration using the values.yaml file, that’s where I ended up. I added the following to my values.yaml file:
```
master:
    configuration: | 
      loadmodule /opt/redis-stack/lib/redisearch.so MAXSEARCHRESULTS 10000 MAXAGGREGATERESULTS 10000
      loadmodule /opt/redis-stack/lib/redistimeseries.so
      loadmodule /opt/redis-stack/lib/rejson.so
      loadmodule /opt/redis-stack/lib/redisbloom.so
```
With those changes, I now see the appropriate modules running.

Lots Left To Do

As I mentioned, this seems pretty “hacky” to me. Right now, I have it running, but only in standalone mode. I haven’t had the need to run a full Redis cluster, but I’m SURE that some additional configuration will be required to apply this to running a Redis Stack cluster. Additionally, I could not get the Redis Gears module loaded, but I did get Search, JSON, Time Series, and Bloom installed.

For now, that’s all I need. Perhaps if I find I need Gears, or I want to run a Redis cluster, I’ll have to revisit this. But, for now, it works. The full configuration can be found in my non-production infrastructure repository. I’m sure I’ll move to production, but everything that happens here happens in non-production first, so keep tabs on that if you’d like to know more.
November 21, 2023