  • Lessons in Managing my Kubernetes Cluster: Man Down!

    I had a bit of a panic this week as routine tasks took me down a rabbit hole in Kubernetes. The more I manage my home lab clusters, the more I realize I do not want to be responsible for bare metal clusters at work.

    It was a typical upgrade…

    With ArgoCD in place, the contents of my clusters are very neatly defined in my various infrastructure repositories. I even have a small PowerShell script that checks for the latest versions of the tools I have installed and updates my repositories with the latest and greatest.

    I ran that script today and noticed a few minor updates to some of my charts, so I figured I would apply those at lunch. Typically, the sync is very smooth and the updates are applied in a few minutes.

    However, after letting ArgoCD sync, I realized that one of the Helm charts I was upgrading was stuck in a “Progressing” state. I checked the pods, and they were in a perpetual “Pending” state.

    My typical debugging found nothing: there were no events explaining why the pods could not be scheduled, and there were no logs, since the containers were never even created.
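
    Nothing fancy about the debugging itself; it was roughly the usual first steps for a Pending pod (the namespace and pod names here are placeholders):

    ```bash
    # Usual first look at a pod stuck in Pending (names are placeholders)
    kubectl -n my-namespace describe pod my-stuck-pod        # Events section had nothing useful
    kubectl -n my-namespace get events --sort-by=.lastTimestamp
    kubectl -n my-namespace logs my-stuck-pod                # nothing here either: the containers never started
    ```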

    My first Google searches suggested a problem with persistent volumes/persistent volume claims. So I poked around in those, going so far as deleting them (after backing up the folders in my NFS target), but, well, no luck.

    And there it was…

    As it was “unschedulable,” I started trying to find the logs for the scheduler. I could not find them. So I logged in to my control plane node to see if the scheduler was even running… It was not.

    I checked the logs for that container, and nothing stood out. It just kind of “died.” I tried restarting the Docker container…. Nope. I even tried re-running the rke up command for that cluster, to see if Rancher Kubernetes Engine would restart it for me…. Nope.
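    For reference, RKE runs the control plane components as plain Docker containers on the node, so the checking looked roughly like this (run on the control plane node itself):

    ```bash
    # RKE names the control plane containers after their components
    docker ps -a --filter name=kube-scheduler    # confirmed the scheduler container was not running
    docker logs --tail 100 kube-scheduler        # nothing obviously wrong in the logs
    docker restart kube-scheduler                # did not bring it back for me
    ```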

    So, running out of options, I changed my cluster.yml configuration file to add a control plane role to another node in the cluster and re-ran rke up. And that, finally, worked. At that point, I removed the control plane role from the old control plane node and modified the DNS entries to point my hostnames to the new control plane node. With that, everything came back up.
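
    The change itself is just a matter of editing the node roles in cluster.yml; a trimmed sketch with placeholder hostnames looks something like this:

    ```yaml
    # cluster.yml (trimmed, placeholder hostnames)
    nodes:
      - address: node1.lab.local              # the original control plane node; role removed afterwards
        user: ubuntu
        role: [controlplane, etcd, worker]
      - address: node2.lab.local
        user: ubuntu
        role: [controlplane, etcd, worker]    # controlplane added here, then `rke up`
    ```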

    Monitoring?

    I wanted to write an alert in Mimir to catch this so I would know about it before I dug around. It was at this point I realized that I am not collecting any metrics from the Kubernetes components themselves. And, frankly, I am not sure how. RKE installs metrics-server by default, but that only serves the resource metrics behind kubectl top; I have not found a way to scrape metrics from the Kubernetes components like the scheduler.

    My Google searches have been fruitless, and it has been a long work week, so this problem will most likely have to wait for a bit. If I come up with something I will update this post.

  • Hitting for the cycle…

    Well, I may not be hitting for the cycle, but I am certainly cycling Kubernetes nodes like it is my job. The recent OpenSSL security patches got me thinking that I need to cycle my cluster nodes.

    A Quick Primer

    In Kubernetes, a “node” is, well, a machine performing some bit of work. It could be a VM or a physical machine, but, at its core, it is a machine with a container runtime of some kind that runs Kubernetes workloads.

    Using the Rancher Kubernetes Engine (RKE), each node can have one or more of the following roles:

    • worker – The node can host user workloads
    • etcd – The node is a member of the etcd storage cluster
    • controlplane – The node is a member of the control plane

    Cycle? What and why

    When I say “cycling” the nodes, I am actually referring to the process of provisioning a new node, adding the new node to the cluster, and removing an old node from the cluster. I use the term “cycling” because, when the process is complete, my cluster is “back where it started” in terms of resourcing, but with a fresh and updated node.

    But, why cycle? Why not just run the necessary security patches on my existing nodes? In my view, even at the home lab level, there are a few reasons for this approach.

    A Clean and Consistent Node

    As nodes get older, they incur the costs of age. They may have some old container images from previous workloads, or some cached copies of various system packages. In general, they collect stuff, and a fresh node has none of that cost. By provisioning new nodes, we can ensure that the latest provisioning is run and all the necessary updates are installed.

    Using newly provisioned nodes each time also prevents me from applying special configuration to individual nodes. If I need a particular configuration on a node, well, it has to be done in the provisioning scripts. All my cluster nodes are provisioned the same way, which makes them much more like cattle than pets.

    No Downtime or Capacity Loss

    Running node maintenance can potentially require a system reboot or service restarts, which can trigger downtime. In order to prevent downtime, it is recommended that nodes be “cordoned” (prevent new workloads from being scheduled on that node) and “drained” (remove existing workloads from the node).

    Kubernetes is very good at scheduling and moving workloads between nodes, and there is built-in functionality for cordoning and draining nodes. However, to prevent downtime, I have to take a node out of service for maintenance, which means I lose some cluster capacity while the maintenance runs.
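
    The built-in commands are straightforward (the hostname is a placeholder):

    ```bash
    kubectl cordon node3.lab.local       # stop scheduling new workloads on the node
    kubectl drain node3.lab.local --ignore-daemonsets --delete-emptydir-data
    # ...patch, reboot, or remove the node...
    kubectl uncordon node3.lab.local     # only if the node is coming back into service
    ```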

    When you think about it, to do a zero-downtime update, your choices are:

    1. Take a node out of the cluster, upgrade it, then add it back
    2. Add a new node to the cluster (already upgraded) then remove the old node.

    So, applying the “cattle” mentality to our infrastructure, it is preferred to have disposable assets rather than precisely manicured assets.

    RKE performs this process for you when you remove a node from the cluster. That means running rke up will remove old nodes once they have been taken out of your cluster.yml.

    To Do: Can I automate this?

    As of right now, this process is pretty manual, and goes something like this:

    1. Provision a new node – This part is automated with Packer, but I really only like doing one at a time. I am already running 20+ VMs on my hypervisor, and I do not like the thought of spiking to 25+ just to cycle a node.
    2. Remove nodes, one role at a time – I have found that RKE is most stable when you only remove one role at a time. What does that mean? It means that if I have a node with all three roles, I need to remove the control plane role, then run rke up. Then remove the etcd role, and run rke up again. Then remove the node completely (see the sketch after this list). I have not had good luck simply removing a node with all three roles.
    3. Ingress Changes – I need to change the IPs on my cluster in two places:
      • In my external load balancer, which is a Raspberry Pi running nginx.
      • In my nginx-ingress installation on the cluster. This is done via my GitOps repository.
    4. DNS Changes – I have aliases set up for the control plane nodes so that I can swap them in and out easily without changing other configurations. When I cycle a control plane node, I need to update the DNS.
    5. Delete Old Node – I have a small PowerShell script for this, but it is another step.
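
    For step 2, the role-at-a-time removal boils down to a repeated edit-and-apply loop; a rough sketch (the file and node names are whatever is in your cluster.yml):

    ```bash
    # 1. edit cluster.yml: drop 'controlplane' from the old node's role list
    rke up --config cluster.yml
    # 2. edit cluster.yml: drop 'etcd' from the old node's role list
    rke up --config cluster.yml
    # 3. edit cluster.yml: delete the old node's entry entirely
    rke up --config cluster.yml
    ```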

    It would be wonderful if I could automate this into an Azure DevOps pipeline, but there are some problems with that approach.

    1. Packer’s Hyper-V builder has to run on the host machine, which means I need to be able to execute the Packer build commands on my server. I’m not about to put the Azure DevOps agent directly on my server.
    2. I have not found a good way to automate DNS changes, outside of using the PowerShell module.
    3. I need to automate the IP changes in the external proxy and the cluster ingress, both of which are possible but would require some research on my end.
    4. I would need to automate the RKE actions, specifically, adding new nodes and deleting roles/nodes from the cluster, and then running rke up as needed.

    None of the above is impossible, however, it would require some effort on my part to research some techniques and roll them up into a proper pipeline. For the time being, though, I have set my “age limit” for nodes at 60 days, and will continue the manual process. Perhaps, after a round or two, I will get frustrated enough to start the automation process.

  • What’s in a (Release) Name?

    With Rancher gone, one of my clusters was dedicated to running Argo and my standard cluster tools. Another cluster has now become home for a majority of the monitoring tools, including the Grafana/Loki/Mimir/Tempo stack. That second cluster was running a little hot in terms of memory and CPU. I had 6 machines running what 4-5 machines could be doing, so a consolidation effort was in order. The process went fairly smoothly, but a small hiccup threw me for a frustrating loop.

    Migrating Argo

    Having configured Argo to manage itself, I was hoping that a move to a new cluster would be relatively easy. I was not disappointed.

    For reference:

    • The ops cluster is the cluster that was running Rancher and is now just running Argo and some standard cluster tools.
    • The internal cluster is the cluster that is running my monitoring stack and is the target for Argo in this migration.

    Migration Process

    • I provisioned two new nodes for my internal cluster, and added them as workers to the cluster using the rke command line tool.
    • To prevent problems, I updated Argo CD to the latest version and disabled auto-sync on all apps.
    • Within my ops-argo repository, I changed the target cluster of my internal applications to https://kubernetes.default.svc. Why? Argo treats the cluster that it is installed on a bit differently than external clusters. Since I was moving Argo onto my internal cluster, the references to my internal cluster needed to change.
    • I exported my ArgoCD resources using the Argo CD CLI. I was not planning on using the import, however, because I wanted to see how this process would work without it.
    • I modified my external Nginx proxy to point my Argo address from the ops cluster to the internal cluster. This essentially locked everything out of the old Argo instance, but let it run on the cluster should I need to fall back.
    • Locally, I ran helm install argocd . -n argo --create-namespace from my ops-argo repository. This installed Argo on my internal cluster in the argo namespace. I grabbed the newly generated admin password and saved it in my password store.
    • Locally, I ran helm install argocd-apps . -n argo from my ops-argo repository. This installs the Argo Application and Project resources which serve as the “app of apps” to bootstrap the rest of the applications.
    • I re-added my nonproduction and production clusters to be managed by Argo using the argocd cluster add command. As the cluster URLs were the same as they were in the old Argo instance, the apps matched up nicely.
    • I re-added the necessary labels to each cluster’s ArgoCD secret. This allows the cluster generator to create applications for my external cluster tools (sketched below). I detailed some of this in a previous article, and the ops-argo repository has the ApplicationSet definitions to help.
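
    To illustrate the pattern (the label name, project, repo URL, and chart path below are my own placeholders, not the exact contents of the ops-argo repository): the cluster generator only stamps out Applications for cluster secrets that carry the expected label.

    ```yaml
    # Hypothetical ApplicationSet using the cluster generator; the `cluster-tools: "true"`
    # label is the one re-added to each cluster's ArgoCD secret after `argocd cluster add`.
    apiVersion: argoproj.io/v1alpha1
    kind: ApplicationSet
    metadata:
      name: cluster-tools
      namespace: argo
    spec:
      generators:
        - clusters:
            selector:
              matchLabels:
                cluster-tools: "true"
      template:
        metadata:
          name: 'metrics-server-{{name}}'      # {{name}} / {{server}} come from the cluster secret
        spec:
          project: cluster-tools
          source:
            repoURL: https://github.com/example/ops-argo.git   # placeholder URL
            path: charts/metrics-server
            targetRevision: HEAD
          destination:
            server: '{{server}}'
            namespace: kube-system
          syncPolicy:
            automated: {}
    ```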

    And that was it… Argo found the existing applications on the other clusters, and after some re-sync to fix some labels and annotations, I was ready to go. Well, almost…

    What happened to my metrics?

    Some of my metrics went missing. Specifically, many of those that were coming from the various exporters in my Prometheus install. When I looked at Mimir, the metrics just stopped right around the time of my Argo upgrade and move. I checked the local Prometheus install, and noticed that those targets were not part of the service discovery page.

    I did not expect Argo’s upgrade to change much, so I did not take notice of the changes. So, I was left digging around in my Prometheus and ServiceMonitor instances to figure out why they were not showing up.

    Sadly, this took way longer than I anticipated. Why? There were a few reasons:

    1. I have a home improvement project running in parallel, which means I have almost no time to devote to researching these problems.
    2. I did not have Rancher! One of the things I did use Rancher for was to easily view the different resources in the cluster and compare. I finally remembered that I could use Lens for a GUI into my clusters, but sadly, it was a few days into the ordeal.
    3. Everything else was working; I was just missing a few metrics. On my own personal severity scale, this was low. On the other hand, it drove my OCD crazy.

    Names are important

    After a few days of hunting around, I realized that the ServiceMonitor’s matchLabels did not match the labels on the Service objects themselves. That was odd, because I had not changed anything in those charts, and they are all part of the Bitnami kube-prometheus Helm chart that I am using.

    As I poked around, I realized that I was using the releaseName property on the Argo Application spec. After some searching, I found the disclaimer on Argo’s website about using releaseName. As it turns out, this disclaimer describes exactly the issue that I was experiencing.
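
    For context, the override lived under the Helm source of the Argo Application spec; a trimmed, hypothetical example (the repo URL and chart path are placeholders) of what I ended up removing:

    ```yaml
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: kube-prometheus
      namespace: argo
    spec:
      project: cluster-tools
      source:
        repoURL: https://github.com/example/ops-argo.git   # placeholder URL
        path: charts/kube-prometheus
        targetRevision: HEAD
        helm:
          releaseName: kube-prometheus   # the override Argo's docs warn about; removing it was the fix
      destination:
        server: https://kubernetes.default.svc
        namespace: monitoring
    ```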

    I spent about a minute seeing if I could fix it while keeping the releaseName properties in place, then realized that the easiest path was to remove that releaseName property from the cluster tools that used it. That follows the guidance that Argo suggests, and keeps my configuration files much cleaner.

    After removing that override, the ServiceMonitor resources were able to find their associated services, and showed up in Prometheus’ service discovery. With that, now my OCD has to deal with a gap in metrics…

    Missing Node Exporter Metrics
  • A Lesson in Occam’s Razor: Configuring Mimir Ruler with Grafana

    Occam’s Razor posits “Of two competing theories, the simpler explanation is to be preferred.” I believe my high school biology teacher taught the “KISS” method (Keep It Simple, Stupid) to convey a similar principle. As I was trying to get alerts set up in Mimir using the Grafana UI, I came across an issue that was only finally solved by going back to the simplest answer.

    The Original Problem

    One of the reasons I leaned towards Mimir as a single source of metrics is that it has its own ruler component. This would allow me to store alerts in, well, two places: the Mimir ruler for metrics rules, and the Loki ruler for log rules. Sure, I could use Grafana alerts, but I like being able to run the alerts in the native system.

    So, after getting Mimir running, I went to add a new alert in Grafana. Neither my Mimir nor Loki data source showed up.

    Now, as it turns out, the Loki one was easy: I had not enabled the Loki ruler component in the Helm chart. So, a quick update of the Loki configuration in my GitOps repo, and the Loki ruler was running. However, my Mimir data source was still missing.
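
    The change was as small as it sounds; assuming the loki-distributed chart (the exact keys vary a bit by chart version), it was essentially just turning the component on in the values:

    ```yaml
    # values for the loki-distributed chart (key names may differ by chart version)
    ruler:
      enabled: true   # off by default, so Grafana has no ruler API to manage rules against
    ```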

    Asking for help

    I looked for some time at possible solutions, and posted to the GitHub discussion looking for assistance. Dimitar Dimitrov was kind enough to step in and give me some pointers, including looking at the Chrome network tab to figure out if there were any errors.

    At that point, I first wanted to bury my head in the sand, as, of all the things I looked at, that was not one of the things I had done. But, after getting over my embarrassment, I went about debugging. There appeared to be some errors around the Access-Control-Allow-Origin header not being present, along with a 405 when the browser attempted a preflight OPTIONS request.

    Dimitar suggested that I add that Access-Control-Allow-Origin header via the Mimir Helm chart’s nginx.nginxConfig section. So I did, but I still had the same problems.

    I had previously tried this on Microsoft Edge and it had worked, and had been assuming it was just Chrome being overly strict. However, just for fun, I tried in a different Chrome profile, and it worked…

    Applying Occam’s Razor

    It was at this point that I thought “There is no way that something like this would not have come up with Grafana/Mimir on Chrome.” I would have expected a quick note about Access-Control-Allow-Origin or something similar when hosting Grafana on a different subdomain than Mimir. So I took a longer look at the network log in my instance of Chrome that was still throwing errors.

    There was a redirect in there that was listed as “cached.” I thought, well, that’s odd, why would it cache that redirect? So, as a test, I disabled the cache in the Chrome debugging tools, refreshed the page, and voila! Mimir showed up as a data source for alerts.

    Looking at the resulting successful calls, I noted that ALL of the calls were proxied through grafana.mydomain.local, which made me think “Do I really even need the Access-Control-Allow-Origin headers?” So, I removed those headers from my Mimir configuration, re-deployed, and tested with the caching disabled. It worked like a champ.

    What happened?

    The best answer I can come up with is that, at some point in my clicking around Grafana with Mimir as a data source, Chrome got a 301 redirect response from https://grafana.mydomain.local/api/datasources/proxy/14/api/v1/status/buildinfo, cached it, and used it in perpetuity. Disabling the response cache fixed everything, without the need to further configure Mimir’s Nginx proxy to return special CORS headers.

    So, with many thanks to Dimitar for being my rubber duck, I am now able to start adding new rules for monitoring my cluster metrics.

  • Walking away from Rancher

    I like to keep my home lab running for both hosting sites (like this one) and experimenting with different tools and techniques. It is pretty nice to be able to spin up some containers in a cluster without fear of disturbing others’ work. And, since I do not hold myself to any SLA commitments, it is typically stress-free.

    When I started my Kubernetes learning, having the Rancher UI at my side made it a little easier to see what was going on in the cluster. Sure, I could use kubectl for that, but having a user interface just made things a little nicer while I learned the tricks of kubectl.

    The past few upgrades, though, have not gone smoothly. I believe they changed the way they are handling their CRDs, and that has been messing with some of my Argo management of Rancher. But, as I went about researching a fix, I started to think…

    Do I really need Rancher?

    I came to realize that, well, I do not use Rancher’s UI nearly as often as I had. I rely quite heavily on the Rancher Kubernetes Engine (RKE) command line tool to manage my clusters, but, I do not really use any of the features that Rancher provides. Which features?

    • Rancher’s RBAC is pretty useless for me considering that I am the only one connecting to the clusters, and I use direct connections (not through Rancher’s APIs).
    • Sure, I have single sign-on configured using Azure AD. But I do not think I have logged in to the UI in the last 6 months.
    • With my Argo CD setup, I manage resources by storing the definitions in a Git repository. That means I do not use Rancher for the add-on applications for observability and logging.

    This last upgrade cycle took the cake. I did my normal steps: upgrade the Rancher Helm version, commit, and let Argo do its thing. That’s when things blew up.

    First, the new Rancher images are so big they caused image pull errors due to timeouts. So, after manually pulling the images to the nodes, the containers started up. The logs showed them working on the CRDs. The containers would get through to the “m’s” of the CRD list before timing out and restarting.

    This wreaked havoc on ALL my clusters. I do not know if it was the constant thrashing of the disk or the server CPU, but things just started timing out. Once I realized Rancher was thrashing, and, well, I did not feel like fixing it, I started the process of deleting it.

    Removing Rancher

    After uninstalling the Rancher Helm chart, I ran the Rancher cleanup job that they provide. Even though I am using the Bitnami kube-prometheus chart to install Prometheus on each cluster, the cleanup job deleted all of my ServiceMonitor instances in each cluster. I had to go into Argo and re-sync the applications that were not set to auto-sync in order to recreate the ServiceMonitor instances.

    Benchmarking Before and After

    With Mimir in place, I had collected metrics from those clusters in a single place from both before and after the change. That made creating a few Grafana dashboard panels a lot easier.

    Cluster Memory and CPU – 10/18/2022 09:30 EDT to 10/21/2022 09:30 EDT

    The absolute craziness that started around 10:00 EDT on 10/19 and ended around 10:00 EDT on 10/20 can be attributed to me trying to upgrade Rancher, failing, and then starting to delete Rancher from the clusters. We will ignore that time period, as it is caused by human error.

    Looking at the CPU usage first, there were some unexpected behaviors. I expected a drop in the ops cluster, which was running the Rancher UI, but the drop in CPU in the production cluster was larger than that of the internal cluster. Meanwhile, nonproduction went up, which is really strange. I may dig into those numbers a bit more by individual pod later.

    What struck me more was that, across all clusters, memory usage went up. I have no immediate answers to that one, but, you can be sure I will be digging in a bit more to figure that out.

    So, while Rancher was certainly using some CPU, it would seem that I am seeing increased memory usage without it. When I have answers to that, you’ll have answers to that.

  • Kubernetes Observability, Part 5 – Using Mimir for long-term metric storage

    This post is part of a series on observability in Kubernetes clusters.

    For anyone who actually reads through my ramblings, I am sure they are asking themselves this question: “Didn’t he say Thanos for long-term metric storage?” Yes, yes I did say Thanos. Based on some blog posts from VMware that I had come across, my initial thought was to utilize Thanos as my long-term metrics storage solution. Having worked with Loki first, though, I became familiar with the distributed deployment and storage configuration. So, when I found that Mimir has similar configuration and compatibility, I thought I’d give it a try.

    Additionally, remember, this is for my home lab: I do not have a whole lot of time for configuration and management. With that in mind, using the “Grafana Stack” seemed the expedient solution.

    Getting Started with Mimir

    As with Loki, I started with the mimir-distributed Helm chart that Grafana provides. The charts are well documented, including a Getting Started guide on their website. The Helm chart includes a Minio dependency, but, as I had already set up Minio, I disabled the included chart and configured a new bucket for Mimir.

    As Mimir has all the APIs to stand in for Prometheus, the changes were pretty easy:

    1. Get an instance of Mimir installed in my internal cluster.
    2. Configure my current Prometheus instances to remote-write to Mimir (sketched after this list).
    3. Add Mimir as a data source in Grafana.
    4. Change my Grafana dashboards to use Mimir, and modify those dashboards to filter based on the cluster label that is added.
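
    For step 2, the change is a small values tweak; a hedged sketch, assuming the Bitnami kube-prometheus chart exposes prometheus.remoteWrite (which maps onto the Prometheus CR’s remoteWrite field), with the Mimir gateway service name and tenant ID below being placeholders for my own setup:

    ```yaml
    prometheus:
      remoteWrite:
        - url: http://mimir-nginx.mimir.svc/api/v1/push   # Mimir's Prometheus remote-write endpoint
          headers:
            X-Scope-OrgID: homelab                        # tenant ID; only needed with multi-tenancy enabled
    ```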

    As my GitOps repositories are public, have a look at my Mimir-based chart for details on my configuration and deployment.

    Labels? What labels?

    I am using the Bitnami kube-prometheus Helm chart. That particular chart allows you to define external labels using the prometheus.externalLabels value. In my case, I created a cluster label with unique values for each of my clusters. This allows me to create dashboards with a single data source which can be filtered based on each cluster using dashboard variables.
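
    Concretely, the values look something like this on each cluster (the label values are illustrative):

    ```yaml
    prometheus:
      externalLabels:
        cluster: internal   # "ops", "nonproduction", and "production" on the other clusters
    ```

    A Grafana dashboard variable built from label_values(cluster) then drives the per-cluster filtering.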

    Well…. that was easy

    All in all, it took me probably two hours to get Mimir running and collecting data from Prometheus. It took far less time than I anticipated, and opened up some new doors to reduce my observability footprint, such as:

    1. I immediately reduced the data retention on my local Prometheus installations to three days. This reduced my total disk usage in persistent volumes by about 40 GB.
    2. I started researching using the Grafana Agent as a replacement for a full Prometheus instance in each cluster. Generally, the agent should use less CPU and storage on each cluster.
    3. Grafana Agent is also a replacement for Promtail, meaning I could remove both my kube-prometheus and promtail tools and replace them with an instance of the Grafana Agent. This greatly simplifies the configuration of observability within the cluster.

    So, what’s the catch? Well, I’m tying myself to an opinionated stack. Sure, it’s based on open standards and has the support of the Prometheus community, but, it remains to be seen what features will become “pay to play” within the stack. Near the bottom of this blog post is some indication of what I am worried about: while the main features are licensed under AGPLv3, there are additional features that come with proprietary licensing. In my case, for my home lab, these features are of no consequence, but, when it comes to making decisions for long term Kubernetes Observability at work, I wonder what proprietary features we will require and how much it will cost us in the long run.

  • Getting Synology SNMP data into Prometheus

    With my new cameras installed, I have been spending a lot more time in the Diskstation Manager (DSM). I always forget how much actually goes on within the Synology, and I am reminded of that every time I open the Resource Monitor.

    At some point, I started to wonder whether or not I could get this data into my metrics stack. A quick Google search brought me to the SNMP Exporter and the work Jinwoong Kim has done to generate a configuration file based on the Synology MIBs.

    It took a bit of trial and error to get the configuration into the chart, mostly because I forgot how to make a multiline string in a Go template. I used the prometheus-snmp-exporter chart as a base, and built out a chart in my GitOps repo to have this deployed to my internal monitoring cluster.
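
    The piece I kept fumbling was carrying the generated snmp.yml through the chart as one big string; the key names below follow my own wrapper chart, not the upstream one, but the idea is a YAML block scalar in the values plus an indent filter when re-emitting it in a template:

    ```yaml
    # values.yaml – the generated Synology snmp.yml pasted in as a block scalar
    snmpConfig: |
      # ...contents of the generated snmp.yml...

    # templates/configmap.yaml – re-emitting the string with its indentation intact
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: snmp-exporter-config
    data:
      snmp.yml: |-
        {{- .Values.snmpConfig | nindent 4 }}
    ```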

    After grabbing one of the community Grafana dashboards, this is what I got:

    Security…. Not just yet

    Apparently, SNMP has a few different versions. v1 and v2 are insecure, utilizing a community string for simple authentication. v3 adds stronger authentication and encryption, but that makes configuration more involved.

    Now, my original intent was to enable SNMPv3 on the Synology; however, I quickly realized that meant putting the auth values for SNMPv3 in my public configuration repo, as I haven’t figured out how to include that configuration as a secret (instead of a configmap). So, just to get things running, I settled on enabling v1/v2 with the community string. These SNMP ports are blocked from external traffic, so I’m not too concerned about it, but my to-do list now includes a migration to SNMPv3 for the Synology.

    When you have a hammer…

    … everything looks like a nail. The SNMP Exporter I am currently running only has configuration for Synology, but the default if_mib module that is packaged with the SNMP Exporter can grab data from a number of products which support it. So I find myself looking for “nails” as it were, or objects from which I can gather SNMP data. For the time being, I am content with the Synology data, but don’t be surprised if my network switch and server find themselves on the end of SNMP collection. I would say that my Unifi devices warrant it, but the Unifi Poller I have running takes care of gathering that data.

  • Camera 1, Camera 2…

    First: I hate using acronyms without definitions:

    • NAS – Network Attached Storage. Think “network hard drive”
    • SAN – Storage Area Network. Think “a network of hard drives”. Backblaze explains the differences nicely.
    • iSCSI Target – iSCSI is a way to mount a volume on a NAS or SAN to a server in a way that the server thinks the volume is local.

    When I purchased my Synology DS1517+ in February 2018, I was immediately impressed with the capabilities of the NAS. While the original purpose was to have redundant storage for photos and an iSCSI target for my home lab server, the Synology has quickly expanded its role in my home network.

    One day, as I was browsing the different DiskStation apps, Surveillance Station caught my eye. Prior to this, I had never really thought about getting a video surveillance system. Why?

    • I never liked the idea of one of the closed network systems, and the expense of that dedicated system made it tough to swallow.
    • I really did not like the idea of my video being “in the cloud.” Perhaps that is more paranoia talking, but when I am recording everything that happens around my house, I want a smaller threat surface area.

    So what changed my mind? After shelling out the coin for my Synology and the associated drives, I felt the need to get my money’s worth out of the system. Plus, the addition of a pool in the back yard makes me want to have some surveillance on it for liability purposes.

    Years in the making

    Let’s just say I had no urgency with this project. After purchasing the NAS, a bevy of personal and professional events came up, including, but not limited to, a divorce, a global pandemic, wedding planning (within a global pandemic), and home remodeling.

    When those events started to subside in late 2021, I started shopping different Synology-compatible cameras. As you can see, the list is extensive and not the easiest shopping list. So I defaulted to my personal technical guru (thanks Justin Lauck), who basically runs his own small business at home in addition to his day job, for his recommendations. He turned me on to these Reolink Outdoor Cameras. I pulled the trigger on those bullet cameras (see what I did there) in December of 2021.

    I had no desire to climb a ladder outside in the winter in Pittsburgh, so I spent some time in the winter preparing the wire runs inside. I already had an ethernet run to the second floor for one of my wireless access points, but decided it would be easiest to replace that run with two new lines of fresh Cat6. One went to the AP, the other to a small POE switch I put in the attic.

    The spring and summer brought some more major life events (just a high school graduation and a kid starting college… nothing major). The one day I tried to start, I realized that my ladder was not tall enough to get where I wanted to go. That, coupled with me absolutely not wanting to be 30′ up in the air, led me to delay a bit.

    Face your fears!

    Faced with potentially one of the last good weather weeks of the year, yesterday was as good a time as any to get a ladder and get to work. I rented a 32′ extension ladder from my local Sunbelt rentals and got to work. For whatever reason, I started with the highest point. The process was simple, albeit with a lot of “up and down” the ladder.

    1. Drill a small hole in the soffit, and feed my fish stick into the attic.
    2. Get up into the attic and find the fish stick.
    3. Tape the ethernet cable to the end of the stick.
    4. Get back up on the ladder and fish the cable outside.
    5. Install the camera
    6. Check the Reolink app on my phone (30′ up) to ensure the camera angle is as I want it to be.

    Multiply that process by 4 cameras (one on each corner of the house), and I’m done! Well, done with the hardware install.

    Setting up Surveillance Station

    After the cameras were installed and running, I started adding the cameras to Surveillance Station. The NAS comes with a license for two cameras, meaning I could only add two of my four cameras initially. I thought I could just purchase a digital code for a 4 pack, but realized that all the purchases send a physical card, meaning I have to wait for someone to ship me a card. Since I had some loyalty points at work, I ordered some Amazon gift cards which I’ll use towards my 4 camera license pack. So, for now, I only have two cameras in Surveillance Station.

    There are so many options for setup and configuration that for me to try to cover them all here would not do it justice. What I will say is that initial configuration was a breeze, and if you take the time to get one camera configured to your liking, you can easily copy settings from that camera to other cameras.

    As for settings, here’s just a sampling of what you can do:

    • Video Stream format: If your camera is compatible with Surveillance Station, you can assign streams from your camera to profiles in Surveillance Station. My Reolinks support a low-quality stream (640×480) and a high-quality stream (2560×1440), and I can map those to different Surveillance Station profiles to change the quality of recording. For example, you can set up Surveillance Station to record at low quality when no motion is detected, and high quality when motion is detected.
    • Recording Schedule: You can customize the recording schedule based on time of day.
    • On Screen Display Settings – You can have overlays (like date/time and camera name) use the camera’s settings or be managed via Surveillance Station. The latter is nice in that it allows you to easily copy settings across cameras.
    • Event Detection – Like On Screen Display, you can use the camera’s built-in event detection algorithm, or use Surveillance Station’s algorithm. This one I may test more: I have to assume that Surveillance Station would use more CPU on the Synology when this is set to use the Surveillance Station algorithm, as opposed to letting the camera hardware detect motion. For now, I am using the camera’s built-in algorithms.

    As I said, there are lots of additional features, including notifications, time lapse recording, and privacy screening. All in all, Surveillance Station turns your Synology into a fully functioning network video recorder.

    Going mobile

    It’s worth noting that the DS Cam app allows you to access your Surveillance Station video outside of your home, as well as register for push notifications to be sent to your phone for various events. The app itself is pretty easy to use, and since it uses the built-in Diskstation authentication, there is no need to duplicate user accounts across the systems.

    Future Expansion

    Once I get the necessary license for the other two cameras, I plan on letting the current setup ride to see what I get. However, I do have a plan for two more cameras:

    1. Inside my garage facing the garage door. This would allow me to catch anyone coming in via the garage, which, right now, I only get as a side view.
    2. On my shed facing the back of my home. This would give me a full view of the back of the house. This one, though, comes with a power issue that may end up getting solved with a small off-grid solar solution.

    Additionally, I am going to research how this can tie in with my home automation platform. It might be nice to have lights kick on when motion is detected outside.

    For now, though, I will be happy knowing that I am more closely monitoring what goes on outside my home.

  • Tech Tips – Upgrading your Argo cluster tools

    Moving my home lab to GitOps and ArgoCD has been, well, nearly invisible now. With the build pipelines I have in place, I’m able to work on my home projects without much thought to deploying changes to my clusters.

    My OCD, however, prevents me from running old versions. I really want to stay up-to-date when it comes to the tools I am using. This resulted in periodically updating all of my Helm repositories and searching for new versions by hand.

    After getting tired of this “search and replace,” I wrote a small PowerShell script which automatically updates my local Helm repositories, and then searches through the current directory (recursively) for Chart.yaml files. If there are dependencies in a Chart.yaml, it will upgrade them to the latest version and save the Chart.yaml file.
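
    My actual script is PowerShell, but the idea boils down to something like this rough shell sketch (assuming helm, yq, and jq are installed and the repositories are already added; it naively takes the first matching chart from helm search):

    ```bash
    #!/usr/bin/env bash
    set -euo pipefail

    helm repo update

    # Walk every Chart.yaml under the current directory and bump its dependencies.
    find . -name 'Chart.yaml' | while read -r chart; do
      count=$(yq '.dependencies | length' "$chart")
      [[ "$count" =~ ^[0-9]+$ ]] || continue   # skip charts with no dependencies block
      for ((i = 0; i < count; i++)); do
        name=$(yq ".dependencies[$i].name" "$chart")
        # Latest published version of the chart in my local repos.
        latest=$(helm search repo "$name" --output json |
          jq -r --arg n "$name" '[.[] | select(.name | endswith("/" + $n))][0].version')
        if [[ -n "$latest" && "$latest" != "null" ]]; then
          yq -i ".dependencies[$i].version = \"$latest\"" "$chart"
          echo "$chart: $name -> $latest"
        fi
      done
    done
    ```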

    It does not, at the moment, automatically commit these changes, so I do have a manual step to confirm the upgrades. However, it is a much faster way to get changes staged for review and commit.

  • I’ll take ‘Compatibility Matrices’ for $1000, Alex…

    I have posted a lot on Kubernetes over the past weeks. I have covered a lot around tools for observability, monitoring, networking, and general usefulness. But what I ran into over the past few weeks is both infuriating and completely avoidable, with the right knowledge and time to keep up to speed.

    The problem

    I do my best to try and keep things up to date in the lab. Not necessarily bleeding edge, but I make an effort to check for new versions of my tools and update them. This includes the Kubernetes versions of my clusters.

    Using RKE, I am limited to the Kubernetes versions that RKE supports. So I have been running 1.23.6 (or, more specifically, v1.23.6-rancher1-1) for at least a month or so.

    About two weeks ago, I updated my RKE command line and noticed that v1.23.8-rancher1-1 was available. So I changed the cluster.yml in my non-production environment and ran rke up. No errors, no problems, right?? So I made the change to the internal cluster and started the rke up for that cluster. As that was processing, however, I noticed that my non-production cluster was down. Like, the Kube API was running, so I could get the pod list. But every pod was erroring out. I did not notice anything in the pod events that was even remotely helpful, other than that the pods could not be scheduled. As I did not have time to properly diagnose, I rolled the cluster back to 1.23.6. That worked, so I left it alone.
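
    For context, the upgrade itself is just a one-line change in cluster.yml, using one of the version strings the RKE CLI reports as supported, followed by another rke up:

    ```yaml
    # cluster.yml (trimmed) – the only line that changed
    kubernetes_version: v1.23.8-rancher1-1
    ```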

    I will not let the machines win, so I stepped back into this problem today. I tried upgrading again (both to 1.23.8 and 1.24.2), with the same problems. In digging into the docker container logs on the hosts, I found the smoking gun:

    Unit kubepods-burstable.slice not found.

    Eh? I can say I have never seen that phrase before. But a quick Google search pointed me towards cgroup and, more generally, the docker runtime.

    Compatibility, you fickle fiend

    As it turns out, there is quite a large compatibility matrix between Rancher, RKE, Kubernetes, and Docker itself. My move to 1.23.8 clearly pushed my Kubernetes version past what my Docker version supports (which, if you care, was Docker 20.10.12 on Ubuntu 22.04).

    I downgraded the non-production cluster once again, then upgraded Docker on those nodes (a simple sudo apt upgrade docker). Then I tried the upgrade, first to v1.23.8 (baby steps, folks).

    Success! Kubernetes version was upgraded, and all the pods restarted as expected. Throwing caution to the wind, I upgraded the cluster to 1.24.2. And, this time, no problems.

    Lessons Learned

    Kubernetes is a great orchestration tool. Coupled with the host of other CNCF tooling available, one can design and build robust application frameworks which let developers focus more on business code than deployment and scaling.

    But that robustness comes with a side of complexity. The host of container runtimes, Kubernetes versions, and third-party hosts means cluster administration is, and will continue to be, a full-time job. Just look at the Rancher Manager matrix at SUSE. Thinking about keeping production systems running while juggling the compatibility of all these different pieces makes me glad that I do not have to manage these beasts daily.

    Thankfully, cloud providers like Azure, GCP, and AWS provide some respite by simplifying some of this. Their images almost force compatibility, so that one doesn’t run into the issues that I ran into on my home cluster. I am much more confident in their ability to run a production cluster than I am in my own skills as a Kubernetes admin.