Category: Technology

  • Home Lab: Disaster Recovery and time for an Upgrade!

    I had the opportunity to travel to the U.S. Virgin Islands last week on a “COVID delayed” honeymoon. I have absolutely no complaints: we had amazing weather, explored beautiful beaches, and got a chance to snorkel and scuba in some of the clearest water I have seen outside of a pool.

    Trunk Bay, St. John, US VI

    While I was relaxing on the beach, Hurricane Ida was wrecking the Gulf and dropping rain across the East, including at home. This led to power outages, which led to my home server having something of a meltdown. And in this, I learned the value of good neighbors who can reboot my server and the cost of not setting up proper disaster recovery in Kubernetes.

    The Fall

    I was relaxing in my hotel room when I got text messages from my monitoring solution. At first, I figured, “The power went out, things will come back up in 30 minutes or so.” But after about an hour, nothing. So I texted my neighbor and asked if he could reset the server. After the reset, most of my sites came back up, with the exception of some of my SQL-dependent sites. I’ve had some issues with SQL Server instances not starting their service correctly, so some sites stayed down… but that’s for another day.

    A few days later, I got the same monitoring alerts. My parents were house-sitting, so I had my mom reset the server. Again, most of my sites came up. Being in America’s Caribbean Paradise, I promptly forgot all about it, figuring things were fine.

    The Realization

    Boy, was I wrong. When I sat down at the computer on Sunday, I randomly checked my Rancher instance. Dead. My other clusters were running, but all of the clusters were reporting issues with etcd on one of the nodes. And, quite frankly, I was at a loss. Why?

    • While I have taken some Pluralsight courses on Kubernetes, I was a little overly dependent on the Rancher UI to manage Kubernetes. With it down, I was struggling a bit.
    • I did not take the time to find and read the etcd troubleshooting steps for RKE. Looking back, I could most likely have restored the etcd snapshots and been ok. Live and learn, as it were.

    Long story short, my hacking attempts to get things running again pretty much killed my Rancher cluster and made a mess of my internal cluster. Thankfully, the stage and production clusters were still running, but with a bad etcd node.

    The Rebuild

    At this point, I decided to throw in the towel and rebuild the Rancher cluster. So I deleted the existing rancher machines and provisioned a new cluster. Before I installed Rancher, though, I realized that my old clusters might still try to connect to the new Rancher instance and cause more issues. I took the time to remove the Rancher components from the stage and production clusters using the documented scripts.

    When I did this with my internal tools cluster… well, it made me realize there was a lot of unnecessary junk on that cluster. It was only running ELK (which was not even fully configured) and my Unifi Controller, which I moved to my production box. So, since I was already rebuilding, I decided to decommission my internal tools box and rebuild that as well.

    With two brand new clusters and two RKE clusters clean of Rancher components, I installed Rancher and brought everything back under management.

    The Lesson

    From this little meltdown and reconstruction I have learned a few important lessons.

    • Save your etcd snapshots off machine – RKE takes snapshots of your etcd regularly, and there is a process for restoring from a snapshot. Had I known that those snapshots were there, I would have tried this before killing my cluster (see the sketch after this list).
    • etcd is disk-heavy and “touchy” when it comes to latency – My current setup has my hypervisor using my Synology as an iSCSI disk for all my VMs. With 12 VMs running as Kubernetes nodes, any network disruption or I/O lag can cause etcd leader changes. This is minimal during normal operation, but deployments or updates can sometimes cause issues. I have a small upgrade planned for the Synology to add a 1 TB SSD read/write cache, which will hopefully improve things, but I may end up creating a separate subnet for iSCSI traffic to alleviate network hiccups.
    • Slow and steady wins the race – In my haste to get everything working again, I tried some things that did more harm than good. Had I done a bit more research and found the appropriate articles earlier, I probably could have recovered without rebuilding.
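
    For reference, RKE’s CLI can both take and restore these snapshots. A minimal sketch, assuming the cluster.yml used to provision the cluster is at hand; the snapshot name here is a placeholder:

    # Take a one-off snapshot (RKE also takes recurring snapshots by default)
    rke etcd snapshot-save --config cluster.yml --name before-surgery

    # Restore cluster state from a named snapshot
    rke etcd snapshot-restore --config cluster.yml --name before-surgery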

  • Simple Site Monitoring with Raspberry Pi and Python

    My off-hours time this week has been consumed by writing some Python scripts to help monitor uptime for some of my sites.

    Build or Buy?

    At this point in my career, “build or buy” is a question I ask more often than not. As a software engineer, there is no shortage of open source and commercial solutions for almost any imaginable task. Website monitoring is no different. Tools such as StatusCake, Pingdom, and LogicMonitor offer hosted platforms, while tools like Nagios and PRTG offer on-premise installations. With so much to choose from, it’s hard to decide.

    I had a few simple requirements:

    • Simple at first, but expandable as needed.
    • Runs on my own network so that I can monitor sites that are not available outside of my firewall.
    • Runs on minimal hardware. Since most of my servers are virtual machines consolidated on the one lab server, it does not make much sense to put monitoring on that same server. I needed something I could run easily with little to no power.

    Build it is!

    I own a few Raspberry Pis, but the Model 3B and 4B are currently in use. The lone unused Pi is an old Model B (i.e., a Model 1B), so something like Nagios would be, well, unusable by the time it was all said and done. Given that the Raspberry Pi is at home with Python, I thought I would dust off my “language learning skills” and figure out how to make something useful.

    As I started, though, I remembered my free version of Atlassian’s Statuspage. Although the free version limits the number of subscribers and does not include text subscriptions, for my usage it’s perfect. And, near and dear to my developer heart, it has a very well-defined set of APIs for managing statuses and incidents.

    So, with Python and some additional modules, I created a project that lets me do a quick request on a desired website. If the website is down, the Status Page status for the associated component is changed and an incident is created. If/when it comes back up, any open incidents associated with that component are closed.
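
    The core of the script boils down to a quick status check plus a couple of Statuspage API calls. Here is a minimal sketch of that flow using the requests module; the API key, page ID, and component ID are placeholders, and the incident-closing path is left out for brevity:

    import requests

    API_BASE = "https://api.statuspage.io/v1"
    API_KEY = "your-statuspage-api-key"  # placeholder
    PAGE_ID = "your-page-id"             # placeholder
    HEADERS = {"Authorization": f"OAuth {API_KEY}"}

    def check_site(url, component_id):
        """Check a site and push its status to Statuspage, opening an incident if it is down."""
        try:
            up = requests.get(url, timeout=10).status_code < 400
        except requests.RequestException:
            up = False

        # Update the component's status to reflect the check
        status = "operational" if up else "major_outage"
        requests.patch(
            f"{API_BASE}/pages/{PAGE_ID}/components/{component_id}",
            headers=HEADERS,
            json={"component": {"status": status}},
        )

        # Open an incident when the site is down (re-closing it is omitted here)
        if not up:
            requests.post(
                f"{API_BASE}/pages/{PAGE_ID}/incidents",
                headers=HEADERS,
                json={"incident": {
                    "name": f"{url} is unreachable",
                    "status": "investigating",
                    "component_ids": [component_id],
                }},
            )

    check_site("https://www.example.com", "my-component-id")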

    Voila!

    After a few evening hours of tinkering, I have the scripts doing some basic work. For now, a cron job executes the script every 5 minutes, and if a site goes down, it is reported to my Statuspage site.
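
    The cron entry itself is a one-liner; something like this, with a hypothetical script path:

    */5 * * * * /usr/bin/python3 /home/pi/site-monitor/check_sites.py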

    Long term, I plan on adding support for more in-depth checks of my own projects, which utilize .NET’s HealthChecks namespace to report service health automatically. I may also look into setting up the scripts as a service running on the Pi.

    If you are interested, the code is shared on GitHub.

  • Hardening your Kubernetes Cluster: Don’t run as root!

    People sometimes ask me why I do not read for pleasure. As my career entails ingesting the NSA/CISA technical report on Kubernetes Hardening Guidance and translating it into actionable material, I ask that you let me enjoy hobbies that do not involve the written word.

    The NSA/CISA technical report on Kubernetes Hardening Guidance came out on Tuesday, August 3. Quite coincidentally, my colleagues and I had started asking questions about our internal standards for securing Kubernetes clusters for production. Coupled with my current home lab experimentation, I figured it was a good idea to read through this document and see what I could do to secure my lab.

    Hopefully, I will get through this document and write up how I’ve applied everything to my home lab (or, at least, the production cluster in my home lab). For now, though, I thought it prudent to look at the Pod Security section. And, as one might expect, the first recommendation is…

    Don’t run as root!

    For as long as I can remember working in Linux, not running as root was literally “step one.” So it amazes me that, by default, containers are configured to run as root. All of my home-built containers are pretty much the same: simple docker files that copy the results of an external dotnet publish command into the container and then run the dotnet entry point.

    My original docker files used to look like this:

    FROM mcr.microsoft.com/dotnet/aspnet:5.0-focal AS base
    WORKDIR /app
    COPY . /app
    
    EXPOSE 80
    ENTRYPOINT ["dotnet", "my.dotnet.dll"]

    With some StackOverflow assistance, now, it looks something like this:

    FROM mcr.microsoft.com/dotnet/aspnet:5.0-focal AS base
    WORKDIR /app
    COPY . /app
    # Create a group and user so we are not running our container and application as root (user 0), which is a security risk.
    RUN addgroup --system --gid 1000 mygroup \
        && adduser --system --uid 1000 --ingroup mygroup --shell /bin/sh myuser
      
    # Serve on port 8080; we cannot bind port 80 as a non-root user.
    ENV ASPNETCORE_URLS=http://+:8080
    EXPOSE 8080
      
    # Tell docker that all future commands should run as myuser (we must use the numeric UID)
    USER 1000
    
    ENTRYPOINT ["dotnet", "my.dotnet.dll"]

    What’s with the port change?

    The docker file changes are pretty simple: add the commands to create a new group and a new user, and, using the USER command, tell the docker file to execute as the new user. But why the ASPNETCORE_URLS and port change? When running as a non-root user, you are restricted to ports above 1024, so I had to change the exposed port. This necessitated some changes to my helm charts and their service deployment, but, overall, the process was straightforward.
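
    On the helm chart side, the change amounts to pointing the Service’s targetPort at the new non-privileged port. A sketch of the relevant Service definition, with hypothetical names:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-dotnet-app
    spec:
      selector:
        app: my-dotnet-app
      ports:
        - port: 80          # what clients inside the cluster see
          targetPort: 8080  # the non-root port the container now listens on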

    My next steps

    When I find some spare time in the next few months, I’m going to revisit Pod Security Policies and, more specifically, its upcoming replacement: the PSP Replacement Policy. I find it amusing that the NSA/CISA released guidelines that specify usage of what is now a deprecated feature. I also find it typical of our field that our current name for the new version is, quite simply, “Pod Security Policy Replacement Policy.” I really hope they get a better name for that…

  • Moving the home lab to Kubernetes

    Kubernetes has become something of the standard for the orchestration of containers. While there are certainly other options, the Kubernetes platform remains the most prevalent. With that in mind, I decided to migrate my home lab from docker servers to Kubernetes clusters.

    Before: Docker Servers

    Long story short: my home lab has transitioned from Windows servers running IIS to a mix of Linux and Windows containers to Linux-only containers. The apps are containerized, but the DBs still run on some SQL servers.

    Build and deployment are automated: builds through Azure DevOps Pipelines and self-hosted agents (TeamCity before that), and deployments through Octopus Deploy. Container images for my projects live on a ProGet server feed.

    The Plan

    “Consolidate” (and I’ll tell you later why that is in quotes) my servers into Kubernetes Clusters. It seemed an easy plan.

    • Internal K8s Cluster – Runs Rancher and any internal tooling (including Elastic/Kibana) I want to have, but nothing available externally
    • Non-Production K8s Cluster – Runs my *.net and *.org sites, used for test and staging environments
    • Production K8s Cluster – Runs my *.com sites (external), including any external tooling.

    I spent some time learning Packer to provision Hyper-V VMs for my clusters. The clusters all ended up with a control plane node (4 vCPU, 8 GB RAM) and two workers (2 vCPU, 6 GB RAM).

    The Results

    The Kubernetes Clusters

    There was a LOT of trial and error in getting Kubernetes going, particularly with Rancher. So much, in fact, that I probably provisioned the clusters 3 or 4 times each because I felt like I messed up and wanted to do it over again.

    Initially, I tried to manually provision the K8s cluster. Yes, it worked… but RKE is nicer. And, after my manually provisioned K8s cluster went down, I provisioned the internal cluster with RKE. That makes updates easier, as I have the config file.
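
    For the curious, an RKE cluster definition is just a YAML file that you feed to rke up. A minimal sketch of a cluster.yml, with placeholder addresses and SSH user:

    nodes:
      - address: 192.168.1.10
        user: ubuntu
        role: [controlplane, etcd]
      - address: 192.168.1.11
        user: ubuntu
        role: [worker]
      - address: 192.168.1.12
        user: ubuntu
        role: [worker]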

    I provisioned the non-production and production clusters using the Rancher GUI. However, Rancher itself lived on that manually provisioned cluster, so, when it went down, I lost the config files. I currently have two clusters that look like “imported” clusters in Rancher, which makes them harder to manage through the Rancher GUI.

    Storage

    In order to utilize persistent volume claims, I configured NFS on my Synology and installed the nfs-subdir-external-provisioner in all of my clusters. It installs a storage class which can be used in persistent volume claims, and will provision directories in my NFS.
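
    The provisioner installs with a couple of helm commands, per its project README; the NFS server address and export path below are placeholders for my Synology’s values:

    helm repo add nfs-subdir-external-provisioner \
        https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
    helm install nfs-subdir-external-provisioner \
        nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
        --set nfs.server=synology.example.local \
        --set nfs.path=/volume1/kubernetes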

    Ingress

    Right now, I’m using the Nginx Ingress controller from Rancher. I haven’t played with it much, other than the basics. Perhaps more on that when I dig in.
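
    The basics amount to an Ingress resource per site; something like this sketch, with a hypothetical host and backing service:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: my-site
    spec:
      rules:
        - host: www.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: my-site
                    port:
                      number: 80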

    Current Impressions

    Rancher

    It works… but mine is flaky. I think it may be due to some resource starvation. I may try to provision a new internal cluster with better VMs and see how that works.

    I do like the deployment of clusters using RKE; however, I can see how it would be difficult to manage when there is more than one person involved.

    Kubernetes

    Once it was running, it was great: creating new APIs or apps and getting them running in a scalable fashion is easy. Helm charts make deployment and updating a snap.

    That said, I would not trust myself to run this in production without a LOT more training.

  • Ubuntu and Docker…. Oh Snap!

    A few months ago, I made the decision to start building my .NET Core side projects as Linux-based containers instead of Windows-based containers. These projects are mostly CRUD APIs, meaning none of them require Windows-based containers. And, quite frankly, Linux is cheaper…

    Now, I had previously built out a few Ubuntu servers with the express purpose of hosting my Linux containers, and changing the Dockerfile was easy enough.  But I ran into a HUGE roadblock in trying to get my Octopus deployments to work.

    I was able to install Octopus Tentacles just fine, but I could NOT get the tentacle to authenticate to my private docker repository.  There were a number of warnings and errors around the docker-credential-helper and pass, and, in general, I was supremely frustrated.  I got to a point where I uninstalled everything but docker on the Ubuntu box, and that still didn’t work.  So, since it was a development box, I figured there would be no harm in uninstalling Docker…  And that is where things got interesting.

    When I provision my Ubuntu VMs, I typically let the Ubuntu setup install docker. It does this through snaps, which, as I have seen, have some weird consequences. One of those consequences seemed to be a weird interaction between docker login and non-interactive users. The long and short of it was, after several hours of trying to figure out what combination of docker-credential-helper packages and configurations was required, removing EVERYTHING and installing Docker via apt (and docker-compose via a straight download from GitHub) made everything work quite magically.
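
    For anyone in the same boat, the swap looked roughly like this; the docker-compose version is a placeholder, so grab whichever release you want:

    sudo snap remove docker
    sudo apt-get update && sudo apt-get install -y docker.io
    # docker-compose ships as a single binary on GitHub releases
    sudo curl -L "https://github.com/docker/compose/releases/download/<version>/docker-compose-$(uname -s)-$(uname -m)" \
        -o /usr/local/bin/docker-compose
    sudo chmod +x /usr/local/bin/docker-compose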

    While I would love nothing more than to recreate the issue and figure out why it was occurring, frankly, I do not have the time.  It was easier for me to swap my snap-installed Docker packages for apt-installed ones and move forward with my project.

  • Polyglot-on for Punishment

    I apologize for the misplaced pun, but Polyglot has been something of a nemesis of mine over the last year or so. The journey started with my Chamberlain MyQ-compatible garage door and a desire to control it through my ISY. Long story short (see here and here for more details), running Polyglot in a docker image on my VM docker hosts is more of a challenge than I’m willing to tackle. Either it’s a networking issue between the container and the ISY, or I’m doing something wrong, but in any case, I have resorted to running Polyglot on an extra Raspberry Pi that I have at home.

    Why the restart? Well, I recently renovated my master bathroom, and as part of that renovation I installed RGBW LED strips controlled by a MiLight controller. There is a node server for the MiLight WiFi box I purchased, so, in addition to the Insteon switches for the lights in that bathroom, I can now control the electronics in the bathroom through my ISY.

    While it would be incredibly nice to have all this running in a docker container, I’m not at all upset that I got it running on the Pi. My next course of action will be to restart the development of my HA Kiosk app…. Yes, I know there are options for apps to control the ISY, but I want a truly customized HA experience, and for that, I’d like to write my own app.

  • Supporting Teamcity Domain Authentication

    TLDR: TeamCity on Linux (or in a Linux Docker container) only supports SMBv1. Make sure you enable the SMB 1.0/CIFS File Sharing Support feature on your domain controllers.

    A few weeks ago, I decided it was time to upgrade my domain controllers. On a hypervisor with space, it seemed a pretty easy task. My abbreviated steps were something like this:

    1. Remove the backup DC and shut the machine down.
    2. Create a new DC, add it to the domain, and replicate.
    3. Give the new DC the master roles.
    4. Remove the old primary DC from the domain and shut it down.
    5. Create a new backup DC and add it to the domain.

    Seems easy, right? Except that, during step 4, the old primary DC essentially gave up. I was forced to remove it from the domain manually.

    Also, while I was able to change the DHCP settings to reassign DNS servers for the clients that get their IP via DHCP, the machines with static IP addresses required more work to reset the DNS settings. But, after a bit of a struggle, I got it working.

    Except that I couldn’t log in to TeamCity using my domain credentials any more. I did some research and found that, on Linux, TeamCity only supports SMBv1, not SMBv2. So I installed the SMB 1.0/CIFS File Sharing Support feature on both domain controllers, and that fixed my authentication issues.
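
    On Windows Server, enabling that feature comes down to a single PowerShell command on each domain controller:

    Install-WindowsFeature -Name FS-SMB1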

  • MS Teams Notifications Plugin

    I have spent the better part of the last 20 years working on software in one form or another. During that time, it’s been impossible to avoid open source software components.

    I have not, until today, contributed back to that community in a large way. Perhaps I’ve suggested a change here or there, but never really took the opportunity to get involved. I suppose my best excuse is that I have a difficult time finding a spot to jump in and get to work.

    About two years ago, I ported a TeamCity plugin for Slack notifications to get it to work with Microsoft Teams. It’s been in use at my current company since then, and it has a few users who happened to find it on GitHub. I took the step today to publish this plugin to JetBrains’ plugin library.

    So, here’s to my inaugural open source publication!

  • Windows docker containers for TeamCity Build Agents

    I have been using TeamCity for a few years now, primarily as a build tool for some of our platforms at work.  However, because I like to have a sandbox in which to play, I have been hosting an instance of TeamCity at home for roughly the same amount of time. 

    At first, I went with a basic install on a local VM and put the database on my SQL Server, and spun up another VM (a GUI Windows Server instance) which acted as my build agent.  I installed a full-blown copy of Visual Studio Community on the build agent, which provided me the ability to pretty much run any build I wanted.

    As some of my work research turned me towards containers, I realized that this setup was probably a little too heavy, and running some of my support systems (TeamCity, Unifi, etc.) in docker makes them much easier to manage and update.

    I started small, with a Linux (Ubuntu) docker server running the Unifi Controller software and the TeamCity server containers. Then came this blog, which is hosted on the same docker server. And then, as quickly as I had started, I quit containerizing. I left the VM with the build agent running, and it worked.

    Then I got some updated hardware, specifically a new hypervisor.  I was trying to consolidate VMs onto the new hypervisor, and for one reason or another, the build VM did not want to move nicely.  Whether the image was corrupt or what, I don’t know, but it was stuck.   So I took the opportunity to jump into Docker on Windows (or Windows Containers, or whatever you want to call it).

    I was able to take the base docker image that JetBrains provides and add the MSBuild tools to it. That gave me an image that I could use to run as a TeamCity Build agent. You can see my Dockerfile and docker-compose.yml files in my infrastructure repository.
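
    As a rough sketch of the approach, the Dockerfile extends the JetBrains agent image and runs the Visual Studio Build Tools bootstrapper with the silent-install flags Microsoft documents for containers; the base image tag and Build Tools version here are assumptions, so adjust for your setup:

    # escape=`
    FROM jetbrains/teamcity-agent:latest-windowsservercore

    # Pull down the Build Tools bootstrapper and install just the MSBuild workload
    ADD https://aka.ms/vs/16/release/vs_buildtools.exe C:\vs_buildtools.exe
    RUN C:\vs_buildtools.exe --quiet --wait --norestart --nocache `
        --add Microsoft.VisualStudio.Workload.MSBuildTools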

  • Polyglot v2 and Docker – Success!

    I can’t believe it was 6 months ago that I first tried (and failed) to get the Polyglot v2 server from Universal Devices running on Docker. Granted, the problem had a workaround (put it on a Raspberry Pi), so I ignored the problems and just let my Pi do the work.

    But, well, I needed that Pi, so this issue reared its ugly head again. Getting back into it, I remembered that the application kept getting stuck on this error message:

    Auto Discovering ISY on local network.....

    My presumption is that something about the auto-discovery did not like the Docker network, and promptly puked a bit. To try and get past that, I set the ISY_HOST, ISY_PORT, and ISY_HTTPS environment variables in the docker-compose file. However, the portion of the code that skips the autodetection of the ISY host doesn’t look at the environment variables in docker: it looks for environment variables stored in the .env file in ~/.polyglot. In the docker environment using that particular docker-compose file, I wasn’t able to make it work, because there’s no mounted volume.

    The quick way out was to add this to the docker-compose.yml file:

    volumes:
      - /path/on/host:/root/.polyglot

    Then put a .env file on your host (/path/on/host/.env) and set your ISY_HOST, ISY_PORT, and ISY_HTTPS values there. Polyglot should then skip the auto-discovery.

    Also, by default, the container wants to run in HTTPS (https://localhost:3000). Since I use SSL offloading, I turned that off (USE_HTTPS=false in the .env file).

    Here are my new Dockerfile and docker-compose.yml files. Note a few differences:

    1. Based the build off of ubuntu/trusty instead of debian/stretch. You can use whatever you like there, although if you are using a different architecture, the link to download the binaries will have to change.
    2. Created a new directory on the host (/var/polyglot) and created a symbolic link from /root/.polyglot to /var/polyglot.
    3. Added a volume at /var/polyglot in the Dockerfile
    4. Mapped volumes on both the MongoDB service and the Polyglot service (/var/polyglot) to preserve data across compose up/down.
    5. My .env file looks like this:
    ISY_HOST=my.isy.address
    ISY_PORT=80
    ISY_HTTPS=false 
    USE_HTTPS=false