• Home Lab: Disaster Recovery and time for an Upgrade!

    I had the opportunity to travel the U.S. Virgin Islands last week on a “COVID delayed” honeymoon. I have absolutely no complaints: we had amazing weather, explored beautiful beaches, and got a chance to snorkel and scuba in some of the clearest water I have seen outside of a pool.

    Trunk Bay, St. John, US VI

    While I was relaxing on the beach, Hurricane Ida was wreaking havoc on the Gulf and dropping rain across the East, including at home. This led to power outages, which led to my home server having something of a meltdown. And in this, I learned the value of good neighbors who can reboot my server, and the cost of not setting up proper disaster recovery in Kubernetes.

    The Fall

    I was relaxing in my hotel room when I got text messages from my monitoring solution. At first, I figured, “The power went out; things will come back up in 30 minutes or so.” But after about an hour, nothing. So I texted my neighbor and asked if he could reset the server. After the reset, most of my sites came back up, with the exception of some of my SQL-dependent sites. I’ve had some issues with my SQL Servers not starting their services correctly, so some sites stayed down… but that’s for another day.

    A few days later, I got the same monitoring alerts. My parents were house-sitting, so I had my mom reset the server. Again, most of my sites came up. Being in America’s Caribbean Paradise, I promptly forgot all about it, figuring things were fine.

    The Realization

    Boy, was I wrong. When I sat down at the computer on Sunday, I randomly checked my Rancher instance. Dead. My other clusters were running, but all of the clusters were reporting issues with etcd on one of the nodes. And, quite frankly, I was at a loss. Why?

    • While I have taken some Pluralsight courses on Kubernetes, I was a little overly dependent on the Rancher UI to manage Kubernetes. With it down, I was struggling a bit.
    • I did not take the time to find and read the etcd troubleshooting steps for RKE. Looking back, I could most likely have restored the etcd snapshots and been ok. Live and learn, as it were.

    Long story short, my hacking attempts to get things running again pretty much killed my Rancher cluster and made a mess of my internal cluster. Thankfully, the stage and production clusters were still running, but with a bad etcd node.

    The Rebuild

    At this point, I decided to throw in the towel and rebuild the Rancher cluster. So I deleted the existing Rancher machines and provisioned a new cluster. Before I installed Rancher, though, I realized that my old clusters might still try to connect to the new Rancher instance and cause more issues, so I took the time to remove the Rancher components from the stage and production clusters using the documented scripts.

    When I did this with my internal tools cluster… well, it made me realize there was a lot of unnecessary junk on that cluster. It was only running ELK (which was not even fully configured) and my UniFi Controller, which I moved to my production box. So, since I was already rebuilding, I decided to decommission my internal tools box and rebuild that as well.

    With two brand new clusters and two RKE clusters clean of Rancher components, I installed Rancher and got all the management running.

    The Lesson

    From this little meltdown and reconstruction, I have learned a few important lessons.

    • Save your etcd snapshots off-machine – RKE takes snapshots of your etcd data regularly, and there is a process for restoring from a snapshot (sketched just after this list). Had I known that those snapshots were there, I would have tried a restore before killing my cluster.
    • etcd is disk-heavy and “touchy” when it comes to latency – My current setup has my hypervisor using my Synology as an iSCSI disk for all my VMs. With 12 VMs running as Kubernetes nodes, any network disruption or I/O lag can cause etcd leader changes. This is rare during normal operation, but performing deployments or updates can sometimes cause issues. I have a small upgrade planned for the Synology to add a 1 TB SSD read/write cache, which will hopefully improve things, but I may end up creating a dedicated subnet for iSCSI traffic to alleviate network hiccups.
    • Slow and steady wins the race – In my haste to get everything working again, I tried some things that did more harm than good. Had I done a bit more research and found the appropriate articles earlier, I probably could have recovered without rebuilding.
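
    For future reference, the restore itself boils down to a couple of RKE CLI commands. This is only a rough sketch assuming an RKE-provisioned cluster with the default local snapshot location; the snapshot name below is a placeholder you would take from the actual directory listing.

    # On an etcd node, list the recurring snapshots RKE has been taking
    ls /opt/rke/etcd-snapshots/
    # From the machine that holds cluster.yml, restore the cluster from a chosen snapshot
    rke etcd snapshot-restore --config cluster.yml --name <snapshot-name>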

  • Inducing Panic in a Software Architect

    There are many, many ways to induce panic in people, and by no means will I be cluing you in to all the different ways you can succeed in sending me into a tailspin. However, if there is one thing anyone can do that immediately has me at a loss for words and looking for the exit, it is to ask this one question: “What do you do for a living?”

    It seems a fairly straightforward question, and one that should have a relatively straightforward answer. If I said “I am a pilot,” then it can be assumed that I fly airplanes in one form or another. It might lead to a conversation about the different types of planes I fly, whether it’s cargo or people, etc. However, answering “I am a software architect” usually leads to one of two responses:

    1. “Oh..”, followed by a blank stare from the asker and an immediate topic change.
    2. “What’s that?”, followed by a blank stare from me as I try to encapsulate my various functions in a way that does not involve twelve PowerPoint slides and a scheduled break.

    In social situations, or, at the very least, the social situations in which I am involved, being asked what I do is usually an immediate buzz kill. While I am sure people are interested in what I do, there is no generic answer. And the specific answers are typically so dull and boring that people lose interest quickly.

    Every so often, though, I run across someone in the field, and the conversation turns more technical: favorite technology stacks, innovative work in CI/CD pipelines, or some sexy new machine learning algorithm. But for most people, describing the types of IT architects out there is confusing because, well, even we have trouble with it.

    IT Architect Types

    Red Hat has a great post on the different types of IT architects. They outline different endpoints of the software spectrum and how different roles can be assigned based on those endpoints. From there, they illustrate the different roles an architect can play, color-coded to the different orientations along the software spectrum.

    However, only the largest of companies can afford to confine their architects to a single circle in this diagram, and many of us wear one or more “role hats” as we progress through our daily work.

    My architecture work to this point has been primarily developer-oriented. While I have experimented in some of the operations-oriented areas, my knowledge and understanding lie primarily in the application realm. Prior to my transfer to an architecture role, I was an engineering manager. That previous role exposed me to much more of the business side of things, and my typical frustrations today are more about what we as architects can and should be doing to support the business.

    So what do I do?

    In all honesty, I used to just say “software developer” or “software engineer.” Those titles are usually more generally understood, and I can be very generic about it. But as I work to progress in my own career, the need for me to be able to articulate my current position (and desired future positions) becomes critical.

    Today, I try to phrase my answers around being a leader in delivering software that helps customers do their job better. It is usually not as technical, and therefore not as boring, but does drive home the responsibilities of my position as a leader.

    How that answer translates to a cookout, well, that always remains to be seen.

  • Simple Site Monitoring with Raspberry PI and Python

    My off-hours time this week has been consumed by writing some Python scripts to help monitor uptime for some of my sites.

    Build or Buy?

    At this point in my career, “build or buy” is a question I ask more often than not. As a software engineer, there is no shortage of open source and commercial solutions for almost any imaginable task, and website monitoring is no different. Tools such as StatusCake, Pingdom, and LogicMonitor offer hosted platforms, while tools like Nagios and PRTG offer on-premises installations. There is so much to choose from that it’s hard to decide.

    I had a few simple requirements:

    • Simple at first, but expandable as needed.
    • Runs on my own network so that I can monitor sites that are not available outside of my firewall.
    • Since most of my servers are virtual machines consolidated on the one lab server, it does not make much sense to put the monitor on that same server. I needed something I could run easily with little to no power.

    Build it is!

    I own a few Raspberry Pis, but the Model 3B and 4B are currently in use. The lone unused Pi is an old Model B (i.e., a Model 1B), so installing something like Nagios would be, well, unusable when all was said and done. Given that the Raspberry Pi is at home with Python, I thought I would dust off my “language learning skills” and figure out how to make something useful.

    As I started, though, I remembered my free version of Atlassian’s Statuspage. Although the free version limits the number of subscribers and does not include text notifications, for my usage, it’s perfect. And, near and dear to my developer heart, it has a very well defined set of APIs for managing statuses and incidents.

    So, with Python and some additional modules, I created a project that lets me do a quick request on a desired website. If the website is down, the Status Page status for the associated component is changed and an incident is created. If/when it comes back up, any open incidents associated with that component are closed.
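
    Under the hood, the component update is a single call to the Statuspage REST API. Here is a minimal sketch of that call with curl; the page and component IDs (and the API key) are placeholders for your own, and the Python script simply makes the equivalent request and creates or resolves incidents through the incidents endpoints in the same way.

    curl -X PATCH "https://api.statuspage.io/v1/pages/<page_id>/components/<component_id>" \
      -H "Authorization: OAuth <statuspage_api_key>" \
      -H "Content-Type: application/json" \
      -d '{"component": {"status": "major_outage"}}'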

    Voilà!

    After a few evening hours tinkering, I have the scripts doing some basic work. For now, a cron job executes the script every 5 minutes, and if a site goes down it is reported to my statuspage site.
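
    The cron entry itself is a single line (the script path and log location below are just my placeholders):

    */5 * * * * /usr/bin/python3 /home/pi/site-monitor/check_sites.py >> /home/pi/site-monitor/monitor.log 2>&1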

    Long term, I plan on adding support for more in-depth checks of my own projects, which utilize .NET’s HealthChecks namespace to report service health automatically. I may also look into setting up the scripts as a service running on the Pi.
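
    Those deeper checks will most likely just be HTTP calls to whatever endpoint the app maps for its health checks, something along these lines (the /health path is an assumption, not a fixed convention):

    # -f makes curl exit non-zero on an HTTP error status, so the script can treat it as "down"
    curl -fsS https://www.example.com/health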

    If you are interested, the code is shared on GitHub.

  • Free Guy and the “Myth” of AI

    I have been able to get out and enjoy some movies with my kids over the last few weeks. Black Widow, Jungle Cruise, and, most recently, Free Guy, have given me the opportunity to get back in the theaters, something I did not realize I missed as much as I did.

    The last of those, Free Guy, is one of the funniest movies I have seen in a long time, and, considering the trailers, I am not giving anything away when I say there is an element of artificial intelligence within the plot. And it got me thinking more about how AI is perceived versus what it can do, and perhaps how that perception is oddly self-limiting.

    Erik Larson’s The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do explores this topic in much greater depth than I can here, but Larson’s views mirror my own: the myth isn’t that true AI is possible, but rather, the myth is that its arrival is inevitable based on our present trajectory. And the business of AI is interfering with the science of AI in some very big ways.

    Interfering? How, do you ask, can monetizing artificial intelligence interfere with its own progress?

    AI today is good at narrow applications involving inductive reasoning and data processing, like recognizing images or playing games. But these successes do not push AI towards a more general intelligence. These successes do, however, make AI a viable business offering.

    Human intelligence is a blend of inductive reasoning and conjecture, i.e. guesses. These are guesses informed by our own experiences and the context of the situation, called abduction by the AI community. And we have no idea how to program this type of contextual/experiential guessing into computers today. But our success in those narrow areas has taken the focus away from understanding the complexities of abduction, and has stifled innovation within the field.

    It is a scientific unknown as to whether or not we are capable of producing an artificial intelligence with levels of both inductive reasoning and conjecture. But to assume it will “just happen” by chipping away at one part of the problem is folly. Personally, I believe artificial intelligence is possible, but not without a shift in focus from productization to research and innovation. If we understand how people make decisions, we can not only try to mimic that behavior with AI, but also gain more insight into ourselves in the process.

  • Hardening your Kubernetes Cluster: Don’t run as root!

    People sometimes ask me why I do not read for pleasure. As my career entails ingesting the NSA/CISA technical report on Kubernetes Hardening Guidance and translating it into actionable material, I ask that you let me enjoy hobbies that do not involve the written word.

    The NSA/CISA technical report on Kubernetes Hardening Guidance came out on Tuesday, August 3. Quite coincidentally, my colleagues and I had just started asking questions about our internal standards for securing Kubernetes clusters for production. Coupled with my current home lab experimentation, I figured it was a good idea to read through this document and see what I could do to secure my lab.

    Hopefully, I will get through this document and describe how I’ve applied everything to my home lab (or, at least, to the production cluster in my home lab). For now, though, I thought it prudent to look at the Pod Security section. And, as one might expect, the first recommendation is…

    Don’t run as root!

    For as long as I can remember working in Linux, not running as root was literally “step one.” So it amazes me that, by default, containers are configured to run as root. All of my home-built containers are pretty much the same: simple Dockerfiles that copy the results of an external dotnet publish command into the container and then run the dotnet entry point.

    My original Dockerfiles looked like this:

    FROM mcr.microsoft.com/dotnet/aspnet:5.0-focal AS base
    WORKDIR /app
    COPY . /app
    
    EXPOSE 80
    ENTRYPOINT ["dotnet", "my.dotnet.dll"]

    With some StackOverflow assistance, they now look something like this:

    FROM mcr.microsoft.com/dotnet/aspnet:5.0-focal AS base
    WORKDIR /app
    COPY . /app
    # Create a group and user so we are not running our container and application as root (user 0), which is a security issue.
    RUN addgroup --system --gid 1000 mygroup \
        && adduser --system --uid 1000 --ingroup mygroup --shell /bin/sh myuser
      
    # Serve on port 8080; we cannot bind to port 80 with a custom user that is not root.
    ENV ASPNETCORE_URLS=http://+:8080
    EXPOSE 8080
      
    # Tell Docker that all future commands should run as myuser; the numeric UID must be used here.
    USER 1000
    
    ENTRYPOINT ["dotnet", "my.dotnet.dll"]

    What’s with the port change?

    The Dockerfile changes are pretty simple: add the commands to create a new group and user and, using the USER instruction, tell Docker to execute everything that follows as that user. But why change ASPNETCORE_URLS and the exposed port? When running as a non-root user, you are restricted to ports above 1024, so I had to move the site to port 8080. This necessitated some changes to my Helm charts and their service definitions, but, overall, the process was straightforward.
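
    A quick sanity check after building confirms the container is no longer running as root (the image tag is a placeholder, and the commented line is roughly the output you should see):

    docker build -t my-dotnet-app .
    docker run --rm --entrypoint id my-dotnet-app
    # uid=1000(myuser) gid=1000(mygroup) groups=1000(mygroup)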

    My next steps

    When I find some spare time in the next few months, I’m going to revisit Pod Security Policies and, more specifically, their upcoming replacement: the PSP Replacement Policy. I find it amusing that the NSA/CISA released guidelines that specify usage of what is now a deprecated feature. I also find it typical of our field that the current name for the new version is, quite simply, “Pod Security Policy Replacement Policy.” I really hope they come up with a better name for that…

  • First Impressions: Windows Terminal

    From my early days on the Commodore 64 to my current work with Linux (/bin/bash) and Windows (PowerShell, mostly), I have spent a tremendous amount of time in command lines over the course of my life. So, when I stumbled across Windows Terminal, it seemed like a good opportunity to evaluate a new container for my favorite command lines.

    Microsoft Windows Terminal

    Windows Terminal is an open source application from Microsoft that touts itself as “… a modern terminal application for users of command-line tools and shells …”. It promotes features such as multiple tabs, panes, Unicode and UTF-8 character support, a GPU-accelerated text rendering engine, and the ability to create your own themes and customize text, colors, backgrounds, and shortcuts. [1]

    In my experience, it lives up to that billing. The application is easy to install (in particular with Chocolatey), quick to configure, and provides a wide range of features to make managing my command line windows much easier.

    Install and Initial Configuration

    I use Chocolatey pretty heavily to manage my installs, and thankfully, there is a package for Windows Terminal:

    choco install microsoft-windows-terminal -y

    It is worth noting that the recommended installation method is actually through the Microsoft Store, not Chocolatey. It is also worth noting that uninstalling Windows Terminal through Chocolatey deletes your settings file, so if you want to switch, be sure to back up that settings file before uninstalling.

    When I first installed Windows Terminal, there was no User Interface for settings, which meant opening the Settings file and editing its JSON. The settings file is fairly intuitive and available settings are well documented, which made my initial setup pretty easy. Additionally, as all settings are stored in the JSON file, migrating settings from one machine to another is as simple as copying the file between machines. Starting with version 1.8, a Settings UI was added to help ease some of the setup.

    Additional Tools Setup

    As I perused the documentation, I came across the setup for Powerline, which provides custom command line prompts for Git repositories. I was immediately intrigued: I have been using posh-git for years, and Powerline extends posh-git and oh-my-posh and adds some custom glyphs for graphical interfaces. The installation tutorial is well done and complete, which is no surprise considering the source material comes from Mr. Hanselman.

    My home lab work has brought me squarely back into the realm of Linux and SSH, which was yet another reason I was looking for an updated terminal tool. While there is no explicit profile help for SSH, there is a good tutorial on configuring SSH profiles.

    Summary

    I have been using Windows Terminal now for around 4 months, and in that time, I have become more comfortable with it. I am still a novice when it comes to the various actions and shortcuts that it supports, which is why they are notably absent from this write-up. The general functionality and, in particular, the support for profiles and console coloring allow for much better organization of what used to be 4-8 PowerShell console windows open at any one time on my PC. If you are a command line user, Windows Terminal is worth a look.

  • Work / Life Balance: A long-time remote worker’s perspective

    Summertime brings with it some time off for travel and relaxation, coupled with meeting my standard role expectations. As I struggle with balancing my desire to perform the work I “normally” do with the desire to jump in the pool when it’s nice out, it occurs to me that “work / life balance” has received a lot of attention in the past year or so.

    Work / Life Balance?

    A quick Google search of “work life balance working from home” yields approximately 1.5 million results. If you’re looking to me to consolidate these articles into a David Letterman-esque Top Ten List, well, then I would simply be adding to the stack. Most, if not all, of these articles suggest various rules to help you compartmentalize your world into “work” and “life,” and to help achieve balance. However, I’ll let you in on a little secret:

    It’s all work!

    I have held a number of roles in the last twenty years. Professionally, I have ascended from support technician to application architect, with a lengthy stint in engineering management somewhere in between. Personally, I am a husband, father, friend, neighbor, volunteer, etc. And as I look at this list, I come to a very startling realization: all of my roles require physical or mental effort to complete. In other words, by the definition of the word, it’s all work!

    Role Balancing

    Since everything is work, what does it mean to “balance” your life? In my case, when I choose my roles, both personal and professional, I make a conscious effort to ensure two things:

    1. I will (mostly) enjoy the new role. No role is perfect.
    2. The new role can be integrated with my current roles.

    I will admit that these rules have caused me to turn down more financially lucrative roles in favor of maintaining a balance between my life roles. However, these rules have also allowed me to enjoy success in all of the roles I choose AND simply enjoy the roles themselves.

    Just manage yourself

    In my move from engineer to manager, my manager at the time gave me a great piece of advice: get your management tasks done first; otherwise, your team will falter because they are waiting on you, and the team will fail.

    My personal management tasks boil down to one overall goal: stay healthy, physically and mentally. The tasks to accomplish this goal can vary greatly depending on who you are. After this, do the work! Organize yourself and your roles the best way you know how, execute your tasks, evaluate the results, lather, rinse, repeat…

    I realize that last paragraph boils decades of time and task management methodologies into a single sentence, but the point is: however you get your work done, do it. More importantly, pick roles that you enjoy, minimize the roles you do not enjoy, and you will worry less about your work-life balance.

    Find a job you enjoy doing, and you will never have to work a day in your life.

    Mark Twain

  • To define or document? Methods for generating API Specifications

    There is an inherent “chicken and egg” problem in API development. Do we define a specification before creating an API implementation (specification-first), or do we implement an API and generate a specification from that implementation (code-first)?

    Determining how to define and develop your APIs affects future consumption and alternative implementations, so it is important to evaluate the purpose of your API and identify the most effective method for defining it.

    API (or Code) First

    In API-first (or code-first) development, developers set up a code project and begin coding endpoints and models immediately. Typically, developers treat the specification as generated documentation of what they have already done. This method keeps the API definition fluid as implementation occurs: oh, you forgot a property on an object, or a whole path? No problem, just add the necessary code; automated generation will take care of updating your API specification. When you know that the API you are working on is going to be the only implementation of that interface, this approach makes the most sense, as it requires less upfront work from the development team and it is easier to change the API specification.

    Specification First

    On the other hand, specification-first development entails defining the API specification first. Developers define the paths, actions, responses, and object models before writing any code at all. From there, they can generate both client and server code. This requires more effort: developers must define all the necessary changes prior to the actual implementation on either the client or server side. That upfront effort produces a truly reusable specification, since the specification is not generated from a single implementation of the API. This method is most useful when developing specifications for APIs that will have multiple implementations.

    What should you use?

    Whichever you want. I’m really not here to tout the benefits of either one. In my experience, the choice depends primarily on answering the following question: Will there be only one implementation of this API? If the answer is yes, then code-first would be my choice, simply because it does not require a definition process up front. If, however, you anticipate more than one implementation of a given API, it is wise to start with the specification first. Changes to the specification should be more deliberate, as they will affect a number of implementations.

    Tools to help

    No matter your selection, there are tools to aid you in both cases.

    For specification-first development, the OpenAPI Generator is a great tool for generating consumer libraries and implementations. Once you create the API specification document, the OpenAPI Generator can generate a wide array of clients, servers, and other schemas. I have used the generator to create Axios-based TypeScript clients for user interfaces as well as model classes for server-side development. I have only ever used the OpenAPI Generator in a manual generation process: when the developer changes the specification, they must also regenerate the client and server code. This, however, is not a bad thing: specification changes typically must be very deliberate and take into account all existing implementations, so keeping the process manual forces that.
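
    For reference, generating those artifacts is a one-liner per target; the specification and output paths here are placeholders for your own project layout.

    # Axios-based TypeScript client for a user interface
    openapi-generator-cli generate -i my-api.yaml -g typescript-axios -o ./clients/typescript
    # Server-side models and stubs, e.g. for ASP.NET Core
    openapi-generator-cli generate -i my-api.yaml -g aspnetcore -o ./servers/aspnetcore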

    In my API-first projects, I typically use the NSwag toolchain both to generate the specification from the API code and to generate any clients that I may need. NSwag’s toolchain integrates nicely with .NET 5 projects and can be configured to generate specifications and additional libraries at build time, making it easy to deploy these changes automatically.
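
    When it is not wired into the build, the same generation can be kicked off manually from an NSwag configuration document; the file name below is just the usual convention, and the variable is optional.

    nswag run nswag.json /variables:Configuration=Release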

    It is worth noting that both NSwag and the OpenAPI Generator can be configured to support either method; my examples above simply come from my own experience with each.

  • Designing for the public cloud without breaking the bank

    The ubiquity and convenience of cloud computing resources are a boon to companies who want to deliver digital services quickly. Companies can focus on delivering content to their consumers rather than investing in core competencies such as infrastructure provisioning, application hosting, and service maintenance. However, for those companies who have already made significant investment into these core competencies, is the cloud really better?

    The Push to (Public) Cloud

    The maturation of the Internet and the introduction of public cloud services has reduced the barrier of entry for companies who want to deliver in the digital space. What used to take a dedicated team of systems engineers and a full server/farm buildout can now be accomplished by a few well-trained folks and a cloud account. This allows companies, particularly those in their early growth phases, to focus more on delivering content rather than growing core technological competencies.

    Many companies with heavy investment in core competencies and private clouds are making decisions to move their products to the cloud. This move can offer established companies a chance to “act as a startup” by refining their focus and deliver content faster. However, when those companies look at their cloud bill, there can be an element of sticker shock. [1]

    Why Move

    For a company with a heavy investment in its own private cloud infrastructure, why would you move?

    Inadequate Skillsets for Innovative Technology

    While companies may have significant investments in IT resources, they may not have the capability or desire to train these resources to maintain and manage the latest technologies. For example, managing a production-level Kubernetes cluster requires skills that may not be present in the company today.

    Managed Services

    This is related to the first point, but, in many cases, it is much easier to let the company that created the service host it, rather than trying to host it on your own. Things like Elastic or Kafka can be hosted internally, but letting the experts run your instances as a managed service can save you money in the long run.

    Experimentation

    Cloud accounts can provide excellent sandbox environments for senior engineers to prove their work quickly, focusing more on functionality than provisioning. Be careful, though: this can lead to skipping the steps between proof of concept and production.

    Building for the future

    As architects/senior engineers, how can we reconcile this? While we want to iterate quickly and focus on delivering customer value, we must also be mindful of our systems’ design.

    Portability

    The folks at Andreessen Horowitz used the term “repatriation” to describe a move back to private clouds from public clouds. As you can imagine, this may be very easy or very difficult, depending on your architecture: repatriating a virtual machine is much easier than, say, repatriating an API Management service from the public cloud to a private offering.

    As engineers design solutions, it is important to think about the portability of the solution being created. Can the solution you are building be hosted anywhere (public or private cloud), or is it tailor-made to a specific public cloud? The latter may allow for a quick ramp-up, but it will lock you into a single cloud with little you can do to improve your spend.

    Cost

    When designing a system for the cloud, you must think a lot harder about the pricing of the individual components. Every time you add another service, storage account, VM, or cluster on your design diagram, you are adding to the cost of your system.

    It is important to remember that some of these costs are consumption-based, which means you will need to estimate your usage to get a true idea of the cost. I can think of nothing worse than a huge cloud bill on a new system because I forgot to estimate consumption charges.

    SOLID Principles

    SOLID design principles are typically applied to object-oriented programming; however, some of those same principles should be applied to larger cloud architectures. In particular, Interface Segregation and Liskov Substitution (or, more generally, design by contract) facilitate simpler changes to services by abstracting the implementation from the service definition. Components that are easier to replace are easier to repatriate into private clouds when profit margins start to shrink.

    What is the Future?

    Do we need the public cloud? Most likely, yes. Initially, it will shorten your time to market by removing some of the blockers and allowing your engineers to focus on content rather than core competencies. However, architects and systems engineers need to be aware of the trade-offs in going to the cloud. While private clouds represent a large initial investment and ongoing maintenance, public cloud costs, particularly those on consumption-based pricing, can quickly spiral out of control if not carefully watched.

    If your company already has good IT infrastructure and the knowledge in place to run its own private cloud, or if your cloud costs are slowly eating into your profit margins, it is probably a good time to consider repatriating services onto private infrastructure.

    As for the future: a mix of public cloud services and private infrastructure services (hybrid cloud) is most likely in my future, however, I believe the trend will be different depending on the maturity and technical competency of your staff.

    References

    1. “Cloud ‘sticker shock’ explored: We’re spending too much”
    2. “The Cost of Cloud, a Trillion Dollar Paradox”

  • Moving the home lab to Kubernetes

    Kubernetes has become something of the standard for the orchestration of containers. While there are certainly other options, the Kubernetes platform remains the most prevalent. With that in mind, I decided to migrate my home lab from docker servers to Kubernetes clusters.

    Before: Docker Servers

    Long story short: my home lab has transitioned from Windows servers running IIS to a mix of Linux and Windows containers to Linux only containers. The apps are containerized, but the DBs still run on some SQL servers.

    Build and deployment are automated: builds run through Azure DevOps Pipelines and self-hosted agents (TeamCity before that), and deployment goes through Octopus Deploy. Container images for my projects live on a ProGet feed.

    The Plan

    “Consolidate” (and I’ll tell you later why that is in quotes) my servers into Kubernetes Clusters. It seemed an easy plan.

    • Internal K8s Cluster – Runs Rancher and any internal tooling (including Elastic/Kibana) I want to have, but not available externally
    • Non-Production K8s Cluster – Runs my *.net and *.org sites, used for test and staging environments
    • Production K8s Cluster – Runs my *.com sites (external), including any external tooling.

    I spent some time learning Packer to provision Hyper-V VMs for my clusters. The clusters all ended up with a control plane node (4 vCPU, 8 GB RAM) and two workers (2 vCPU, 6 GB RAM each).

    The Results

    The Kubernetes Clusters

    There was a LOT of trial and error in getting Kubernetes going, particularly with Rancher. So much, in fact, that I probably provisioned the clusters 3 or 4 times each because I felt like I messed up and wanted to do it over again.

    Initially, I tried to manually provision the K8s cluster. Yes, it worked… but RKE is nicer. And, after my manually provisioned cluster went down, I provisioned the internal cluster with RKE. That makes updates easier, as I have the config file.
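
    The RKE workflow that replaced my manual provisioning is pleasantly short; both commands come from the RKE CLI, and cluster.yml is its standard configuration file.

    # Answer the interactive prompts (nodes, roles, SSH details) to produce cluster.yml
    rke config --name cluster.yml
    # Provision the cluster; re-run later with an edited cluster.yml to update it
    rke up --config cluster.yml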

    I provisioned the non-production and production clusters using the Rancher GUI. However, that GUI ran on the “manually provisioned” cluster, so, when it went down, I lost the config files. I currently have two clusters that look like “imported” clusters in Rancher, so they are harder to manage through the Rancher GUI.

    Storage

    In order to utilize persistent volume claims, I configured NFS on my Synology and installed the nfs-subdir-external-provisioner in all of my clusters. It installs a storage class that can be used in persistent volume claims and provisions directories on my NFS share.
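
    Installation is a standard Helm release pointed at the Synology’s NFS export; the server address and path below are placeholders for my own values.

    helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
    helm install nfs-subdir-external-provisioner \
      nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
      --set nfs.server=192.168.1.50 \
      --set nfs.path=/volume1/k8s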

    Ingress

    Right now, I’m using the Nginx Ingress controller from Rancher. I haven’t played with it much, other than the basics. Perhaps more on that when I dig in.

    Current Impressions

    Rancher

    It works… but mine is flaky. I think it may be due to some resource starvation. I may try to provision a new internal cluster with better VMs and see how that works.

    I do like the deployment of clusters using RKE, however, I can see how it would be difficult to manage when there is more than one person involved.

    Kubernetes

    Once it is up and running, it’s great: creating new APIs or apps and getting them running in a scalable fashion is easy. Helm charts make deployment and updating a snap.

    That said, I would not trust myself to run this in production without a LOT more training.

    References