I am working on building a set of small reference applications to demonstrate some of the patterns and practices that help modernize cloud applications. While configuring all of this in my home lab, I spent at least three hours fighting a problem that turned out to be a configuration issue.
Backend-for-Frontend Pattern
I will get into more details when I post the full application, but I am trying to build out a SPA with a dedicated backend API that would host the SPA and take care of authentication. As is typically the case, I was able to get all of this working on my local machine, including the necessary proxying of calls via the SPA’s development server (again, more on this later).
At some point, I had two containers ready to go: a BFF container hosting the SPA and the dedicated backend, and an API container hosting a data service. I felt ready to deploy to the Kubernetes cluster in my lab.
Let the pain begin!
I have enough Helm/Helmfile samples on hand that getting everything deployed was fairly simple. After fiddling with the container settings, things were running well in non-authenticated mode.
However, when I clicked login, the following happened:
- I was redirected to my OAuth 2.0/OIDC provider.
- I entered my username and password.
- I was redirected back to my application.
- I got a 502 Bad Gateway screen.
502! But why? I consulted Google and found any number of articles indicating that, in the authentication flow, Nginx’s default header size limits are too small for what might be coming back from the redirect. So, consulting the Nginx configuration documentation, I changed the Nginx configuration in my reverse proxy to allow for larger headers.
No luck. Weird. In the spirit of true experimentation (change one thing at a time), I backed those changes out and tried changing the configuration of my Nginx Ingress controller. No luck. So what’s going on?
Too Many Cooks
My current implementation looks like this:
```mermaid
flowchart TB
    A[UI] --UI Request--> B(Nginx Reverse Proxy)
    B --> C("Kubernetes Ingress (Nginx)")
    C --> D[UI Pod]
```
There are two Nginx instances in the path of all my traffic: an instance outside the cluster that serves as my reverse proxy, and an Nginx Ingress controller that serves as the reverse proxy within the cluster.
I tried changing each one separately. Then I tried changing both at the same time. I was still seeing the error. As it turns out, I was also working from some bad information.
Be careful what you read on the Internet
The issue turned out to be the difference in configuration between the two Nginx instances, combined with some bad configuration values I had picked up from old internet articles.
Reverse Proxy Configuration
For the Nginx instance running on Ubuntu, I added the following to my `nginx.conf` file under the `http` section:
```nginx
# Buffers for upstream responses (e.g. large auth cookies on the OIDC callback)
proxy_buffers 4 512k;
proxy_buffer_size 256k;
proxy_busy_buffers_size 512k;
# Buffers for large request headers coming from the client
client_header_buffer_size 32k;
large_client_header_buffers 4 32k;
```
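After making the change on the Ubuntu box, `nginx -t` confirms the syntax is valid and `systemctl reload nginx` picks up the new buffer sizes without dropping connections.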
Nginx Ingress Configuration
I am running RKE2 clusters, so configuring Nginx involves creating a `HelmChartConfig` resource in the `kube-system` namespace. My cluster configuration looks like this:
```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-ingress-nginx
  namespace: kube-system
spec:
  valuesContent: |-
    controller:
      kind: DaemonSet
      daemonset:
        useHostPort: true
      config:
        use-forwarded-headers: "true"
        proxy-buffer-size: "256k"
        proxy-buffers-number: "4"
        client-header-buffer-size: "256k"
        large-client-header-buffers: "4 16k"
        proxy-body-size: "10m"
```
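Applying the `HelmChartConfig` causes RKE2’s Helm controller to re-render the chart with the new values, which the ingress controller then picks up. A quick way to confirm the settings actually landed is to `kubectl exec` into one of the controller pods and grep the rendered `/etc/nginx/nginx.conf` for `proxy_buffer_size`.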
The combination of these two sets of changes got my redirects working without 502 errors.
Better living through logging
One of the things I fought with was finding the right logs to see where the errors were occurring. I’m exporting my reverse proxy logs into Loki using a Promtail instance that listens on a syslog port. So I was “getting” the logs into Loki, but I couldn’t FIND them.
I forgot about the facility in syslog: I have the access logs sending as `local5`, but I configured the error logs without pointing them at `local5`. I learned that, by default, Nginx sends them to `local7`.
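In hindsight, the simpler arrangement is to point both logs at the same facility explicitly; the `error_log` directive accepts the same syslog parameters as `access_log`. A sketch of what that looks like in `nginx.conf`, with a placeholder server address standing in for my actual syslog endpoint:

```nginx
# Placeholder server address; the facility defaults to local7 if not set
access_log syslog:server=promtail.lab.local:1514,facility=local5,tag=nginx combined;
error_log  syslog:server=promtail.lab.local:1514,facility=local5,tag=nginx warn;
```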
Either way, once I found the logs I was able to diagnose the issue, but I spent a lot of time browsing in Loki looking for them first.
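The other change that would have saved me time is surfacing the syslog facility as a Loki label, so a query can filter on it directly instead of me browsing streams. If I am reading the Promtail syslog target documentation correctly, the facility and app name are exposed as internal labels that can be promoted with `relabel_configs`; a minimal sketch, with a placeholder listen address and label names of my own choosing:

```yaml
scrape_configs:
  - job_name: syslog
    syslog:
      listen_address: 0.0.0.0:1514   # placeholder; use the existing syslog port
      labels:
        job: syslog
    relabel_configs:
      # Promote syslog metadata to Loki labels, e.g. {job="syslog", facility="local7"}
      - source_labels: ['__syslog_message_facility']
        target_label: facility
      - source_labels: ['__syslog_message_app_name']
        target_label: app
```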