Deploying hybrid Talos homelab cluster with KubeSpan

In this blog post series I will run through my hybrid Talos cluster bootstrap and configuration process. This post introduces my personal set-up and gives a helicopter view of the bootstrap and deployment process, along with the challenges I encountered along the way.

A hybrid Kubernetes cluster - what is it and why do I need it?

On my self-hosted homelab journey I went through multiple iterations of Kubernetes cluster deployment. All of them, however, shared a common trait: they ran fully locally, usually as virtual machines or dedicated hardware nodes.

For better energy efficiency I run low-power processors. These come with a disadvantage: IO-heavy workloads grind the hardware to a halt, with SSDs being the weak spot. This brings etcd to its knees, and control plane nodes fail. The failure often cascades: pods become unavailable, and traffic to and between pods gets disrupted as the load balancer stops serving traffic to those nodes.

While I accept that running a fully local setup makes me vulnerable to energy supply and consumer-level ISP issues, I still want my control plane to remain available during temporary power or internet problems (although for me both are extremely stable in the Netherlands).

To be more resilient to IO-starved nodes and commodity supply issues, I always envisioned a hybrid setup. I realise it’s not a “fully local” setup, and that’s a tradeoff I accept.

When I learned that Talos ships with KubeSpan - the ability to stretch a Kubernetes cluster across a hybrid cloud / on-premises environment over a WireGuard mesh - I thought this would be the perfect setup to experiment with.

Requirements for a hybrid setup. Provider and resources.

I chose Hetzner for control plane nodes due to its balance of cost, availability and proximity. To keep things minimal, I opted for CX32 - a shared-vCPU Intel node with 4 vCPUs and 8GB of memory. Dedicated vCPUs would be ideal to guarantee performance, but they come at a much higher cost. In the long run, I can always switch if I find performance lacking.

My connectivity to Hetzner Falkenstein is extremely good. However, a few times I found my upstream dropping to a tenth of my ISP-provided upstream speed for about a week before returning to its usual state. I could not always figure out why (once, a dodgy ethernet connection between my switch and the router turned out to be the reason). Apart from that, the connectivity has been great.

Connectivity setup.

Each control plane node will have a public IP address. All nodes at home will share a single IP address, which may become an issue when establishing mesh networking, so I have to ensure KubeSpan uses the right IP address to expose the WireGuard connection. Per the KubeSpan docs, only one side of each connection needs to allow inbound connections on UDP 51820, which fits this topology.

Control plane nodes will have public IP addresses with inbound UDP 51820 open; worker nodes will sit behind a single IP address and should be able to join the cluster with no issues. I need to filter out the local IP addresses of worker nodes that are not reachable from the control plane nodes, which can be achieved via the filters field. For API access I will use a single DNS A record pointing at all control plane node IPs.
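As a sketch of what that filter could look like in the machine config (the exact CIDRs depend on your LAN addressing - the 192.168.0.0/16 exclusion below is my assumption), KubeSpan’s endpoint filter accepts a list of ranges where a `!` prefix excludes a range:

```yaml
machine:
  network:
    kubespan:
      enabled: true
      # Filter which discovered endpoints peers may try to reach.
      # Excluding the RFC1918 range keeps control plane nodes from
      # dialing the workers' unroutable home LAN addresses.
      filters:
        endpoints:
          - 0.0.0.0/0
          - '!192.168.0.0/16'
```

This is a minimal sketch; check the KubeSpan documentation for the full filter semantics before relying on it.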

Base image preparation.

I use a different image for worker nodes with a few extra system extensions enabled:

customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/i915
      - siderolabs/intel-ucode
      - siderolabs/iscsi-tools
      - siderolabs/nfsd
      - siderolabs/nvme-cli
      - siderolabs/thunderbolt
      - siderolabs/util-linux-tools

For Intel-based machines this ensures hardware components (Thunderbolt, the iGPU) will work as expected.

Provisioning of cloud instances and configuring local nodes.

Provisioning of the machines and the cluster will be done in multiple phases. While I am sure it could be done exclusively with infrastructure as code, having imperative bootstrap stages makes race conditions less likely while keeping a sufficient level of repeatability. This works well with Talos, as it provides a good level of idempotency.

For provisioning control plane nodes on Hetzner I will use Terraform; for local nodes I will boot machines via KVM from the Talos image. Since the IPs are fixed, I can run scripts that invoke talosctl against the worker nodes.

Staged deployment. Phase 0 - prerequisites - DNS.

I configured a DNS A record with the control plane public IP addresses. The catch is - I don’t know the IP addresses in advance. I need to run terraform apply to create the Hetzner instances, get the IPs via outputs, and stuff them into the DNS record.
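A sketch of that round trip, assuming a Terraform output named control_plane_ips (the output name is my own; adjust to your configuration):

```shell
# Create the instances, then read their public IPs from the outputs.
terraform apply
terraform output -json control_plane_ips

# The resulting IPs then go into the DNS A record, either by hand
# or via your DNS provider's CLI/API.
```
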

Coming posts.

In the coming blog posts I will walk through all phases of deployment.

Here’s the outline of the provisioning phases.

Staged deployment. Phase 0 - prerequisites - Hetzner instances provisioning via terraform.

Staged deployment. Phase 1 - preparing the environment and the context.

In this phase, I’ll create a .env file with info like the control plane and worker IP addresses and so on.
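A minimal sketch of such a file - all variable names and values here are placeholders of my own invention:

```shell
# Write a hypothetical .env with the addresses later phases rely on.
cat > .env <<'EOF'
CONTROL_PLANE_IPS="203.0.113.10 203.0.113.11 203.0.113.12"
WORKER_IPS="192.168.1.20 192.168.1.21"
CLUSTER_ENDPOINT="https://talos.example.com:6443"
EOF

# Later scripts can simply source it to pick up the values.
. ./.env
echo "$CLUSTER_ENDPOINT"
```
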

Staged deployment. Phase 2 - install talos on Hetzner instances and worker nodes.

This will boot the Talos ISO and install it on a boot drive.

Staged deployment. Phase 3 - generating config files.

This phase generates the configuration files for Talos to apply on my nodes, including various patches required for ingress and storage.
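A hedged sketch of the generation step - the cluster name, endpoint, and patch file names are my own placeholders, not the repo’s actual values:

```shell
# Generate control plane and worker machine configs into ./generated,
# applying patches at generation time.
talosctl gen config homelab https://talos.example.com:6443 \
  --output-dir ./generated \
  --config-patch @patches/kubespan.yaml \
  --config-patch-worker @patches/storage.yaml
```
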

Staged deployment. Phase 4 - apply configs.

This phase applies the Talos configuration to all nodes.
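For a node still in maintenance mode, the first apply has to be unauthenticated, hence --insecure; subsequent changes go through the talosconfig. The IPs and file paths below are placeholders:

```shell
# First-time apply to a Hetzner control plane node in maintenance mode:
talosctl apply-config --insecure \
  --nodes 203.0.113.10 \
  --file generated/controlplane.yaml

# Same for a worker node on the home LAN:
talosctl apply-config --insecure \
  --nodes 192.168.1.20 \
  --file generated/worker.yaml
```
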

Staged deployment. Phase 5 - bootstrap cluster.

This phase will bootstrap the cluster via a single node. Etcd will be initialized. At this point, the cluster should have all nodes joined and the API should be accessible.
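The bootstrap command targets exactly one control plane node; running it against more than one would initialize multiple etcd clusters. The IP is a placeholder:

```shell
talosctl --talosconfig ./generated/talosconfig \
  --endpoints 203.0.113.10 --nodes 203.0.113.10 \
  bootstrap
```
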

Staged deployment. Phase 6 - copy kubeconfig.

This phase simply copies the generated kubeconfig to my home folder. It could also copy the Talos config; however, I keep that in the ./generated folder and run talosctl commands via make.
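A sketch of the copy step (the IP is a placeholder); by default talosctl merges the retrieved kubeconfig into ~/.kube/config:

```shell
talosctl --talosconfig ./generated/talosconfig \
  --endpoints 203.0.113.10 --nodes 203.0.113.10 \
  kubeconfig
```
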

Code.

The code used to deploy the cluster is available on GitHub - sashkachan/talos-kubespan-bootstrap. I will use this code in the walkthrough of all phases and the configuration required to make it succeed.