I often use the projects around my home to toy around with new-to-me technology. Lately I’ve been running a whole assortment of services for myself, from Plex, a media server, to Vaultwarden, a self-hosted password manager based on Bitwarden. All of this lived in a handful of Docker Compose files, which meant every one of these services relied on a single machine being up and connected to the internet, and that really didn’t sit right with me. One of the more painful services to lose during an unexpected outage of that machine was Home Assistant, my home automation platform. I’ve definitely lost the ability to turn lights on and off after a long day of work, only to end up doing more troubleshooting in what can only be described as that scene from Malcolm in the Middle where Hal sets out to change a light bulb and finds a whole day’s worth of tasks standing between him and the original problem.

Removing single points of failure

One of the reasons this system failed so spectacularly, and so often, is that all of my critical infrastructure ran on a single machine that I also constantly used as a test bed for new projects.

It had always been in the back of my head that I wanted some way to run Home Assistant on another machine if I needed to do maintenance on my primary server. A few key things stood between me and that goal: I needed the service to be reachable no matter where it was running, I needed its state to be up-to-date everywhere so it could move at a moment’s notice, and I needed the migration of services to be entirely hands-off. Because of these requirements I decided to pair together a handful of tools.

  • Tailscale
  • Consul
  • Traefik
  • GlusterFS
  • And lastly Nomad.

Keeping a single point of entry for end user devices

Tailscale

Now in my particular setup I have a public DigitalOcean droplet that runs my website; it also serves as a way to reach the services I run at home, since I don’t have a stable public IP address. I am using a VPN service called Tailscale, which gives me a private network with very little setup. I plan on having every node in my cluster live on this network so they can all communicate with each other without worrying about whatever networks sit in between them. Having each node on the same network also drastically simplifies routing requests through a publicly facing node to reach the services.
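
Getting a machine onto the tailnet only takes a couple of commands. As a rough sketch of what runs on each node (the hostname here is just an example, not my actual naming scheme):

# Install Tailscale via their install script, then join the tailnet.
curl -fsSL https://tailscale.com/install.sh | sh
# Authenticate this machine and give it a recognisable name on the network.
sudo tailscale up --hostname=nuc-primary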

Now that everything is connected, I wanted to make sure I didn’t have to maintain a list of which nodes Home Assistant could possibly be running on. The last thing I want is to hunt through a bunch of IP addresses and ports for wherever Home Assistant currently lives. I also didn’t want to be in the business of constantly updating a DNS entry to point at the live instance.

Consul

Luckily this is a problem that is solved today by Consul, a service discovery platform which pairs nicely with the rest of my planned stack. Its sole job is to keep track of which services are running where and how healthy each one is. I went ahead and installed a Consul agent on each node I wanted in my cluster and made sure to bootstrap an odd number of them as voting members so they could reach quorum with each other. This matters in clustered computing: with an even number of authoritative members, the cluster can split into halves that each hold what they believe is perfectly valid state, with no one to break the tie.
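
For reference, the server side of that Consul config boils down to something like the sketch below. Treat it as a trimmed-down example rather than my exact file; the datacenter name and the join addresses are placeholders.

# /etc/consul.d/consul.hcl (sketch)
datacenter       = "home"
data_dir         = "/opt/consul"
server           = true
# Hold the leader election until three voting members have joined.
bootstrap_expect = 3
# Tailscale addresses of the other servers to join on boot (placeholders).
retry_join       = ["100.64.0.1", "100.64.0.2"]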

Consul needed to be set up to advertise only the IP addresses assigned by Tailscale, which I accomplished by adding bind_addr to my Consul config and setting it to a go-sockaddr template that grabs the IP address for a given interface name, in this case tailscale0.

bind_addr = "{{ GetInterfaceIP \"tailscale0\" }}"

Now the addresses Consul hands back for services will automatically be routable from every other node in the cluster.

Consul also ships with a DNS server which can be used to reach services without having to know where they actually live. For example, if you were running MongoDB as a service, you would find it by querying mongodb.service.consul against Consul’s DNS server, and it would return a list of IP addresses which expose the mongodb service. There are a few other primitives as well, but the one we care about at the moment is node, used for finding a specific node in the cluster.
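
A quick way to sanity check this is to query Consul’s DNS interface directly, which listens on port 8600 by default. The service and node names below are only examples:

# Ask Consul which addresses currently expose the homeassistant service.
dig @127.0.0.1 -p 8600 homeassistant.service.consul +short
# Or resolve a specific node by name.
dig @127.0.0.1 -p 8600 nuc-primary.node.consul +short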

Traefik

Lastly, to tie together the public part of this problem, I needed a way to expose the services. This is where I pulled in Traefik to automatically add and remove services from its catalog of routers. Conveniently, Traefik knows how to talk to Consul to get a list of services and, more importantly, where they are located, so it can act as a reverse proxy and route incoming requests to the right service seamlessly.
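
Hooking Traefik up to Consul is mostly a matter of enabling its Consul Catalog provider in the static configuration. A minimal sketch of the relevant piece, assuming a traefik.yml with a single HTTP entrypoint and services opting in via tags:

# traefik.yml (static configuration, sketch)
entryPoints:
  web:
    address: ":80"
providers:
  consulCatalog:
    # Only route services that explicitly set traefik.enable=true.
    exposedByDefault: false
    endpoint:
      # The local Consul agent's HTTP API.
      address: "127.0.0.1:8500"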

Maintaining application state across multiple nodes

GlusterFS

Next I had to find a way to maintain Home Assistant’s configuration across multiple nodes at the same time. I have a whole slew of configuration files which need to be kept up-to-date in order for automations to continue to work. Home Assistant also keeps a few key files which it uses to maintain its own internal state of the entities it knows about. I didn’t want to rely on a single NAS, since the whole reason for going down this path was to remove single points of failure.

I eventually found GlusterFS, a self-hosted distributed file store. I installed it onto my primary server, an Intel NUC, and a few Raspberry Pis left over from other projects. I peered them together with gluster peer probe <consul node hostname> and set up a replicated volume so that each node had all the files it needed locally and would sync file operations out to the cluster whenever any one of them changed. It’s important to note that I am not going for a highly available installation of Home Assistant but a highly resilient one, at least on a per-node basis.
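
For the curious, creating the replicated volume looked roughly like the commands below. The hostnames and brick path are placeholders for whatever your nodes and disks are actually called:

# Peer the nodes together, then build a volume replicated across all three.
gluster peer probe pi-node-1
gluster peer probe pi-node-2
gluster volume create homeassistant replica 3 \
    nuc-primary:/data/glusterfs/homeassistant/brick \
    pi-node-1:/data/glusterfs/homeassistant/brick \
    pi-node-2:/data/glusterfs/homeassistant/brick
gluster volume start homeassistant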

I mounted the replicated volumes to the same places on each node and copied the required Home Assistant configuration into a service-specific directory. I then logged into the other machines just to make sure everything was looking golden, and sure enough, all the files were there.
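
Mounting ends up being a one-liner per node, plus an fstab entry so the volume comes back after a reboot. The mount point and hostname below are placeholders:

# Mount the replicated volume; any peer's hostname works as the source.
sudo mkdir -p /mnt/gluster
sudo mount -t glusterfs nuc-primary:/homeassistant /mnt/gluster
# Roughly the matching /etc/fstab entry:
# nuc-primary:/homeassistant  /mnt/gluster  glusterfs  defaults,_netdev  0  0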

Rescheduling workloads on nodes which are no longer available

Nomad

Enter Nomad, the task scheduler I didn’t know I needed. Nomad’s entire role in this setup is to know how much capacity each node has and to pack workloads efficiently onto the cluster as a whole. It also notices when workloads are no longer running on a node and can reschedule them to run elsewhere.

Setup for Nomad looks a lot like the setup for Consul, but there are a few things I thought I’d call out. Since my DigitalOcean instance obviously doesn’t have direct access to the devices within my home, it is connected via Tailscale and used to expose services publicly. If you are familiar with Cloudflare Tunnel, this is essentially a self-made, private version of that service. I then use it to expose any of the services that I want made available publicly.
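
The one Nomad-specific piece worth showing is the client metadata that the job constraint further down matches against. On the machines physically in my house the client stanza looks roughly like this sketch, while the DigitalOcean droplet simply doesn’t get the "home" value:

# /etc/nomad.d/client.hcl (sketch)
client {
  enabled = true
  # User-defined metadata that job constraints can match on.
  meta {
    region = "home"
  }
}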

I set up a Home Assistant Nomad job file that constrains the job to only run on nodes tagged with the “home” region. I then expose a service named “homeassistant”, which lets Consul know about the service and in turn lets Traefik know to expose it.

job "homeassistant" {
    # ...
    group "homeassistant" {
        count = 1
        constraint {
            attribute = "${meta.region}"
            value = "home"
        }
        network {
            port "homeassistant" {
                to = 8123
                static = 8123
            }
        }
        task "homeassistant" {
            driver = "docker"
            config {
                network_mode = "host"
                image = "homeassistant/home-assistant:latest"
                privileged = true
            }
            service {
                name = "${TASK}"
                port = "homeassistant"
                tags = [
                  "traefik.enable=true"
                ]
            }
        }
    }
    # ...
}
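
Getting the job onto the cluster is then just a matter of handing that file to Nomad; the filename is whatever you saved the job as.

# Submit (or update) the job on the cluster.
nomad job run homeassistant.nomad
# See which node the allocation landed on.
nomad job status homeassistant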

Now I am well aware that maintaining a whole cluster of machines for the pleasure of turning a few light bulbs on and off is overkill, but if this were only about turning lights on and off I would have stopped the whole endeavour long ago. I’ll have a few other posts that go into more detail about each aspect of the setup, especially since Home Assistant has some hardware constraints in my installation, like Zigbee radios, which always need to be available.