1. Purpose

This note documents the technical changes applied to the Fairhaven Raspberry Pi (SSD root filesystem) to improve remote recoverability after reboots and partial failures.

The driving failure modes were:

  • Root filesystem occasionally booting read-only (RO)
  • Docker and/or Tailscale failing early and requiring manual recovery
  • Risk of remote lockout if Tailscale controlled DNS during a failure

Goal:

After a soft reboot (or watchdog reset), the system should automatically return to:

  • RW root filesystem
  • Docker running
  • Home Assistant + Mosquitto containers running
  • Tailscale reachable

If Home Assistant becomes unresponsive, automated recovery should restart HA, and if needed reboot the host in a controlled way.

2. Baseline system components

  • Raspberry Pi booting from an SSD (root on a USB-SATA bridge)
  • Docker, running two containers: Home Assistant (homeassistant) and Mosquitto (mosquitto)
  • Tailscale for remote access
  • NetworkManager, which provides DNS (/etc/resolv.conf)

3. Changes implemented

3.1 Disable Tailscale DNS takeover

Problem addressed: If Tailscale owns DNS and it cannot connect/log in, DNS resolution can fail, which prevents Tailscale from recovering — causing remote lockout.

Change: Disable Tailscale DNS management.

Command(s):

sudo tailscale set --accept-dns=false
sudo systemctl restart tailscaled

Result / verification:

/etc/resolv.conf is now generated by NetworkManager (not Tailscale).

ls -l /etc/resolv.conf
cat /etc/resolv.conf
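
The check above can also be scripted. The following is a minimal sketch, assuming that Tailscale and NetworkManager each leave an identifying comment line in the resolv.conf they write; the function name resolv_owner is hypothetical and not part of the deployed system.

```shell
# Classify who generated a resolv.conf file by inspecting its comment header.
# Assumption: Tailscale and NetworkManager each write an identifying comment
# line; a hand-edited or otherwise generated file may carry neither.
resolv_owner() {
    file="${1:-/etc/resolv.conf}"
    if grep -qi 'generated by tailscale' "$file" 2>/dev/null; then
        echo tailscale
    elif grep -qi 'generated by NetworkManager' "$file" 2>/dev/null; then
        echo networkmanager
    else
        echo unknown
    fi
}
```

Running resolv_owner with no argument inspects /etc/resolv.conf; after the change in this section the expected answer is networkmanager.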

3.2 Harden Docker service restart behaviour by removing start-rate limiting

Problem addressed: Docker could fail during early boot (especially if / is RO) and then become "stuck failed" due to systemd start-rate limits, requiring manual systemctl reset-failed docker.

Change: Disable systemd start-rate limiting for Docker and keep a conservative restart delay.

File:

/etc/systemd/system/docker.service.d/override.conf

Intended contents:

[Unit]
StartLimitIntervalSec=0
StartLimitBurst=0

[Service]
Restart=always
RestartSec=10

Apply / reload:

sudo systemctl daemon-reload
sudo systemctl restart docker

Verification (properties on this system):

systemctl show docker -p StartLimitIntervalUSec -p StartLimitBurst -p Restart -p RestartUSec

Expected:

StartLimitIntervalUSec=0
StartLimitBurst=0
Restart=always
RestartUSec=10s
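
To avoid comparing the properties by eye, the output can be piped through a small checker. This is a sketch only; check_props is a hypothetical helper, not something installed on the system.

```shell
# Compare "KEY=VALUE" lines read from stdin against the expected pairs
# given as arguments. Prints one line per missing pair and returns
# non-zero if any expected pair was not found verbatim.
check_props() {
    actual="$(cat)"
    rc=0
    for want in "$@"; do
        if ! printf '%s\n' "$actual" | grep -qxF -- "$want"; then
            echo "MISMATCH: expected $want"
            rc=1
        fi
    done
    return "$rc"
}
```

Example use: systemctl show docker -p StartLimitIntervalUSec -p StartLimitBurst -p Restart -p RestartUSec | check_props StartLimitIntervalUSec=0 StartLimitBurst=0 Restart=always RestartUSec=10s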

3.3 Ensure HA + Mosquitto containers restart automatically

Problem addressed: Containers might not come back automatically after Docker restarts or after a host reboot.

Change: Set container restart policy to unless-stopped.

Command(s):

docker update --restart unless-stopped homeassistant mosquitto

Verification:

docker inspect -f '{{.Name}} -> {{.HostConfig.RestartPolicy.Name}}' homeassistant mosquitto

Expected:

/homeassistant -> unless-stopped

/mosquitto -> unless-stopped

3.4 Add early boot helper service to remount / RW

Problem addressed: Root sometimes boots RO; early services can fail if / remains RO during their start.

Change: Add a oneshot service that remounts / RW if needed, before Docker/Tailscale start.

File:

/etc/systemd/system/jf-remount-root-rw.service

Contents:

[Unit]
Description=Ensure root filesystem is read-write before starting key services
DefaultDependencies=no
After=local-fs.target
Before=docker.service tailscaled.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'findmnt -no OPTIONS / | grep -qw rw || mount -o remount,rw /'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Enable:

sudo systemctl daemon-reload
sudo systemctl enable jf-remount-root-rw.service

Verification:

systemctl is-active jf-remount-root-rw
findmnt / -o OPTIONS

Expected:

Service active

/ shows rw,...
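
The remount decision depends on whether rw appears as a whole option in the mount options string, not merely as a substring. A minimal sketch of that test in plain shell follows; has_rw_option is a hypothetical name used for illustration.

```shell
# Return success if the comma-separated mount options in $1 contain "rw"
# as a complete option. Wrapping the string in commas lets one pattern
# match the option at the start, middle, or end of the list.
has_rw_option() {
    case ",$1," in
        *,rw,*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

Example use: has_rw_option "$(findmnt -no OPTIONS /)" || echo "root is not read-write"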

3.5 Hardware watchdog enabled (kernel-level auto-reboot)

Problem addressed: If the Pi locks up (kernel/userspace hang), a soft reboot may be impossible remotely. A watchdog allows automatic recovery.

Change: Enable BCM2835 hardware watchdog and configure systemd watchdog feeding.

Verification evidence from the live system:

ls -l /dev/watchdog*
systemctl show -p RuntimeWatchdogUSec -p RebootWatchdogUSec
dmesg | grep -i watchdog | tail -n 50

Expected:

/dev/watchdog0 exists

systemd reports using hardware watchdog

RuntimeWatchdogUSec non-zero (e.g. ~1 minute)

RebootWatchdogUSec configured (e.g. 2 minutes)
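
This note records only the verification, not the configuration itself. One common way to get systemd to feed the hardware watchdog is a manager drop-in, sketched below. The drop-in file name and the 1min/2min values are assumptions taken from the example values above, not copied from the live system. Some Raspberry Pi OS releases may also need dtparam=watchdog=on in the firmware config; that is likewise not evidenced here.

```shell
# Write a systemd manager drop-in that turns on hardware watchdog feeding.
# Assumptions: the file name 10-watchdog.conf and the timeout values are
# illustrative; adjust them to match the intended policy.
write_watchdog_dropin() {
    dir="${1:-/etc/systemd/system.conf.d}"
    mkdir -p "$dir" || return 1
    cat > "$dir/10-watchdog.conf" <<'EOF'
[Manager]
RuntimeWatchdogSec=1min
RebootWatchdogSec=2min
EOF
}
```

Run as root with no argument to write into /etc/systemd/system.conf.d. The manager typically picks up the new values after a reboot or systemctl daemon-reexec.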

3.6 Home Assistant local healthcheck with staged recovery (restart HA; reboot host if prolonged failure)

Problem addressed: HA can become unresponsive while Linux is still alive. Hardware watchdog won’t help in that case.

Change:

A script checks http://127.0.0.1:8123/api/ every minute.

If HA is down:

  • after 3 consecutive failures: restart HA container (with cooldown)
  • after prolonged failure: reboot host (with cooldown and loop protection)

Script file:

/usr/local/sbin/jf_ha_healthcheck.sh

Systemd units:

/etc/systemd/system/jf-ha-healthcheck.service

/etc/systemd/system/jf-ha-healthcheck.timer

Health definition:

HA is considered "alive" if the local endpoint returns: 2xx, 3xx, 401, or 403. (Note: HA commonly returns 401 at /api/ without auth; this counts as healthy.)
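
The health rule can be expressed in a few lines of shell. This is a sketch of the rule as stated, not the contents of jf_ha_healthcheck.sh; the function names and the 10-second timeout are assumptions.

```shell
# Return success if an HTTP status code counts as "alive" under the rule
# above: any 2xx or 3xx code, plus 401 and 403.
ha_code_is_alive() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]|401|403) return 0 ;;
        *) return 1 ;;
    esac
}

# Probe the endpoint and print only the HTTP status code. curl prints 000
# when no HTTP response was received (for example, connection refused).
# Assumption: the 10-second timeout is illustrative.
ha_probe() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
        "${1:-http://127.0.0.1:8123/api/}"
}
```

Example use: ha_code_is_alive "$(ha_probe)" && echo alive || echo down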

Key policy values (as coded):

Parameter                 Value   Meaning
FAILS_TO_RESTART          3       restart HA container after 3 consecutive failures
RESTART_COOLDOWN_SECS     600     at most one container restart per 10 minutes
FAILS_TO_REBOOT           15      reboot host after ~15 minutes of continuous failures
POST_RESTART_GRACE_SECS   300     wait 5 minutes after container restart before allowing reboot
REBOOT_COOLDOWN_SECS      3600    at most one reboot per 60 minutes
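
How these values interact is easier to see as a decision function. The sketch below is one plausible reading of the policy, not the deployed script. It assumes the reboot condition is checked first and falls back to a container restart when a reboot is blocked, and it does not model the loop protection. decide_action and its arguments are hypothetical names.

```shell
# Policy values as listed above.
FAILS_TO_RESTART=3
RESTART_COOLDOWN_SECS=600
FAILS_TO_REBOOT=15
POST_RESTART_GRACE_SECS=300
REBOOT_COOLDOWN_SECS=3600

# decide_action FAILS SECS_SINCE_RESTART SECS_SINCE_REBOOT
# Prints "reboot", "restart", or "none".
# Assumption: reboot is evaluated before restart; the real script may
# order or combine these checks differently.
decide_action() {
    fails=$1; since_restart=$2; since_reboot=$3
    if [ "$fails" -ge "$FAILS_TO_REBOOT" ] &&
       [ "$since_restart" -ge "$POST_RESTART_GRACE_SECS" ] &&
       [ "$since_reboot" -ge "$REBOOT_COOLDOWN_SECS" ]; then
        echo reboot
    elif [ "$fails" -ge "$FAILS_TO_RESTART" ] &&
         [ "$since_restart" -ge "$RESTART_COOLDOWN_SECS" ]; then
        echo restart
    else
        echo none
    fi
}
```

For example, decide_action 3 900 7200 prints restart, while decide_action 15 400 7200 prints reboot because the 5-minute grace period after the last container restart has passed and no reboot happened in the last hour.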

Verification / logs:

systemctl list-timers --all | grep jf-ha-healthcheck
journalctl -t jf_ha_healthcheck --no-pager | tail -n 80

4. Test result

A soft reboot was performed and the system returned to:

  • / mounted RW
  • Tailscale up and reachable
  • Docker active
  • HA + Mosquitto containers running

5. Notes / items discussed but not implemented in this note

Subnet routing via Teltonika (RUT240) to provide out-of-band Tailscale access even if Pi-Tailscale fails — recommended, but not evidenced as implemented here.

Root-cause analysis of recurring RO boot events (USB bridge/cable/power/storage) remains separate work; this set of changes improves recovery behaviour and reduces remote lockout risk.

Appendix A — Quick Audit Commands


Root status

findmnt / -o SOURCE,FSTYPE,OPTIONS

Core services

systemctl is-active jf-remount-root-rw
systemctl is-active docker
systemctl is-active tailscaled

Containers

docker ps
docker inspect -f '{{.Name}} -> {{.HostConfig.RestartPolicy.Name}}' homeassistant mosquitto

Tailscale DNS preference

sudo tailscale set --accept-dns=false    # re-applies the setting rather than reading it; safe to repeat
ls -l /etc/resolv.conf
cat /etc/resolv.conf

Watchdog

ls -l /dev/watchdog*
systemctl show -p RuntimeWatchdogUSec -p RebootWatchdogUSec
dmesg | grep -i watchdog | tail -n 30

HA healthcheck

systemctl list-timers --all | grep jf-ha-healthcheck
journalctl -t jf_ha_healthcheck --no-pager | tail -n 60



Page last modified on February 23, 2026, at 12:41 pm