1. Purpose
This note documents the technical changes applied to the Fairhaven Raspberry Pi (SSD root filesystem) to improve remote recoverability after reboots and partial failures.
The driving failure modes were:
- Root filesystem occasionally booting read-only (RO)
- Docker and/or Tailscale failing early and requiring manual recovery
- Risk of remote lockout if Tailscale controlled DNS during a failure
Goal:
After a soft reboot (or watchdog reset), the system should automatically return to:
- RW root filesystem
- Docker running
- Home Assistant + Mosquitto containers running
- Tailscale reachable
If Home Assistant becomes unresponsive, automated recovery should restart HA, and if needed reboot the host in a controlled way.
2. Baseline system components
Raspberry Pi booting from an SSD (root on a USB-SATA bridge)
Docker running:
- Home Assistant container homeassistant
- Mosquitto container mosquitto
Tailscale for remote access
NetworkManager provides DNS (/etc/resolv.conf)
3. Changes implemented
3.1 Disable Tailscale DNS takeover
Problem addressed: If Tailscale owns DNS and it cannot connect/log in, DNS resolution can fail, which prevents Tailscale from recovering — causing remote lockout.
Change: Disable Tailscale DNS management.
Command(s):
sudo tailscale set --accept-dns=false
sudo systemctl restart tailscaled
Result / verification:
/etc/resolv.conf is now generated by NetworkManager (not Tailscale).
ls -l /etc/resolv.conf
cat /etc/resolv.conf
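The verification above can also be scripted. This is a minimal sketch, not part of the deployed system: it assumes each generator leaves its usual header comment in the file it writes ("# Generated by NetworkManager", or a line containing "generated by tailscale"). The function name resolv_conf_owner is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report which tool appears to have written a resolv.conf.
# Assumption: the generators leave their usual header comments in the file.
resolv_conf_owner() {
    file="${1:-/etc/resolv.conf}"
    if grep -qi 'generated by tailscale' "$file"; then
        echo tailscale
    elif grep -qi 'generated by NetworkManager' "$file"; then
        echo networkmanager
    else
        echo unknown
    fi
}
```

Calling resolv_conf_owner with no argument inspects /etc/resolv.conf; after the change in this section it would be expected to print networkmanager.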
3.2 Harden Docker service restart behaviour by removing start-rate limiting
Problem addressed: Docker could fail during early boot (especially if / is RO) and then become "stuck failed" due to systemd start-rate limits, requiring manual systemctl reset-failed docker.
Change: Disable systemd start-rate limiting for Docker and keep a conservative restart delay.
File:
/etc/systemd/system/docker.service.d/override.conf
Intended contents:
[Unit]
StartLimitIntervalSec=0
StartLimitBurst=0
[Service]
Restart=always
RestartSec=10
Apply / reload:
sudo systemctl daemon-reload
sudo systemctl restart docker
Verification (properties on this system):
systemctl show docker -p StartLimitIntervalUSec -p StartLimitBurst -p Restart -p RestartUSec
Expected:
StartLimitIntervalUSec=0
StartLimitBurst=0
Restart=always
RestartUSec=10s
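To make this comparison repeatable, the expected values can be checked mechanically. The sketch below is illustrative only: check_docker_props is a hypothetical name, and it reads the KEY=VALUE lines that systemctl show prints on stdin rather than calling systemctl itself.

```shell
#!/bin/sh
# Hypothetical helper: read `systemctl show` output (KEY=VALUE lines) on stdin
# and fail if any expected property is missing or has a different value.
check_docker_props() {
    props=$(cat)
    rc=0
    for want in \
        "StartLimitIntervalUSec=0" \
        "StartLimitBurst=0" \
        "Restart=always" \
        "RestartUSec=10s"
    do
        # -x: whole-line match, -F: fixed string (no regex interpretation)
        if ! printf '%s\n' "$props" | grep -qxF "$want"; then
            echo "MISMATCH: expected $want" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Usage: systemctl show docker -p StartLimitIntervalUSec -p StartLimitBurst -p Restart -p RestartUSec | check_docker_props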
3.3 Ensure HA + Mosquitto containers restart automatically
Problem addressed: Containers might not come back automatically after Docker restarts or after a host reboot.
Change: Set container restart policy to unless-stopped.
Command(s):
docker update --restart unless-stopped homeassistant mosquitto
Verification:
docker inspect -f '{{.Name}} -> {{.HostConfig.RestartPolicy.Name}}' homeassistant mosquitto
Expected:
/homeassistant -> unless-stopped
/mosquitto -> unless-stopped
3.4 Add early boot helper service to remount / RW
Problem addressed: Root sometimes boots RO; early services can fail if / remains RO during their start.
Change: Add a oneshot service that remounts / RW if needed, before Docker/Tailscale start.
File:
/etc/systemd/system/jf-remount-root-rw.service
Contents:
[Unit]
Description=Ensure root filesystem is read-write before starting key services
DefaultDependencies=no
After=local-fs.target
Before=docker.service tailscaled.service
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'findmnt -no OPTIONS / | grep -q "\<rw\>" || mount -o remount,rw /'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
Enable:
sudo systemctl daemon-reload
sudo systemctl enable jf-remount-root-rw.service
Verification:
systemctl is-active jf-remount-root-rw
findmnt / -o OPTIONS
Expected:
Service active
/ shows rw,...
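The "is / mounted RW?" test needs to match rw as a complete mount option, so that values such as errors=remount-ro are not misread. A minimal sketch of that test as a reusable function follows; opts_are_rw is a hypothetical name and is not part of the installed unit.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a comma-separated mount option
# string contains "rw" as a complete option.
opts_are_rw() {
    # Wrap the string in commas so the first and last options match too.
    case ",$1," in
        *,rw,*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

Usage: opts_are_rw "$(findmnt -no OPTIONS /)" && echo "root is RW"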
3.5 Hardware watchdog enabled (kernel-level auto-reboot)
Problem addressed: If the Pi locks up (kernel/userspace hang), a soft reboot may be impossible remotely. A watchdog allows automatic recovery.
Change: Enable BCM2835 hardware watchdog and configure systemd watchdog feeding.
Verification evidence from the live system:
ls -l /dev/watchdog*
systemctl show -p RuntimeWatchdogUSec -p RebootWatchdogUSec
dmesg | grep -i watchdog | tail -n 50
Expected:
/dev/watchdog0 exists
systemd reports using hardware watchdog
RuntimeWatchdogUSec non-zero (e.g. ~1 minute)
RebootWatchdogUSec configured (e.g. 2 minutes)
3.6 Home Assistant local healthcheck with staged recovery (restart HA; reboot host if prolonged failure)
Problem addressed: HA can become unresponsive while Linux is still alive. Hardware watchdog won’t help in that case.
Change:
A script checks http://127.0.0.1:8123/api/ every minute.
If HA is down:
- after 3 consecutive failures: restart HA container (with cooldown)
- after prolonged failure: reboot host (with cooldown and loop protection)
Script file:
/usr/local/sbin/jf_ha_healthcheck.sh
Systemd units:
/etc/systemd/system/jf-ha-healthcheck.service
/etc/systemd/system/jf-ha-healthcheck.timer
Health definition:
HA is considered "alive" if the local endpoint returns: 2xx, 3xx, 401, or 403. (Note: HA commonly returns 401 at /api/ without auth; this counts as healthy.)
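The health definition above can be expressed as a small classifier. This is a sketch and not the contents of jf_ha_healthcheck.sh: the function names and the 10-second curl timeout are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper mirroring the health definition above:
# 2xx, 3xx, 401 and 403 count as alive; anything else counts as down.
ha_code_is_alive() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]|401|403) return 0 ;;
        *) return 1 ;;
    esac
}

# Probe the local endpoint and print the HTTP status code.
# curl prints 000 when no HTTP response arrives (connection refused,
# timeout), which the classifier treats as down.
# Assumption: a 10-second overall timeout (-m 10).
probe_ha() {
    curl -s -o /dev/null -m 10 -w '%{http_code}' http://127.0.0.1:8123/api/
}
```

Usage: ha_code_is_alive "$(probe_ha)" && echo "HA alive" || echo "HA down"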
Key policy values (as coded):
| Parameter | Value | Meaning |
|---|---|---|
| FAILS_TO_RESTART | 3 | restart HA container after 3 consecutive failures |
| RESTART_COOLDOWN_SECS | 600 | at most one container restart per 10 minutes |
| FAILS_TO_REBOOT | 15 | reboot host after ~15 minutes of continuous failures |
| POST_RESTART_GRACE_SECS | 300 | wait 5 minutes after container restart before allowing reboot |
| REBOOT_COOLDOWN_SECS | 3600 | at most one reboot per 60 minutes |
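How these values interact can be sketched as a pure decision function. The policy values are taken from the table; the function name, its arguments, and the order in which the conditions are evaluated are assumptions. The actual script may differ, and any additional reboot-loop protection it contains is not modelled here.

```shell
#!/bin/sh
# Policy values copied from the table above.
FAILS_TO_RESTART=3
RESTART_COOLDOWN_SECS=600
FAILS_TO_REBOOT=15
POST_RESTART_GRACE_SECS=300
REBOOT_COOLDOWN_SECS=3600

# Hypothetical decision function. Arguments (all integers):
#   $1 consecutive failures, $2 current time (epoch seconds),
#   $3 time of last container restart, $4 time of last host reboot.
# Prints one of: none, restart, reboot.
ha_next_action() {
    fails=$1; now=$2; last_restart=$3; last_reboot=$4
    since_restart=$((now - last_restart))
    since_reboot=$((now - last_reboot))

    # Reboot only after prolonged failure, outside the post-restart grace
    # period, and outside the reboot cooldown.
    if [ "$fails" -ge "$FAILS_TO_REBOOT" ] &&
       [ "$since_restart" -ge "$POST_RESTART_GRACE_SECS" ] &&
       [ "$since_reboot" -ge "$REBOOT_COOLDOWN_SECS" ]; then
        echo reboot
    # Otherwise restart the container, at most once per cooldown window.
    elif [ "$fails" -ge "$FAILS_TO_RESTART" ] &&
         [ "$since_restart" -ge "$RESTART_COOLDOWN_SECS" ]; then
        echo restart
    else
        echo none
    fi
}
```

For example, ha_next_action 3 100000 0 0 prints restart, while ha_next_action 5 100000 99900 0 prints none because the last restart was only 100 seconds earlier.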
Verification / logs:
systemctl list-timers --all | grep jf-ha-healthcheck
journalctl -t jf_ha_healthcheck --no-pager | tail -n 80
4. Test result
A soft reboot was performed and the system returned to:
/ mounted RW
Tailscale up and reachable
Docker active
HA + Mosquitto containers running
5. Notes / items discussed but not implemented in this note
Subnet routing via Teltonika (RUT240) to provide out-of-band Tailscale access even if Tailscale on the Pi fails. Recommended, but there is no evidence in this note that it has been implemented.
Root-cause analysis of recurring RO boot events (USB bridge/cable/power/storage) remains separate work; this set of changes improves recovery behaviour and reduces remote lockout risk.
Appendix A — Quick Audit Commands
Root status
findmnt / -o SOURCE,FSTYPE,OPTIONS
Core services
systemctl is-active jf-remount-root-rw
systemctl is-active docker
systemctl is-active tailscaled
Containers
docker ps
docker inspect -f '{{.Name}} -> {{.HostConfig.RestartPolicy.Name}}' homeassistant mosquitto
Tailscale DNS preference
sudo tailscale set --accept-dns=false
ls -l /etc/resolv.conf
cat /etc/resolv.conf
Watchdog
ls -l /dev/watchdog*
systemctl show -p RuntimeWatchdogUSec -p RebootWatchdogUSec
dmesg | grep -i watchdog | tail -n 30
HA healthcheck
systemctl list-timers --all | grep jf-ha-healthcheck
journalctl -t jf_ha_healthcheck --no-pager | tail -n 60
