Raspberry Pi Remote Recovery Hardening

1. Purpose

This note documents the technical changes applied to the Fairhaven Raspberry Pi (SSD root filesystem) to improve remote recoverability after reboots and partial failures.

The driving failure modes were:

Root filesystem occasionally booting as read-only (RO)

Docker and/or Tailscale failing early and requiring manual recovery

Risk of remote lockout if Tailscale controlled DNS during a failure

Goal:

After a soft reboot (or watchdog reset), the system should automatically return to:

RW root filesystem
Docker running
Home Assistant + Mosquitto containers running
Tailscale reachable

If Home Assistant becomes unresponsive, automated recovery should restart the HA container and, if that does not help, reboot the host in a controlled way.

2. Baseline system components

Raspberry Pi booting from an SSD (root on a USB-SATA bridge)

Docker running:

Home Assistant container homeassistant
Mosquitto container mosquitto

Tailscale for remote access

NetworkManager provides DNS (/etc/resolv.conf)

3. Changes implemented

3.1 Disable Tailscale DNS takeover

Problem addressed: If Tailscale owns DNS and cannot connect or log in, DNS resolution can fail, which in turn prevents Tailscale from recovering and causes remote lockout.

Change: Disable Tailscale DNS management.

Command(s):

sudo tailscale set --accept-dns=false
sudo systemctl restart tailscaled

Result / verification:

/etc/resolv.conf is now generated by NetworkManager (not Tailscale).

ls -l /etc/resolv.conf
cat /etc/resolv.conf
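
The manual check above can be scripted. The following is a minimal sketch, assuming the header comments that Tailscale and NetworkManager normally write at the top of the file they generate; the function name resolv_conf_owner is hypothetical and the marker strings should be confirmed against the files on this system.

```shell
# Report which tool appears to have written a resolv.conf-style file.
# Assumption: the header comments matched below are the ones these tools
# normally write; verify against the real files before relying on this.
resolv_conf_owner() {
    file="${1:-/etc/resolv.conf}"
    if grep -qi 'generated by tailscale' "$file" 2>/dev/null; then
        echo tailscale
    elif grep -qi 'generated by NetworkManager' "$file" 2>/dev/null; then
        echo networkmanager
    else
        echo unknown
    fi
}
```

Called with no argument it inspects /etc/resolv.conf; after the change in this section the expected answer is networkmanager.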

3.2 Harden Docker service restart behaviour by removing start-rate limiting

Problem addressed: Docker could fail during early boot (especially if / is RO) and then become "stuck failed" due to systemd start-rate limits, requiring manual systemctl reset-failed docker.

Change: Disable systemd start-rate limiting for Docker and keep a conservative restart delay.

File:

/etc/systemd/system/docker.service.d/override.conf

Intended contents:

[Unit]
StartLimitIntervalSec=0
StartLimitBurst=0

[Service]
Restart=always
RestartSec=10

Apply / reload:

sudo systemctl daemon-reload
sudo systemctl restart docker

Verification (properties on this system):

systemctl show docker -p StartLimitIntervalUSec -p StartLimitBurst -p Restart -p RestartUSec

Expected:

StartLimitIntervalUSec=0

StartLimitBurst=0

Restart=always

RestartUSec=10s
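
The comparison against the expected values can be automated. This is a sketch only: the function name check_docker_unit_props is hypothetical, and the expected values are simply the four listed above.

```shell
# Read KEY=VALUE lines (as printed by `systemctl show`) on stdin and
# compare them with the expected values listed in this section.
# Prints one line per mismatch; returns non-zero if any value differs
# or is missing.
check_docker_unit_props() {
    props=$(cat)
    rc=0
    for want in \
        StartLimitIntervalUSec=0 \
        StartLimitBurst=0 \
        Restart=always \
        RestartUSec=10s
    do
        # -x: match the whole line, -F: treat the pattern as a fixed string
        if ! printf '%s\n' "$props" | grep -qxF "$want"; then
            echo "MISMATCH: expected $want"
            rc=1
        fi
    done
    return "$rc"
}

# Usage:
#   systemctl show docker -p StartLimitIntervalUSec -p StartLimitBurst \
#       -p Restart -p RestartUSec | check_docker_unit_props
```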

3.3 Ensure HA + Mosquitto containers restart automatically

Problem addressed: Containers might not come back automatically after Docker restarts or after a host reboot.

Change: Set container restart policy to unless-stopped.

Command(s):

docker update --restart unless-stopped homeassistant mosquitto

Verification:

docker inspect -f '{{.Name}} -> {{.HostConfig.RestartPolicy.Name}}' homeassistant mosquitto

Expected:

/homeassistant -> unless-stopped

/mosquitto -> unless-stopped
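
For unattended audits, the inspect output can be piped through a small filter. A sketch, assuming the exact "/name -> policy" format produced by the command above; check_restart_policy is a hypothetical name.

```shell
# Read "/name -> policy" lines on stdin and print every container whose
# restart policy is not unless-stopped. Returns non-zero if any such
# container is found, or if no matching input line arrived at all.
check_restart_policy() {
    awk -F' -> ' '
        NF == 2 {
            seen++
            if ($2 != "unless-stopped") {
                print $1 " has policy " $2
                bad++
            }
        }
        END { if (bad > 0 || seen == 0) exit 1 }
    '
}

# Usage:
#   docker inspect -f '{{.Name}} -> {{.HostConfig.RestartPolicy.Name}}' \
#       homeassistant mosquitto | check_restart_policy
```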

3.4 Add early boot helper service to remount / RW

Problem addressed: Root sometimes boots RO; early services can fail if / remains RO during their start.

Change: Add a oneshot service that remounts / RW if needed, before Docker/Tailscale start.

File:

/etc/systemd/system/jf-remount-root-rw.service

Contents:

[Unit]
Description=Ensure root filesystem is read-write before starting key services
DefaultDependencies=no
After=local-fs.target
Before=docker.service tailscaled.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'findmnt -no OPTIONS / | grep -qw rw || mount -o remount,rw /'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Enable:

sudo systemctl daemon-reload
sudo systemctl enable jf-remount-root-rw.service

Verification:

systemctl is-active jf-remount-root-rw
findmnt / -o OPTIONS

Expected:

Service active

/ shows rw,...
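
The "remount only if needed" test can also be written without grep. The sketch below is an illustration, not the content of the installed unit; the function names are hypothetical. It treats the findmnt output as a comma-separated list and looks for rw as a complete option.

```shell
# Return 0 if a comma-separated mount options string contains "rw" as a
# whole option, 1 otherwise. Wrapping the string in commas means only a
# complete option matches, never a fragment of a longer option.
has_rw_option() {
    case ",$1," in
        *,rw,*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Remount / read-write only when it is not already rw.
# Uses the same findmnt and mount calls as the unit in this section.
ensure_root_rw() {
    opts=$(findmnt -no OPTIONS /) || return 1
    has_rw_option "$opts" || mount -o remount,rw /
}
```

has_rw_option "ro,errors=remount-ro" returns 1, so ensure_root_rw would attempt the remount in that case.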

3.5 Hardware watchdog enabled (kernel-level auto-reboot)

Problem addressed: If the Pi locks up (kernel/userspace hang), a soft reboot may be impossible remotely. A watchdog allows automatic recovery.

Change: Enable BCM2835 hardware watchdog and configure systemd watchdog feeding.

Verification evidence from the live system:

ls -l /dev/watchdog*
systemctl show -p RuntimeWatchdogUSec -p RebootWatchdogUSec
dmesg | grep -i watchdog | tail -n 50

Expected:

/dev/watchdog0 exists

systemd reports using hardware watchdog

RuntimeWatchdogUSec non-zero (e.g. ~1 minute)

RebootWatchdogUSec configured (e.g. 2 minutes)
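
This note records only the verification for the watchdog, not the configuration itself. As a reference, the sketch below prints a systemd manager drop-in of the kind commonly used for watchdog feeding. Everything in it is an assumption: the function name, the drop-in file name in the usage comment, and the default values, which simply mirror the example figures above. On Raspberry Pi OS the hardware side is commonly enabled with dtparam=watchdog=on in the firmware config.txt; whether that was needed on this system is not recorded here.

```shell
# Print a systemd manager drop-in that enables watchdog feeding.
# Arguments are systemd time spans; the defaults mirror the example
# values in this section and are assumptions, not the recorded config.
render_watchdog_dropin() {
    runtime="${1:-1min}"
    reboot="${2:-2min}"
    printf '[Manager]\nRuntimeWatchdogSec=%s\nRebootWatchdogSec=%s\n' \
        "$runtime" "$reboot"
}

# Usage (hypothetical file name):
#   sudo mkdir -p /etc/systemd/system.conf.d
#   render_watchdog_dropin | sudo tee /etc/systemd/system.conf.d/10-watchdog.conf
#   sudo systemctl daemon-reexec
```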

3.6 Home Assistant local healthcheck with staged recovery (restart HA; reboot host if prolonged failure)

Problem addressed: HA can become unresponsive while Linux is still alive. Hardware watchdog won’t help in that case.

Change:

A script checks http://127.0.0.1:8123/api/ every minute.

If HA is down:

after 3 consecutive failures: restart HA container (with cooldown)
after prolonged failure: reboot host (with cooldown and loop protection)

Script file:

/usr/local/sbin/jf_ha_healthcheck.sh

Systemd units:

/etc/systemd/system/jf-ha-healthcheck.service

/etc/systemd/system/jf-ha-healthcheck.timer

Health definition:

HA is considered "alive" if the local endpoint returns: 2xx, 3xx, 401, or 403. (Note: HA commonly returns 401 at /api/ without auth; this counts as healthy.)
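
The health definition maps directly onto a status-code test. The following is a sketch of that classification, not an excerpt from jf_ha_healthcheck.sh; the function names and the 5-second timeout are assumptions.

```shell
# Classify an HTTP status code using the health definition above:
# 2xx, 3xx, 401 and 403 count as alive; anything else counts as down,
# including the 000 that curl prints when the connection itself fails.
ha_code_is_alive() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]|401|403) return 0 ;;
        *) return 1 ;;
    esac
}

# Probe the local endpoint and print only the status code.
# Assumption: curl is installed; the timeout value is illustrative.
ha_probe() {
    curl -s -o /dev/null -m 5 -w '%{http_code}' http://127.0.0.1:8123/api/
}

# Usage:
#   if ha_code_is_alive "$(ha_probe)"; then echo alive; else echo down; fi
```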

Key policy values (as coded):

Parameter                Value  Meaning
FAILS_TO_RESTART         3      restart HA container after 3 consecutive failures
RESTART_COOLDOWN_SECS    600    at most one container restart per 10 minutes
FAILS_TO_REBOOT          15     reboot host after ~15 minutes of continuous failures
POST_RESTART_GRACE_SECS  300    wait 5 minutes after container restart before allowing reboot
REBOOT_COOLDOWN_SECS     3600   at most one reboot per 60 minutes
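
To make the interaction between these values concrete, here is one possible reading of the staged policy as a pure decision function. It is a sketch, not the installed script: the function name, the argument order and the exact precedence of the conditions are assumptions, and the loop protection mentioned earlier is represented only by the reboot cooldown.

```shell
FAILS_TO_RESTART=3
RESTART_COOLDOWN_SECS=600
FAILS_TO_REBOOT=15
POST_RESTART_GRACE_SECS=300
REBOOT_COOLDOWN_SECS=3600

# Usage: ha_decide_action FAILS NOW LAST_RESTART LAST_REBOOT
# FAILS is the number of consecutive failed checks; the other three are
# epoch seconds (0 means "never"). Prints reboot, restart or none.
ha_decide_action() {
    fails=$1 now=$2 last_restart=$3 last_reboot=$4
    if [ "$fails" -ge "$FAILS_TO_REBOOT" ] &&
       [ $((now - last_restart)) -ge "$POST_RESTART_GRACE_SECS" ] &&
       [ $((now - last_reboot)) -ge "$REBOOT_COOLDOWN_SECS" ]; then
        echo reboot
    elif [ "$fails" -ge "$FAILS_TO_RESTART" ] &&
         [ $((now - last_restart)) -ge "$RESTART_COOLDOWN_SECS" ]; then
        echo restart
    else
        echo none
    fi
}
```

With one check per minute, 15 consecutive failures correspond to roughly 15 minutes of downtime. Under this reading, a host that has already restarted the container 400 seconds ago and has not rebooted in the last hour would be rebooted on the 15th failure.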

Verification / logs:

systemctl list-timers --all | grep jf-ha-healthcheck
journalctl -t jf_ha_healthcheck --no-pager | tail -n 80

4. Test result

A soft reboot was performed and the system returned to:

/ mounted RW

Tailscale up and reachable

Docker active

HA + Mosquitto containers running

5. Notes / items discussed but not implemented in this note

Subnet routing via Teltonika (RUT240) to provide out-of-band Tailscale access even if Pi-Tailscale fails — recommended, but not evidenced as implemented here.

Root-cause analysis of recurring RO boot events (USB bridge/cable/power/storage) remains separate work; this set of changes improves recovery behaviour and reduces remote lockout risk.

Appendix A — Quick Audit Commands


Root status

findmnt / -o SOURCE,FSTYPE,OPTIONS

Core services

systemctl is-active jf-remount-root-rw
systemctl is-active docker
systemctl is-active tailscaled

Containers

docker ps
docker inspect -f '{{.Name}} -> {{.HostConfig.RestartPolicy.Name}}' homeassistant mosquitto

Tailscale DNS preference (the first command re-applies the setting rather than reading it back)

sudo tailscale set --accept-dns=false
ls -l /etc/resolv.conf
cat /etc/resolv.conf

Watchdog

ls -l /dev/watchdog*
systemctl show -p RuntimeWatchdogUSec -p RebootWatchdogUSec
dmesg | grep -i watchdog | tail -n 30

HA healthcheck

systemctl list-timers --all | grep jf-ha-healthcheck
journalctl -t jf_ha_healthcheck --no-pager | tail -n 60

Page last modified on February 23, 2026, at 12:41 pm